WorldWideScience

Sample records for science gacp datasets

  1. Extension and statistical analysis of the GACP aerosol optical thickness record

    Science.gov (United States)

    Geogdzhayev, Igor V.; Mishchenko, Michael I.; Li, Jing; Rossow, William B.; Liu, Li; Cairns, Brian

    2015-10-01

    The primary product of the Global Aerosol Climatology Project (GACP) is a continuous record of the aerosol optical thickness (AOT) over the oceans. It is based on channel-1 and -2 radiance data from the Advanced Very High Resolution Radiometer (AVHRR) instruments flown on successive National Oceanic and Atmospheric Administration (NOAA) platforms. We extend the previous GACP dataset by four years through the end of 2009 using NOAA-17 and -18 AVHRR radiances recalibrated against MODerate resolution Imaging Spectroradiometer (MODIS) radiance data, thereby making the GACP record almost three decades long. The temporal overlap of more than three years between the new NOAA-17 record and the previous NOAA-16 record reveals excellent agreement of the corresponding global monthly mean AOT values, confirming the robustness of the vicarious radiance calibration used in the original GACP product. The temporal overlap of the NOAA-17 and -18 instruments is used to introduce a small additive adjustment to the channel-2 calibration of the latter, resulting in a consistent record with increased data density. Principal component analysis (PCA) of the newly extended GACP record shows that most of the volcanic AOT variability can be isolated into one mode responsible for ~12% of the total variance. This conclusion is confirmed by a combined PCA of the GACP, MODIS, and Multi-angle Imaging SpectroRadiometer (MISR) AOTs during the volcano-free period from February 2000 to December 2009. We show that the modes responsible for the tropospheric AOT variability in the three datasets agree well in terms of correlation and spatial patterns. A previously identified negative AOT trend, which started in the late 1980s and continued into the early 2000s, is confirmed. Its magnitude and duration indicate that it was caused by changes in tropospheric aerosols. The latest multi-satellite segment of the GACP record shows that this trend tapered off, with no noticeable AOT change after 2002.
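
    The mode decomposition described above can be reproduced in outline with standard tools. Below is a minimal, hedged sketch (not the GACP processing code): synthetic monthly AOT anomalies on a flattened ocean grid are decomposed with PCA, and the leading mode's explained variance plays the role of the ~12% volcanic mode. All array shapes and values are illustrative.

    ```python
    # Minimal PCA sketch for a gridded AOT record (time x space). Synthetic data.
    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    n_months, n_cells = 336, 2000            # ~28 years of monthly ocean grid cells (illustrative)
    aot = rng.normal(0.15, 0.03, (n_months, n_cells))
    aot[60:96] += 0.1                        # crude stand-in for a volcanic perturbation

    anom = aot - aot.mean(axis=0)            # remove the per-cell climatological mean
    pca = PCA(n_components=5)
    pcs = pca.fit_transform(anom)            # principal-component time series
    print("explained variance ratios:", pca.explained_variance_ratio_)
    # pca.components_[0] is the spatial pattern of the leading mode; its
    # explained_variance_ratio_ plays the role of the ~12% volcanic mode.
    ```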

  2. Provenance Challenges for Earth Science Dataset Publication

    Science.gov (United States)

    Tilmes, Curt

    2011-01-01

    Modern science is increasingly dependent on computational analysis of very large data sets. Organizing, referencing, publishing those data has become a complex problem. Published research that depends on such data often fails to cite the data in sufficient detail to allow an independent scientist to reproduce the original experiments and analyses. This paper explores some of the challenges related to data identification, equivalence and reproducibility in the domain of data intensive scientific processing. It will use the example of Earth Science satellite data, but the challenges also apply to other domains.

  3. The Path from Large Earth Science Datasets to Information

    Science.gov (United States)

    Vicente, G. A.

    2013-12-01

    The NASA Goddard Earth Sciences (GES) Data and Information Services Center (DISC) is one of the major Science Mission Directorate (SMD) data centers for the archiving and distribution of Earth Science remote sensing data, products and services. This virtual portal provides convenient access to Atmospheric Composition and Dynamics, Hydrology, Precipitation, Ozone, and model-derived datasets (generated by GSFC's Global Modeling and Assimilation Office), as well as the North American Land Data Assimilation System (NLDAS) and the Global Land Data Assimilation System (GLDAS) data products (both generated by GSFC's Hydrological Sciences Branch). This presentation demonstrates various tools and computational technologies developed at the GES DISC to manage the huge volume of data and products acquired from various missions and programs over the years. It explores approaches to archive, document, distribute, access and analyze Earth Science data and information, and addresses the technical and scientific issues, governance, and user support problems faced by scientists in need of multi-disciplinary datasets. It also discusses data and product metrics, user distribution profiles and lessons learned through interactions with the science communities around the world. Finally, it demonstrates some of the most used data and product visualization and analysis tools developed and maintained by the GES DISC.

  4. A new dataset validation system for the Planetary Science Archive

    Science.gov (United States)

    Manaud, N.; Zender, J.; Heather, D.; Martinez, S.

    2007-08-01

    The Planetary Science Archive (PSA) is the official archive for the Mars Express mission. It received its first data at the end of 2004. These data are delivered by the PI teams to the PSA team as datasets formatted in conformance with the Planetary Data System (PDS) standards. The PI teams are responsible for analyzing and calibrating the instrument data, as well as for producing reduced and calibrated data. They are also responsible for the scientific validation of these data. ESA is responsible for the long-term data archiving and distribution to the scientific community and must ensure, in this regard, that all archived products meet quality standards. To that end, an archive peer review is used to control the quality of the Mars Express science data archiving process. However, a full validation of the archive's content has been missing. An independent review board recently recommended that the completeness of the archive as well as the consistency of the delivered data be validated following well-defined procedures. A new validation software tool is being developed to complete the overall data quality control system functionality. This new tool aims to improve the quality of data and services provided to the scientific community through the PSA, and shall make it possible to track anomalies in datasets and to control their completeness. It shall ensure that PSA end-users: (1) can rely on the results of their queries, (2) will get data products that are suitable for scientific analysis, and (3) can find all science data acquired during a mission. We define dataset validation as the verification and assessment process that checks dataset content against pre-defined top-level criteria, which represent the general characteristics of good-quality datasets. The dataset content that is checked includes the data and all types of information that are essential in the process of deriving scientific results, as well as those interfacing with the PSA database. The validation software tool is a multi-mission tool.

  5. Large-scale Labeled Datasets to Fuel Earth Science Deep Learning Applications

    Science.gov (United States)

    Maskey, M.; Ramachandran, R.; Miller, J.

    2017-12-01

    Deep learning has revolutionized computer vision and natural language processing with various algorithms scaled using high-performance computing. However, generic large-scale labeled datasets such as ImageNet are the fuel that drives the impressive accuracy of deep learning results. Large-scale labeled datasets already exist in domains such as medical science, but creating them in the Earth science domain is a challenge. While there are ways to apply deep learning using limited labeled datasets, there is a need in the Earth sciences for creating large-scale labeled datasets for benchmarking and scaling deep learning applications. At the NASA Marshall Space Flight Center, we are using deep learning for a variety of Earth science applications where we have encountered the need for large-scale labeled datasets. We will discuss our approaches for creating such datasets and why these datasets are just as valuable as deep learning algorithms. We will also describe successful usage of these large-scale labeled datasets with our deep-learning-based applications.

  6. The role of metadata in managing large environmental science datasets. Proceedings

    Energy Technology Data Exchange (ETDEWEB)

    Melton, R.B.; DeVaney, D.M. (eds.) [Pacific Northwest Lab., Richland, WA (United States)]; French, J. C. [Univ. of Virginia (United States)]

    1995-06-01

    The purpose of this workshop was to bring together computer science researchers and environmental sciences data management practitioners to consider the role of metadata in managing large environmental sciences datasets. The objectives included: establishing a common definition of metadata; identifying categories of metadata; defining problems in managing metadata; and defining problems related to linking metadata with primary data.

  7. The Planetary Science Archive (PSA): Exploration and discovery of scientific datasets from ESA's planetary missions

    Science.gov (United States)

    Vallat, C.; Besse, S.; Barbarisi, I.; Arviset, C.; De Marchi, G.; Barthelemy, M.; Coia, D.; Costa, M.; Docasal, R.; Fraga, D.; Heather, D. J.; Lim, T.; Macfarlane, A.; Martinez, S.; Rios, C.; Vallejo, F.; Said, J.

    2017-09-01

    The Planetary Science Archive (PSA) is the European Space Agency's (ESA) repository of science data from all planetary science and exploration missions. The PSA provides access to scientific datasets through various interfaces at http://psa.esa.int. All datasets are scientifically peer-reviewed by independent scientists, and are compliant with the Planetary Data System (PDS) standards. The PSA has started to implement a number of significant improvements, mostly driven by the evolution of the PDS standards, and the growing need for better interfaces and advanced applications to support science exploitation.

  8. Provenance of Earth Science Datasets - How Deep Should One Go?

    Science.gov (United States)

    Ramapriyan, H.; Manipon, G. J. M.; Aulenbach, S.; Duggan, B.; Goldstein, J.; Hua, H.; Tan, D.; Tilmes, C.; Wilson, B. D.; Wolfe, R.; Zednik, S.

    2015-12-01

    For credibility of scientific research, transparency and reproducibility are essential. This fundamental tenet has been emphasized for centuries, and has been receiving increased attention in recent years. The Office of Management and Budget (2002) addressed reproducibility and other aspects of quality and utility of information from federal agencies. Specific guidelines from NASA (2002) are derived from the above. According to these guidelines, "NASA requires a higher standard of quality for information that is considered influential. Influential scientific, financial, or statistical information is defined as NASA information that, when disseminated, will have or does have clear and substantial impact on important public policies or important private sector decisions." For information to be compliant, "the information must be transparent and reproducible to the greatest possible extent." We present how the principles of transparency and reproducibility have been applied to NASA data supporting the Third National Climate Assessment (NCA3). The depth of trace needed of provenance of data used to derive conclusions in NCA3 depends on how the data were used (e.g., qualitatively or quantitatively). Given that the information is diligently maintained in the agency archives, it is possible to trace from a figure in the publication through the datasets, specific files, algorithm versions, instruments used for data collection, and satellites, as well as the individuals and organizations involved in each step. Such trace back permits transparency and reproducibility.

  9. ClimateNet: A Machine Learning dataset for Climate Science Research

    Science.gov (United States)

    Prabhat, M.; Biard, J.; Ganguly, S.; Ames, S.; Kashinath, K.; Kim, S. K.; Kahou, S.; Maharaj, T.; Beckham, C.; O'Brien, T. A.; Wehner, M. F.; Williams, D. N.; Kunkel, K.; Collins, W. D.

    2017-12-01

    Deep Learning techniques have revolutionized commercial applications in computer vision, speech recognition, and control systems. The key to all of these developments was the creation of a curated, labeled dataset, ImageNet, which enabled multiple research groups around the world to develop methods, benchmark performance, and compete with each other. The success of Deep Learning can be largely attributed to the broad availability of this dataset. Our empirical investigations have revealed that Deep Learning is similarly poised to benefit the task of pattern detection in climate science. Unfortunately, labeled datasets, a key prerequisite for training, are hard to find. Individual research groups are typically interested in specialized weather patterns, making it hard to unify and share datasets across groups and institutions. In this work, we propose ClimateNet: a dataset that provides labeled instances of extreme weather patterns, as well as associated raw fields in model and observational output. We develop a schema in NetCDF to enumerate weather pattern classes/types and to store bounding boxes and pixel masks. We are also working on a TensorFlow implementation to natively import such NetCDF datasets, and are providing a reference convolutional architecture for binary classification tasks. Our hope is that researchers in climate science, as well as ML/DL, will be able to use (and extend) ClimateNet to make rapid progress in the application of Deep Learning to climate science research.
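
    A hedged sketch of what such a NetCDF label schema could look like, using the netCDF4 library: a pixel mask over the grid plus per-event bounding boxes and class labels alongside where raw fields would live. The dimension and variable names are illustrative assumptions, not ClimateNet's actual schema.

    ```python
    # Illustrative NetCDF label schema: pixel mask, bounding boxes, class labels.
    from netCDF4 import Dataset
    import numpy as np

    ds = Dataset("climatenet_labels_example.nc", "w")
    ds.createDimension("lat", 180)
    ds.createDimension("lon", 360)
    ds.createDimension("event", None)       # unlimited: one entry per labeled pattern
    ds.createDimension("corner", 4)         # lat0, lon0, lat1, lon1

    mask = ds.createVariable("pixel_mask", "i1", ("lat", "lon"))   # 0 = background
    boxes = ds.createVariable("bbox", "f4", ("event", "corner"))
    labels = ds.createVariable("event_class", "i4", ("event",))
    labels.flag_values = np.array([1, 2], dtype="i4")              # enumerated types
    labels.flag_meanings = "atmospheric_river tropical_cyclone"

    mask[:] = 0                             # no labeled pixels yet
    boxes[0, :] = [30.0, 130.0, 45.0, 160.0]  # one example bounding box
    labels[0] = 1                           # ... labeled as an atmospheric river
    ds.close()
    ```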

  10. Introducing a Web API for Dataset Submission into a NASA Earth Science Data Center

    Science.gov (United States)

    Moroni, D. F.; Quach, N.; Francis-Curley, W.

    2016-12-01

    As the landscape of data becomes increasingly diverse in the domain of Earth science, the challenges of managing and preserving data become more onerous and complex, particularly for data centers on fixed budgets with limited staff. Many solutions already exist to ease the cost burden for the downstream component of the data lifecycle, yet most archive centers are still racing to keep up with the influx of new data that needs to find a quasi-permanent resting place. For instance, having well-defined metadata that is consistent across the entire data landscape provides for well-managed and preserved datasets throughout the latter end of the data lifecycle. Translators between different metadata dialects are already in operational use, and facilitate keeping older datasets relevant in today's world of rapidly evolving metadata standards. However, very little has been done to address the first phase of the lifecycle, which deals with the entry of both data and the corresponding metadata into a system that is traditionally opaque and closed off to external data producers, resulting in a significant bottleneck in the dataset submission process. The ATRAC system was NOAA NCEI's answer to this previously obfuscated barrier for scientists wishing to find a home for their climate data records, providing a web-based entry point to submit timely and accurate metadata and information about a specific dataset. A couple of NASA's Distributed Active Archive Centers (DAACs), including the ASDC and the ORNL DAAC, have implemented their own versions of a web-based dataset and metadata submission form. The Physical Oceanography DAAC (PO.DAAC) is the most recent NASA-operated DAAC to offer its own web-based dataset and metadata submission service to data producers. What makes the PO.DAAC submission service stand out from these pre-existing services is the option of utilizing both a web browser GUI and a RESTful API.
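
    To illustrate the RESTful pattern, here is a minimal sketch of a dataset-submission call. The endpoint URL, JSON field names, and token are hypothetical, not the actual PO.DAAC interface; the point is that the same structured metadata entered through a GUI can be posted programmatically.

    ```python
    # Hypothetical dataset-submission call: structured metadata posted as JSON.
    import requests

    metadata = {
        "title": "Example Sea Surface Salinity L3 Product",
        "temporal_coverage": {"start": "2015-01-01", "end": "2016-12-31"},
        "spatial_coverage": {"west": -180, "east": 180, "south": -90, "north": 90},
        "contact": "producer@example.org",
    }
    resp = requests.post(
        "https://podaac.example.org/api/v1/dataset-submissions",  # hypothetical URL
        json=metadata,
        headers={"Authorization": "Bearer <token>"},              # placeholder token
        timeout=30,
    )
    resp.raise_for_status()
    print("submission id:", resp.json().get("id"))
    ```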

  11. The New Planetary Science Archive (PSA): Exploration and Discovery of Scientific Datasets from ESA's Planetary Missions

    Science.gov (United States)

    Heather, David; Besse, Sebastien; Vallat, Claire; Barbarisi, Isa; Arviset, Christophe; De Marchi, Guido; Barthelemy, Maud; Coia, Daniela; Costa, Marc; Docasal, Ruben; Fraga, Diego; Grotheer, Emmanuel; Lim, Tanya; MacFarlane, Alan; Martinez, Santa; Rios, Carlos; Vallejo, Fran; Saiz, Jaime

    2017-04-01

    The Planetary Science Archive (PSA) is the European Space Agency's (ESA) repository of science data from all planetary science and exploration missions. The PSA provides access to scientific datasets through various interfaces at http://psa.esa.int. All datasets are scientifically peer-reviewed by independent scientists, and are compliant with the Planetary Data System (PDS) standards. The PSA is currently implementing a number of significant improvements, mostly driven by the evolution of the PDS standard, and the growing need for better interfaces and advanced applications to support science exploitation. As of the end of 2016, the PSA hosts data from all of ESA's planetary missions. This includes ESA's first planetary mission, Giotto, which encountered comet 1P/Halley in 1986 with a flyby at 800 km. Science data from Venus Express, Mars Express, Huygens and the SMART-1 mission are also all available at the PSA. The PSA also contains all science data from Rosetta, which explored comet 67P/Churyumov-Gerasimenko and the asteroids Steins and Lutetia. The year 2016 saw the arrival of the ExoMars 2016 data in the archive. In the upcoming years, at least three new projects are foreseen to be fully archived at the PSA. The BepiColombo mission is scheduled for launch in 2018, followed by the ExoMars Rover and Surface Platform (RSP) in 2020, and then the JUpiter ICy moons Explorer (JUICE). All of these will archive their data in the PSA. In addition, a few ground-based support programmes are also available, especially for the Venus Express and Rosetta missions. The newly designed PSA will enhance the user experience and will significantly reduce the complexity for users to find their data, promoting one-click access to the scientific datasets with more customized views when needed. This includes better integration with planetary GIS analysis tools and planetary interoperability services (search and retrieve data, supporting e.g. PDAP and EPN-TAP).
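
    As an illustration of the interoperability services mentioned above, the sketch below queries a planetary archive through an EPN-TAP-style service with pyvo. The service URL is hypothetical, and the table and column names (epn_core, granule_uid, target_name, time_min) follow our reading of the EPN-TAP core model; treat them as assumptions.

    ```python
    # Hedged sketch: an EPN-TAP style query via a Table Access Protocol service.
    import pyvo

    service = pyvo.dal.TAPService("https://example-psa-tap.esa.int/tap")  # hypothetical
    results = service.search(
        "SELECT TOP 10 granule_uid, target_name, time_min "
        "FROM epn_core WHERE target_name = 'Mars'"
    )
    for row in results:
        print(row["granule_uid"], row["time_min"])
    ```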

  12. A Hybrid Neuro-Fuzzy Model For Integrating Large Earth-Science Datasets

    Science.gov (United States)

    Porwal, A.; Carranza, J.; Hale, M.

    2004-12-01

    A GIS-based hybrid neuro-fuzzy approach to integration of large earth-science datasets for mineral prospectivity mapping is described. It implements a Takagi-Sugeno type fuzzy inference system in the framework of a four-layered feed-forward adaptive neural network. Each unique combination of the datasets is considered a feature vector whose components are derived by knowledge-based ordinal encoding of the constituent datasets. A subset of feature vectors with a known output target vector (i.e., unique conditions known to be associated with either a mineralized or a barren location) is used for the training of an adaptive neuro-fuzzy inference system. Training involves iterative adjustment of parameters of the adaptive neuro-fuzzy inference system using a hybrid learning procedure for mapping each training vector to its output target vector with minimum sum of squared error. The trained adaptive neuro-fuzzy inference system is used to process all feature vectors. The output for each feature vector is a value that indicates the extent to which a feature vector belongs to the mineralized class or the barren class. These values are used to generate a prospectivity map. The procedure is demonstrated by an application to regional-scale base metal prospectivity mapping in a study area located in the Aravalli metallogenic province (western India). A comparison of the hybrid neuro-fuzzy approach with pure knowledge-driven fuzzy and pure data-driven neural network approaches indicates that the former offers a superior method for integrating large earth-science datasets for predictive spatial mathematical modelling.
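
    To make the layered architecture concrete, here is a minimal sketch of the forward pass of a first-order Takagi-Sugeno fuzzy inference system of the kind ANFIS adapts. The membership and consequent parameters are random placeholders where ANFIS would learn them through hybrid training; the comments map the computation onto the network stages described above.

    ```python
    # Forward pass of a first-order Takagi-Sugeno fuzzy inference system.
    import numpy as np

    rng = np.random.default_rng(1)
    n_inputs, n_rules = 4, 3                           # e.g., 4 encoded evidence layers
    centers = rng.uniform(0, 1, (n_rules, n_inputs))   # Gaussian membership centers
    widths = np.full((n_rules, n_inputs), 0.3)         # Gaussian membership widths
    p = rng.normal(size=(n_rules, n_inputs))           # consequent linear coefficients
    r = rng.normal(size=n_rules)                       # consequent intercepts

    def ts_inference(x):
        mu = np.exp(-((x - centers) ** 2) / (2 * widths ** 2))  # memberships
        w = mu.prod(axis=1)                                     # rule firing strengths
        w_norm = w / w.sum()                                    # normalized strengths
        f = p @ x + r                                           # linear consequents
        return float(w_norm @ f)                                # weighted output

    x = rng.uniform(0, 1, n_inputs)   # one encoded feature vector (a unique condition)
    print("prospectivity score:", ts_inference(x))
    ```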

  13. Explore Earth Science Datasets for STEM with the NASA GES DISC Online Visualization and Analysis Tool, Giovanni

    Science.gov (United States)

    Liu, Z.; Acker, J.; Kempler, S.

    2016-01-01

    The NASA Goddard Earth Sciences (GES) Data and Information Services Center (DISC) is one of twelve NASA Science Mission Directorate (SMD) data centers that provide Earth science data, information, and services to users around the world, including research and application scientists, students, citizen scientists, etc. The GES DISC is the home (archive) of remote sensing datasets for NASA Precipitation and Hydrology, Atmospheric Composition and Dynamics, etc. To facilitate Earth science data access, the GES DISC has been developing user-friendly data services for users at different levels in different countries. Among them, the Geospatial Interactive Online Visualization ANd aNalysis Infrastructure (Giovanni, http://giovanni.gsfc.nasa.gov) allows users to explore satellite-based datasets using sophisticated analyses and visualization without downloading data or software, which makes it particularly suitable for novices (such as students) to use NASA datasets in STEM (science, technology, engineering and mathematics) activities. In this presentation, we will briefly introduce Giovanni along with examples for STEM activities.

  14. CHARMe Commentary metadata for Climate Science: collecting, linking and sharing user feedback on climate datasets

    Science.gov (United States)

    Blower, Jon; Lawrence, Bryan; Kershaw, Philip; Nagni, Maurizio

    2014-05-01

    The research process can be thought of as an iterative activity, initiated based on prior domain knowledge as well as on a number of external inputs, and producing a range of outputs including datasets, studies and peer-reviewed publications. These outputs may describe the problem under study, the methodology used, the results obtained, etc. In any new publication, the author may cite or comment on other papers or datasets in order to support their research hypothesis. However, as their work progresses, the researcher may draw from many other latent channels of information. These could include, for example, a private conversation following a lecture or during a social dinner, or an opinion expressed concerning some significant event such as an earthquake or a satellite failure. In addition, other public sources of grey literature are important, such as informal papers (e.g., arXiv deposits), reports and studies. The climate science community is not an exception to this pattern; the CHARMe project, funded under the European FP7 framework, is developing an online system for collecting and sharing user feedback on climate datasets. This is to help users judge how suitable such climate data are for an intended application. The user feedback could be comments about assessments, citations, or provenance of the dataset, or other information such as descriptions of uncertainty or data quality. We define this as a distinct category of metadata called Commentary or C-metadata. We link C-metadata with target climate datasets using a Linked Data approach via the Open Annotation data model. In the context of Linked Data, C-metadata plays the role of a resource which, depending on its nature, may be accessed as simple text or as more structured content. The project is implementing a range of software tools to create, search or visualize C-metadata, including a JavaScript plugin enabling this functionality to be integrated in situ with data provider portals.
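
    As an illustration of the Open Annotation approach mentioned above, the snippet below builds a single C-metadata annotation as JSON-LD-style structured content. The context URI points at the W3C Open Annotation vocabulary, but the exact properties, target URI, and comment text here are illustrative assumptions, not the CHARMe project's actual schema.

    ```python
    # Illustrative C-metadata annotation in the Open Annotation style:
    # a free-text comment (body) linked to a climate dataset (target).
    import json

    annotation = {
        "@context": "http://www.w3.org/ns/oa.jsonld",
        "@type": "oa:Annotation",
        "hasBody": {
            "@type": "cnt:ContentAsText",
            "chars": "Known dry bias over the Sahel before 1995 in this product.",
        },
        "hasTarget": "http://catalogue.example.org/dataset/example-climate-record",
        "annotatedBy": "http://orcid.org/0000-0000-0000-0000",  # placeholder ORCID
    }
    print(json.dumps(annotation, indent=2))
    ```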

  15. Linking Disparate Datasets of the Earth Sciences with the SemantEco Annotator

    Science.gov (United States)

    Seyed, P.; Chastain, K.; McGuinness, D. L.

    2013-12-01

    Use of Semantic Web technologies for data management in the Earth sciences (and beyond) has great potential but is still in its early stages, since the challenges of translating data into a more explicit or semantic form for immediate use within applications have not been fully addressed. In this abstract we help address this challenge by introducing the SemantEco Annotator, which enables anyone, regardless of expertise, to semantically annotate tabular Earth science data and translate it into linked data format, while applying the logic inherent in community-standard vocabularies to guide the process. The Annotator was conceived from a desire to unify dataset content from a variety of sources under common vocabularies, for use in semantically enabled web applications. Our current use case employs linked data generated by the Annotator for use in the SemantEco environment, which utilizes semantics to help users explore, search, and visualize water or air quality measurements and species occurrence data through a map-based interface. The generated data can also be used immediately to facilitate discovery and search capabilities within 'big data' environments. The Annotator provides a method for taking information about a dataset that may only be known to its maintainers and making it explicit, in a uniform and machine-readable fashion, such that a person or information system can more easily interpret the underlying structure and meaning. Its primary mechanism is to enable a user to formally describe how columns of a tabular dataset relate to and/or describe entities. For example, if a user identifies columns for latitude and longitude coordinates, we can infer the data refer to points that can be plotted on a map. Further, it can be made explicit that measurements of 'nitrate' and 'NO3-' are of the same entity through vocabulary assignments, making it easier to utilize datasets that use different nomenclatures.
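
    A hedged sketch of the Annotator's core transformation under the stated latitude/longitude example: two annotated columns become W3C WGS84 geo vocabulary triples in linked data form. The base URI and row values are invented; the actual Annotator drives such mappings from community-standard vocabularies rather than hard-coded ones.

    ```python
    # Tabular rows annotated as latitude/longitude, emitted as RDF triples.
    from rdflib import Graph, Literal, Namespace, URIRef
    from rdflib.namespace import RDF, XSD

    GEO = Namespace("http://www.w3.org/2003/01/geo/wgs84_pos#")
    EX = Namespace("http://example.org/obs/")        # hypothetical base URI

    rows = [("site-001", 41.7, -73.9), ("site-002", 42.1, -74.2)]
    g = Graph()
    g.bind("geo", GEO)
    for site, lat, lon in rows:
        s = URIRef(EX + site)
        g.add((s, RDF.type, GEO.Point))              # the row denotes a mappable point
        g.add((s, GEO.lat, Literal(lat, datatype=XSD.decimal)))
        g.add((s, GEO.long, Literal(lon, datatype=XSD.decimal)))
    print(g.serialize(format="turtle"))
    ```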

  16. Immersive Interaction, Manipulation and Analysis of Large 3D Datasets for Planetary and Earth Sciences

    Science.gov (United States)

    Pariser, O.; Calef, F.; Manning, E. M.; Ardulov, V.

    2017-12-01

    We will present the implementation and study of several use cases of utilizing Virtual Reality (VR) for immersive display, interaction and analysis of large and complex 3D datasets. These datasets have been acquired by instruments across several Earth, planetary and solar space robotics missions. First, we will describe the architecture of the common application framework that was developed to ingest data, interface with VR display devices, and program input controllers in various computing environments. Tethered and portable VR technologies will be contrasted and the advantages of each highlighted. We will proceed to present experimental immersive-analytics visual constructs that enable augmentation of 3D datasets with 2D ones such as images and statistical and abstract data. We will conclude by presenting a comparative analysis with traditional visualization applications and sharing the feedback provided by our users: scientists and engineers.

  17. The new Planetary Science Archive: A tool for exploration and discovery of scientific datasets from ESA's planetary missions

    Science.gov (United States)

    Heather, David

    2016-07-01

    Introduction: The Planetary Science Archive (PSA) is the European Space Agency's (ESA) repository of science data from all planetary science and exploration missions. The PSA provides access to scientific datasets through various interfaces (e.g., FTP browser, map-based search, advanced search, and machine interface): http://archives.esac.esa.int/psa All datasets are scientifically peer-reviewed by independent scientists, and are compliant with the Planetary Data System (PDS) standards. Updating the PSA: The PSA is currently implementing a number of significant changes, both to its web-based interface to the scientific community, and to its database structure. The new PSA will be up to date with versions 3 and 4 of the PDS standards, as PDS4 will be used for ESA's upcoming ExoMars and BepiColombo missions. The newly designed PSA homepage will provide direct access to scientific datasets via a text search for targets or missions. This will significantly reduce the complexity for users to find their data and will promote one-click access to the datasets. Additionally, the homepage will provide direct access to advanced views and searches of the datasets. Users will have direct access to documentation, information and tools that are relevant to the scientific use of the dataset, including ancillary datasets, Software Interface Specification (SIS) documents, and any tools/help that the PSA team can provide. A login mechanism will provide additional functionality to aid and ease users' searches (e.g., saving queries, managing default views). Queries to the PSA database will be possible either via the homepage (for simple searches of missions or targets), or through a filter menu for more tailored queries. The filter menu will offer multiple options to search for a particular dataset or product, and will manage queries for both in-situ and remote sensing instruments. Parameters such as start time, phase angle, and heliocentric distance will be emphasized.

  18. NASA's Earth Science Use of Commercially Availiable Remote Sensing Datasets: Cover Image

    Science.gov (United States)

    Underwood, Lauren W.; Goward, Samuel N.; Fearon, Matthew G.; Fletcher, Rose; Garvin, Jim; Hurtt, George

    2008-01-01

    The cover image incorporates high-resolution stereo pairs acquired from the DigitalGlobe(R) QuickBird sensor. It shows a digital elevation model of Meteor Crater, Arizona at approximately 1.3 meter point-spacing. Image analysts used the Leica Photogrammetry Suite to produce the DEM. The outside portion was computed from two QuickBird panchromatic scenes acquired in October 2006, while an Optech laser scan dataset was used for the crater's interior elevations. The crater's terrain model and image drape were created in a NASA Constellation Program project focused on simulating lunar surface environments for prototyping and testing lunar surface mission analysis and planning tools. This work exemplifies NASA's Scientific Data Purchase legacy and commercial high-resolution imagery applications, as scientists use commercial high-resolution data to examine lunar-analog Earth landscapes for advanced planning and trade studies for future lunar surface activities. Other applications include landscape dynamics related to volcanism, hydrologic events, climate change, and ice movement.

  19. Tracking and Establishing Provenance of Earth Science Datasets: A NASA-Based Example

    Science.gov (United States)

    Ramapriyan, Hampapuram K.; Goldstein, Justin C.; Hua, Hook; Wolfe, Robert E.

    2016-01-01

    Information quality is of paramount importance to science. Accurate, scientifically vetted and statistically meaningful and, ideally, reproducible information engenders scientific trust and research opportunities. Not surprisingly, federal bodies (e.g., NASA, NOAA, USGS) have very strictly affirmed the importance of information quality in their product requirements. So-called Highly Influential Scientific Assessments (HISA) such as The Third US National Climate Assessment (NCA3) published in 2014 undergo a very rigorous review process to ensure transparency and credibility. To support the transparency of such reports, the U.S. Global Change Research Program (USGCRP) has developed the Global Change Information System (GCIS). A recent activity was performed to trace the provenance as completely as possible for all NCA3 figures that were predominantly based on NASA data. This poster presents the mechanics of that project and the lessons learned from that activity.

  20. Citizen science datasets reveal drivers of spatial and temporal variation for anthropogenic litter on Great Lakes beaches.

    Science.gov (United States)

    Vincent, Anna; Drag, Nate; Lyandres, Olga; Neville, Sarah; Hoellein, Timothy

    2017-01-15

    Accumulation of anthropogenic litter (AL) on marine beaches and its ecological effects have been a major focus of research. Recent studies suggest AL is also abundant in freshwater environments, but much less research has been conducted in freshwaters relative to oceans. The Adopt-a-Beach™ (AAB) program, administered by the Alliance for the Great Lakes, organizes volunteers to act as citizen scientists by collecting and maintaining data on AL abundance on Great Lakes beaches. Initial assessments of the AAB records quantified sources and abundance of AL on Lake Michigan beaches, and showed that plastic made up >75% of AL on beaches across all five Great Lakes. However, AAB records had not yet been used to examine patterns of AL density and composition among beaches of different substrate types (e.g., parks, rocky, sandy), across land-use categories (e.g., rural, suburban, urban), or among seasons (i.e., spring, summer, and fall). We found that most AL on beaches consists of consumer goods that most likely originate from beach visitors and nearby urban environments, rather than from activities such as shipping, fishing, or illegal dumping. We also demonstrated that urban beaches, and those with sand rather than rocks, had higher AL density relative to other sites. Finally, we found that AL abundance is lowest during the summer, between the US holidays of Memorial Day (last Monday in May) and Labor Day (first Monday in September), at the urban beaches, while other beaches showed no seasonality. This research is a model for utilizing datasets collected by volunteers in citizen science programs, and will contribute to AL management by identifying priority AL types and locations to maximize AL reduction.
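
    For readers who want to reproduce this style of analysis on their own citizen-science records, a minimal sketch follows: litter density grouped by land-use category and season with pandas. Column names and values are illustrative, not the actual AAB schema.

    ```python
    # Aggregate litter density by land use and season (illustrative data).
    import pandas as pd

    df = pd.DataFrame({
        "beach": ["57th St", "Montrose", "Oak St", "Rainbow"],
        "land_use": ["urban", "urban", "urban", "suburban"],
        "season": ["summer", "spring", "summer", "fall"],
        "items_per_m2": [0.8, 1.6, 0.7, 1.1],
    })
    density = df.groupby(["land_use", "season"])["items_per_m2"].mean()
    print(density)
    ```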

  1. Proteomics dataset

    DEFF Research Database (Denmark)

    Bennike, Tue Bjerg; Carlsen, Thomas Gelsing; Ellingsen, Torkell

    2017-01-01

    The datasets presented in this article are related to the research articles entitled “Neutrophil Extracellular Traps in Ulcerative Colitis: A Proteome Analysis of Intestinal Biopsies” (Bennike et al., 2015 [1]), and “Proteome Analysis of Rheumatoid Arthritis Gut Mucosa” (Bennike et al., 2017 [2])...... been deposited to the ProteomeXchange Consortium via the PRIDE partner repository with the dataset identifiers PXD001608 for ulcerative colitis and control samples, and PXD003082 for rheumatoid arthritis samples....

  2. Proteomics dataset

    DEFF Research Database (Denmark)

    Bennike, Tue Bjerg; Carlsen, Thomas Gelsing; Ellingsen, Torkell

    2017-01-01

    patients (Morgan et al., 2012; Abraham and Medzhitov, 2011; Bennike, 2014) [8–10]. Therefore, we characterized the proteome of colon mucosa biopsies from 10 inflammatory bowel disease ulcerative colitis (UC) patients, 11 gastrointestinal healthy rheumatoid arthritis (RA) patients, and 10 controls. We...... been deposited to the ProteomeXchange Consortium via the PRIDE partner repository with the dataset identifiers PXD001608 for ulcerative colitis and control samples, and PXD003082 for rheumatoid arthritis samples.

  3. Earth Science Data Analytics: Bridging Tools and Techniques with the Co-Analysis of Large, Heterogeneous Datasets

    Science.gov (United States)

    Kempler, Steve; Mathews, Tiffany

    2016-01-01

    The continuum of ever-evolving data management systems affords great opportunities for the enhancement of knowledge and the facilitation of science research. To take advantage of these opportunities, it is essential to understand and develop methods that enable data relationships to be examined and the information to be manipulated. This presentation describes the efforts of the Earth Science Information Partners (ESIP) Federation Earth Science Data Analytics (ESDA) Cluster to understand, define, and facilitate the implementation of ESDA to advance science research. Given the near absence of published material on Earth science data analytics, the cluster has defined ESDA along with 10 goals to set the framework for a common understanding of the tools and techniques that are available and still needed to support ESDA.

  4. Application of data science tools to quantify and distinguish between structures and models in molecular dynamics datasets.

    Science.gov (United States)

    Kalidindi, Surya R; Gomberg, Joshua A; Trautt, Zachary T; Becker, Chandler A

    2015-08-28

    Structure quantification is key to successful mining and extraction of core materials knowledge from both multiscale simulations as well as multiscale experiments. The main challenge stems from the need to transform the inherently high dimensional representations demanded by the rich hierarchical material structure into useful, high value, low dimensional representations. In this paper, we develop and demonstrate the merits of a data-driven approach for addressing this challenge at the atomic scale. The approach presented here is built on prior successes demonstrated for mesoscale representations of material internal structure, and involves three main steps: (i) digital representation of the material structure, (ii) extraction of a comprehensive set of structure measures using the framework of n-point spatial correlations, and (iii) identification of data-driven low dimensional measures using principal component analyses. These novel protocols, applied on an ensemble of structure datasets output from molecular dynamics (MD) simulations, have successfully classified the datasets based on several model input parameters such as the interatomic potential and the temperature used in the MD simulations.
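
    A hedged sketch of the three-step protocol on synthetic stand-ins for MD output: (i) a digital (binary) structure representation, (ii) two-point spatial correlations computed via FFT, and (iii) PCA to obtain the low-dimensional structure measures. This is an outline of the published approach, not the authors' code.

    ```python
    # Two-point correlations + PCA as low-dimensional structure measures.
    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(2)
    structures = (rng.random((20, 32, 32)) < 0.4).astype(float)  # 20 binary microstates

    def two_point_autocorr(m):
        # Periodic two-point autocorrelation of one phase via FFT.
        F = np.fft.fftn(m)
        corr = np.fft.ifftn(F * np.conj(F)).real / m.size
        return corr.ravel()

    X = np.array([two_point_autocorr(s) for s in structures])
    pca = PCA(n_components=3)
    scores = pca.fit_transform(X)            # PCA centers the data internally
    print("low-dimensional structure measures:\n", scores[:5])
    ```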

  5. Application of data science tools to quantify and distinguish between structures and models in molecular dynamics datasets

    International Nuclear Information System (INIS)

    Kalidindi, Surya R; Gomberg, Joshua A; Trautt, Zachary T; Becker, Chandler A

    2015-01-01

    Structure quantification is key to successful mining and extraction of core materials knowledge from both multiscale simulations as well as multiscale experiments. The main challenge stems from the need to transform the inherently high dimensional representations demanded by the rich hierarchical material structure into useful, high value, low dimensional representations. In this paper, we develop and demonstrate the merits of a data-driven approach for addressing this challenge at the atomic scale. The approach presented here is built on prior successes demonstrated for mesoscale representations of material internal structure, and involves three main steps: (i) digital representation of the material structure, (ii) extraction of a comprehensive set of structure measures using the framework of n-point spatial correlations, and (iii) identification of data-driven low dimensional measures using principal component analyses. These novel protocols, applied on an ensemble of structure datasets output from molecular dynamics (MD) simulations, have successfully classified the datasets based on several model input parameters such as the interatomic potential and the temperature used in the MD simulations. (paper)

  6. Wind energy prospecting: socio-economic value of a new wind resource assessment technique based on a NASA Earth science dataset

    Science.gov (United States)

    Vanvyve, E.; Magontier, P.; Vandenberghe, F. C.; Delle Monache, L.; Dickinson, K.

    2012-12-01

    Wind energy is amongst the fastest growing sources of renewable energy in the U.S. and could supply up to 20% of U.S. power production by 2030. An accurate and reliable wind resource assessment for prospective wind farm sites is a challenging task, yet is crucial for evaluating the long-term profitability and feasibility of a potential development. We have developed an accurate and computationally efficient wind resource assessment technique for prospective wind farm sites, which incorporates innovative statistical techniques and the new NASA Earth science dataset MERRA. This technique produces a wind resource estimate that is more accurate than that obtained with the wind energy industry's standard technique, while providing a reliable quantification of its uncertainty. The focus now is on evaluating the socio-economic value of this new technique relative to the industry's standard technique. Would it yield lower financing costs? Could it result in lower electricity prices? Are there further down-the-line positive consequences, e.g., job creation, time saved, reduced greenhouse gas emissions? Ultimately, we expect our results will inform efforts to refine and disseminate the new technique to support the development of the U.S. renewable energy infrastructure. To address the above questions, we are carrying out a cost-benefit analysis based on the net present worth of the technique. We will describe this approach, including the cash-flow process of wind farm financing and how the wind resource assessment factors in, and will present current results for various hypothetical candidate wind farm sites.
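
    The net-present-worth calculation at the heart of such a cost-benefit analysis is simple to state; below is a minimal sketch comparing discounted cash flows under the standard and the new resource estimates. All monetary figures and the discount rate are invented for illustration.

    ```python
    # Net present value of a cash-flow stream: NPV = sum_t CF_t / (1 + r)^t.
    import numpy as np

    def npv(rate, cashflows):
        """Net present value; cashflows[0] occurs at time zero."""
        years = np.arange(len(cashflows))
        return float(np.sum(np.asarray(cashflows) / (1 + rate) ** years))

    capex = -120e6                          # illustrative capital cost
    annual_std = [18e6] * 20                # annual revenue, standard assessment
    annual_new = [19e6] * 20                # annual revenue, lower-uncertainty estimate
    print("NPV (standard technique):", npv(0.08, [capex] + annual_std))
    print("NPV (new technique):     ", npv(0.08, [capex] + annual_new))
    ```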

  7. EPA Nanorelease Dataset

    Data.gov (United States)

    U.S. Environmental Protection Agency — EPA Nanorelease Dataset. This dataset is associated with the following publication: Wohlleben, W., C. Kingston, J. Carter, E. Sahle-Demessie, S. Vazquez-Campos, B....

  8. 'Tagger' - a Mac OS X Interactive Graphical Application for Data Inference and Analysis of N-Dimensional Datasets in the Natural Physical Sciences.

    Science.gov (United States)

    Morse, P. E.; Reading, A. M.; Lueg, C.

    2014-12-01

    …observations of important or noticeable events. Other visualisations and modes of interaction will also be demonstrated, with the aim of discovering knowledge in large datasets in the natural, physical sciences. [Figure 1: Wave height data from an oceanographic Wave Rider Buoy; colors/radii are driven by wave height data.]

  9. Aaron Journal article datasets

    Data.gov (United States)

    U.S. Environmental Protection Agency — All figures used in the journal article are in netCDF format. This dataset is associated with the following publication: Sims, A., K. Alapaty , and S. Raman....

  10. Integrated Surface Dataset (Global)

    Data.gov (United States)

    National Oceanic and Atmospheric Administration, Department of Commerce — The Integrated Surface Dataset (ISD) is composed of worldwide surface weather observations from over 35,000 stations, though the best spatial coverage is...

  11. Control Measure Dataset

    Data.gov (United States)

    U.S. Environmental Protection Agency — The EPA Control Measure Dataset is a collection of documents describing air pollution control available to regulated facilities for the control and abatement of air...

  12. National Hydrography Dataset (NHD)

    Data.gov (United States)

    Kansas Data Access and Support Center — The National Hydrography Dataset (NHD) is a feature-based database that interconnects and uniquely identifies the stream segments or reaches that comprise the...

  13. Market Squid Ecology Dataset

    Data.gov (United States)

    National Oceanic and Atmospheric Administration, Department of Commerce — This dataset contains ecological information collected on the major adult spawning and juvenile habitats of market squid off California and the US Pacific Northwest....

  14. Tables and figure datasets

    Data.gov (United States)

    U.S. Environmental Protection Agency — Soil and air concentrations of asbestos in Sumas study. This dataset is associated with the following publication: Wroble, J., T. Frederick, A. Frame, and D....

  15. Isfahan MISP Dataset.

    Science.gov (United States)

    Kashefpur, Masoud; Kafieh, Rahele; Jorjandi, Sahar; Golmohammadi, Hadis; Khodabande, Zahra; Abbasi, Mohammadreza; Teifuri, Nilufar; Fakharzadeh, Ali Akbar; Kashefpoor, Maryam; Rabbani, Hossein

    2017-01-01

    An online repository was introduced to share clinical ground truth with the public and provide open access for researchers to evaluate their computer-aided algorithms. PHP was used for web programming and MySQL for database management. The website was entitled "biosigdata.com." It was a fast, secure, and easy-to-use online database for medical signals and images. Freely registered users could download the datasets and could also share their own supplementary materials while maintaining their privacy (citation and fee). Commenting was also available for all datasets, and automatic sitemap and semi-automatic SEO indexing have been set up for the site. A comprehensive list of available websites for medical datasets is also presented as a Supplementary file (http://journalonweb.com/tempaccess/4800.584.JMSS_55_16I3253.pdf).

  16. Mridangam stroke dataset

    OpenAIRE

    CompMusic

    2014-01-01

    The audio examples were recorded from a professional Carnatic percussionist under semi-anechoic studio conditions by Akshay Anantapadmanabhan, using SM-58 microphones and an H4n ZOOM recorder. The audio was sampled at 44.1 kHz and stored as 16-bit WAV files. The dataset can be used for training models for each Mridangam stroke. A detailed description of the Mridangam and its strokes can be found in the paper below. A part of the dataset was used in the following paper. Akshay Anantapadman...

  17. The GTZAN dataset

    DEFF Research Database (Denmark)

    Sturm, Bob L.

    2013-01-01

    The GTZAN dataset appears in at least 100 published works, and is the most-used public dataset for evaluation in machine listening research for music genre recognition (MGR). Our recent work, however, shows GTZAN has several faults (repetitions, mislabelings, and distortions), which challenge...... of GTZAN, and provide a catalog of its faults. We review how GTZAN has been used in MGR research, and find few indications that its faults have been known and considered. Finally, we rigorously study the effects of its faults on evaluating five different MGR systems. The lesson is not to banish GTZAN...

  18. Dataset - Adviesregel PPL 2010

    NARCIS (Netherlands)

    Evert, van F.K.; Schans, van der D.A.; Geel, van W.C.A.; Slabbekoorn, J.J.; Booij, R.; Jukema, J.N.; Meurs, E.J.J.; Uenk, D.

    2011-01-01

    This dataset contains experimental data from a number of field experiments with potato in The Netherlands (Van Evert et al., 2011). The data are presented as an SQL dump of a PostgreSQL database (version 8.4.4). An outline of the entity-relationship diagram of the database is given in an

  19. 3DSEM: A 3D microscopy dataset

    Directory of Open Access Journals (Sweden)

    Ahmad P. Tafti

    2016-03-01

    The Scanning Electron Microscope (SEM), as a 2D imaging instrument, has been widely used in many scientific disciplines, including the biological, mechanical, and materials sciences, to determine the surface attributes of microscopic objects. However, SEM micrographs remain 2D images. To effectively measure and visualize surface properties, we need to truly restore the 3D shape model from 2D SEM images. Having 3D surfaces would provide the anatomic shape of micro-samples, allowing for quantitative measurements and informative visualization of the specimens being investigated. 3DSEM is a dataset for 3D microscopy vision which is freely available at [1] for any academic, educational, and research purposes. The dataset includes both 2D images and 3D reconstructed surfaces of several real microscopic samples. Keywords: 3D microscopy dataset, 3D microscopy vision, 3D SEM surface reconstruction, Scanning Electron Microscope (SEM)

  20. Mining and Utilizing Dataset Relevancy from Oceanographic Dataset Metadata, Usage Metrics, and User Feedback to Improve Data Discovery and Access

    Data.gov (United States)

    National Aeronautics and Space Administration — We propose to mine and utilize the combination of Earth Science dataset, metadata with usage metrics and user feedback to objectively extract relevance for improved...

  1. Framework for Interactive Parallel Dataset Analysis on the Grid

    Energy Technology Data Exchange (ETDEWEB)

    Alexander, David A.; Ananthan, Balamurali [Tech-X Corp.]; Johnson, Tony; Serbo, Victor [SLAC]

    2007-01-10

    We present a framework for use at a typical Grid site to facilitate custom interactive parallel dataset analysis targeting terabyte-scale datasets of the type typically produced by large multi-institutional science experiments. We summarize the needs for interactive analysis and show a prototype solution that satisfies those needs. The solution consists of a desktop client tool and a set of Web Services that allow scientists to sign onto a Grid site, compose analysis script code to carry out physics analysis on datasets, distribute the code and datasets to worker nodes, collect the results back at the client, and construct professional-quality visualizations of the results.

  2. National Elevation Dataset

    Science.gov (United States)

    ,

    2002-01-01

    The National Elevation Dataset (NED) is a new raster product assembled by the U.S. Geological Survey. NED is designed to provide national elevation data in a seamless form with a consistent datum, elevation unit, and projection. Data corrections were made in the NED assembly process to minimize artifacts, perform edge matching, and fill sliver areas of missing data. NED has a resolution of one arc-second (approximately 30 meters) for the conterminous United States, Hawaii, Puerto Rico and the island territories, and a resolution of two arc-seconds for Alaska. NED data sources have a variety of elevation units, horizontal datums, and map projections. In the NED assembly process the elevation values are converted to decimal meters as a consistent unit of measure, NAD83 is consistently used as the horizontal datum, and all the data are recast in a geographic projection. Older DEMs produced by methods that are now obsolete have been filtered during the NED assembly process to minimize artifacts that are commonly found in data produced by those methods. Artifact removal greatly improves the quality of the slope, shaded-relief, and synthetic drainage information that can be derived from the elevation data. Figure 2 illustrates the results of this artifact removal filtering. NED processing also includes steps to adjust values where adjacent DEMs do not match well, and to fill sliver areas of missing data between DEMs. These processing steps ensure that NED has no void areas and that artificial discontinuities are minimized. The artifact removal filtering process does not eliminate all artifacts; in areas where the only available DEM was produced by older methods, "striping" may still occur.

  3. NP-PAH Interaction Dataset

    Data.gov (United States)

    U.S. Environmental Protection Agency — Dataset presents concentrations of organic pollutants, such as polyaromatic hydrocarbon compounds, in water samples. Water samples of known volume and concentration...

  4. Editorial: Datasets for Learning Analytics

    NARCIS (Netherlands)

    Dietze, Stefan; George, Siemens; Davide, Taibi; Drachsler, Hendrik

    2018-01-01

    The European LinkedUp and LACE (Learning Analytics Community Exchange) projects have been responsible for setting up a series of data challenges at the LAK conferences 2013 and 2014 around the LAK dataset. The LAK dataset consists of a rich collection of full-text publications in the domain of learning analytics.

  5. Open University Learning Analytics dataset.

    Science.gov (United States)

    Kuzilek, Jakub; Hlosta, Martin; Zdrahal, Zdenek

    2017-11-28

    Learning Analytics focuses on the collection and analysis of learners' data to improve their learning experience by providing informed guidance and to optimise learning materials. To support the research in this area we have developed a dataset, containing data from courses presented at the Open University (OU). What makes the dataset unique is the fact that it contains demographic data together with aggregated clickstream data of students' interactions in the Virtual Learning Environment (VLE). This enables the analysis of student behaviour, represented by their actions. The dataset contains the information about 22 courses, 32,593 students, their assessment results, and logs of their interactions with the VLE represented by daily summaries of student clicks (10,655,280 entries). The dataset is freely available at https://analyse.kmi.open.ac.uk/open_dataset under a CC-BY 4.0 license.
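
    A minimal loading sketch for working with the dataset after download; the file names (studentInfo.csv, studentVle.csv) and columns (id_student, sum_click, final_result) follow the dataset's published documentation, but verify them against the copy you obtain.

    ```python
    # Join demographics/outcomes with total VLE clicks per student.
    import pandas as pd

    students = pd.read_csv("studentInfo.csv")    # demographics and final result
    clicks = pd.read_csv("studentVle.csv")       # daily VLE click summaries

    totals = clicks.groupby("id_student")["sum_click"].sum().rename("total_clicks")
    merged = students.merge(totals, on="id_student", how="left")
    print(merged.groupby("final_result")["total_clicks"].median())
    ```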

  6. Turkey Run Landfill Emissions Dataset

    Data.gov (United States)

    U.S. Environmental Protection Agency — landfill emissions measurements for the Turkey run landfill in Georgia. This dataset is associated with the following publication: De la Cruz, F., R. Green, G....

  7. Dataset of NRDA emission data

    Data.gov (United States)

    U.S. Environmental Protection Agency — Emissions data from open air oil burns. This dataset is associated with the following publication: Gullett, B., J. Aurell, A. Holder, B. Mitchell, D. Greenwell, M....

  8. Chemical product and function dataset

    Data.gov (United States)

    U.S. Environmental Protection Agency — Merged product weight fraction and chemical function data. This dataset is associated with the following publication: Isaacs , K., M. Goldsmith, P. Egeghy , K....

  9. An Affinity Propagation Clustering Algorithm for Mixed Numeric and Categorical Datasets

    Directory of Open Access Journals (Sweden)

    Kang Zhang

    2014-01-01

    Clustering has been widely used in different fields of science, technology, social science, and so forth. In the real world, numeric as well as categorical features are usually used to describe data objects. Accordingly, many clustering methods can process datasets that are either numeric or categorical. Recently, algorithms that can handle mixed-data clustering problems have been developed. The affinity propagation (AP) algorithm is an exemplar-based clustering method which has demonstrated good performance on a wide variety of datasets. However, it has limitations in processing mixed datasets. In this paper, we propose a novel similarity measure for mixed-type datasets and an adaptive AP clustering algorithm to cluster such datasets. Several real-world datasets are studied to evaluate the performance of the proposed algorithm. Comparisons with other clustering algorithms demonstrate that the proposed method works well not only on mixed datasets but also on pure numeric and categorical datasets.
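
    A hedged sketch of the general idea (not the paper's exact similarity measure): combine a numeric distance with a categorical mismatch count into one similarity matrix and feed it to scikit-learn's affinity propagation in precomputed mode.

    ```python
    # Mixed-type similarity matrix fed to affinity propagation.
    import numpy as np
    from sklearn.cluster import AffinityPropagation

    num = np.array([[1.0, 2.0], [1.1, 1.9], [8.0, 9.0], [7.9, 9.2]])
    cat = np.array([["a", "x"], ["a", "x"], ["b", "y"], ["b", "y"]])

    n = len(num)
    S = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            d_num = np.sum((num[i] - num[j]) ** 2)   # numeric: squared distance
            d_cat = np.sum(cat[i] != cat[j])         # categorical: mismatch count
            S[i, j] = -(d_num + d_cat)               # similarity = negative distance

    ap = AffinityPropagation(affinity="precomputed", random_state=0)
    print("cluster labels:", ap.fit_predict(S))
    ```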

  10. The NOAA Dataset Identifier Project

    Science.gov (United States)

    de la Beaujardiere, J.; Mccullough, H.; Casey, K. S.

    2013-12-01

    The US National Oceanic and Atmospheric Administration (NOAA) initiated a project in 2013 to assign persistent identifiers to datasets archived at NOAA and to create informational landing pages about those datasets. The goals of this project are to enable the citation of datasets used in products and results in order to help provide credit to data producers, to support traceability and reproducibility, and to enable tracking of data usage and impact. A secondary goal is to encourage the submission of datasets for long-term preservation, because only archived datasets will be eligible for a NOAA-issued identifier. A team was formed with representatives from the National Geophysical, Oceanographic, and Climatic Data Centers (NGDC, NODC, NCDC) to resolve questions including which identifier scheme to use (answer: Digital Object Identifier - DOI), whether or not to embed semantics in identifiers (no), the level of granularity at which to assign identifiers (as coarsely as reasonable), how to handle ongoing time-series data (do not break into chunks), creation mechanism for the landing page (stylesheet from formal metadata record preferred), and others. Decisions made and implementation experience gained will inform the writing of a Data Citation Procedural Directive to be issued by the Environmental Data Management Committee in 2014. Several identifiers have been issued as of July 2013, with more on the way. NOAA is now reporting the number as a metric to federal Open Government initiatives. This paper will provide further details and status of the project.

  11. The Harvard organic photovoltaic dataset.

    Science.gov (United States)

    Lopez, Steven A; Pyzer-Knapp, Edward O; Simm, Gregor N; Lutzow, Trevor; Li, Kewei; Seress, Laszlo R; Hachmann, Johannes; Aspuru-Guzik, Alán

    2016-09-27

    The Harvard Organic Photovoltaic Dataset (HOPV15) presented in this work is a collation of experimental photovoltaic data from the literature, and corresponding quantum-chemical calculations performed over a range of conformers, each with quantum chemical results using a variety of density functionals and basis sets. It is anticipated that this dataset will be of use in both relating electronic structure calculations to experimental observations through the generation of calibration schemes, as well as for the creation of new semi-empirical methods and the benchmarking of current and future model chemistries for organic electronic applications.

  12. The Harvard organic photovoltaic dataset

    Science.gov (United States)

    Lopez, Steven A.; Pyzer-Knapp, Edward O.; Simm, Gregor N.; Lutzow, Trevor; Li, Kewei; Seress, Laszlo R.; Hachmann, Johannes; Aspuru-Guzik, Alán

    2016-01-01

    The Harvard Organic Photovoltaic Dataset (HOPV15) presented in this work is a collation of experimental photovoltaic data from the literature and corresponding quantum-chemical calculations performed over a range of conformers, each with quantum-chemical results using a variety of density functionals and basis sets. It is anticipated that this dataset will be of use both in relating electronic structure calculations to experimental observations through the generation of calibration schemes, and in the creation of new semi-empirical methods and the benchmarking of current and future model chemistries for organic electronic applications. PMID:27676312

  13. Querying Large Biological Network Datasets

    Science.gov (United States)

    Gulsoy, Gunhan

    2013-01-01

    New experimental methods have resulted in an increasing amount of genetic interaction data being generated every day. Biological networks are used to store the genetic interaction data gathered. The increasing amount of available data requires fast, large-scale analysis methods. Therefore, we address the problem of querying large biological network datasets.…

  14. Fluxnet Synthesis Dataset Collaboration Infrastructure

    Energy Technology Data Exchange (ETDEWEB)

    Agarwal, Deborah A. [Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States); Humphrey, Marty [Univ. of Virginia, Charlottesville, VA (United States); van Ingen, Catharine [Microsoft. San Francisco, CA (United States); Beekwilder, Norm [Univ. of Virginia, Charlottesville, VA (United States); Goode, Monte [Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States); Jackson, Keith [Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States); Rodriguez, Matt [Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States); Weber, Robin [Univ. of California, Berkeley, CA (United States)

    2008-02-06

    The Fluxnet synthesis dataset originally compiled for the La Thuile workshop contained approximately 600 site-years. Since the workshop, several additional site-years have been added and the dataset now contains over 920 site-years from over 240 sites. A data refresh update is expected to increase those numbers in the next few months. The ancillary data describing the sites continue to evolve as well. There are on the order of 120 site contacts, and 60 proposals involving around 120 researchers have been approved to use the data. The size and complexity of the dataset and collaboration have led to a new approach to providing access to the data and collaboration support. The support team attended the workshop and worked closely with the attendees and the Fluxnet project office to define the requirements for the support infrastructure. As a result of this effort, a new website (http://www.fluxdata.org) has been created to provide access to the Fluxnet synthesis dataset. This new web site is based on a scientific data server which enables browsing of the data on-line, data download, and version tracking. We leverage database and data analysis tools such as OLAP data cubes and web reports to enable browser and Excel pivot-table access to the data.

  15. Heuristics for Relevancy Ranking of Earth Dataset Search Results

    Science.gov (United States)

    Lynnes, Christopher; Quinn, Patrick; Norton, James

    2016-01-01

    As the variety of Earth science datasets increases, science researchers find it more challenging to discover and select the datasets that best fit their needs. The most common way for search providers to address this problem is to rank the datasets returned for a query by their likely relevance to the user. Large web page search engines typically use text matching supplemented with reverse link counts, semantic annotations and user intent modeling. However, this produces uneven results when applied to dataset metadata records simply externalized as a web page. Fortunately, data and search providers have decades of experience in serving data user communities, allowing them to form heuristics that leverage the structure in the metadata together with knowledge about the user community. Some of these heuristics include specific ways of matching the user input to the essential measurements in the dataset and determining overlaps of time range and spatial areas. Heuristics based on the novelty of the datasets can prioritize later, better versions of data over similar predecessors. And knowledge of how different user types and communities use data can be brought to bear in cases where characteristics of the user (discipline, expertise) or their intent (applications, research) can be divined. The Earth Observing System Data and Information System has begun implementing some of these heuristics in the relevancy algorithm of its Common Metadata Repository search engine.
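
    Heuristics like those described can be combined into a simple additive score; the sketch below is an illustration under assumed weights, not the actual Common Metadata Repository algorithm:

    ```python
    from datetime import date

    def temporal_overlap(q_start, q_end, d_start, d_end):
        """Fraction of the query time range covered by the dataset's range."""
        overlap_days = (min(q_end, d_end) - max(q_start, d_start)).days
        return max(overlap_days, 0) / max((q_end - q_start).days, 1)

    def relevance(query_terms, dataset, q_start, q_end):
        # Keyword heuristic: fraction of query terms among the dataset's measurements
        kw = sum(t in dataset["measurements"] for t in query_terms) / len(query_terms)
        t = temporal_overlap(q_start, q_end, dataset["start"], dataset["end"])
        version_boost = 0.1 * dataset["version"]  # prefer later versions (assumed weight)
        return kw + t + version_boost

    ds = {"measurements": {"sea surface temperature", "chlorophyll"},
          "start": date(2000, 1, 1), "end": date(2015, 12, 31), "version": 3}
    print(relevance({"sea surface temperature"}, ds, date(2010, 1, 1), date(2012, 1, 1)))
    ```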

  16. CERC Dataset (Full Hadza Data)

    DEFF Research Database (Denmark)

    2016-01-01

    The dataset includes demographic, behavioral, and religiosity data from eight different populations from around the world. The samples were drawn from: (1) Coastal and (2) Inland Tanna, Vanuatu; (3) Hadzaland, Tanzania; (4) Lovu, Fiji; (5) Pointe aux Piment, Mauritius; (6) Pesqueiro, Brazil; (7) Kyzyl, Tyva Republic; and (8) Yasawa, Fiji. Related publication: Purzycki, et al. (2016). Moralistic Gods, Supernatural Punishment and the Expansion of Human Sociality. Nature, 530(7590): 327-330.

  17. Cross-Cultural Concept Mapping of Standardized Datasets

    DEFF Research Database (Denmark)

    Kano Glückstad, Fumiko

    2012-01-01

    This work compares four feature-based similarity measures derived from cognitive sciences. The purpose of the comparative analysis is to verify the potentially most effective model that can be applied for mapping independent ontologies in a culturally influenced domain [1]. Here, datasets based...

  18. Viking Seismometer PDS Archive Dataset

    Science.gov (United States)

    Lorenz, R. D.

    2016-12-01

    The Viking Lander 2 seismometer operated successfully for over 500 Sols on the Martian surface, recording at least one likely candidate Marsquake. The Viking mission, in an era when data handling hardware (both on board and on the ground) was limited in capability, predated modern planetary data archiving, and ad-hoc repositories of the data, and the very low-level record at NSSDC, were neither convenient to process nor well-known. In an effort supported by the NASA Mars Data Analysis Program, we have converted the bulk of the Viking dataset (namely the 49,000 and 270,000 records made in High- and Event- modes at 20 and 1 Hz respectively) into a simple ASCII table format. Additionally, since wind-generated lander motion is a major component of the signal, contemporaneous meteorological data are included in summary records to facilitate correlation. These datasets are being archived at the PDS Geosciences Node. In addition to brief instrument and dataset descriptions, the archive includes code snippets in the freely-available language 'R' to demonstrate plotting and analysis. Further, we present examples of lander-generated noise, associated with the sampler arm, instrument dumps and other mechanical operations.

  19. PHYSICS PERFORMANCE AND DATASET (PPD)

    CERN Multimedia

    L. Silvestris

    2013-01-01

    The first part of the Long Shutdown period has been dedicated to the preparation of the samples for the analysis targeting the summer conferences. In particular, the 8 TeV data acquired in 2012, including most of the “parked datasets”, have been reconstructed profiting from improved alignment and calibration conditions for all the sub-detectors. A careful planning of the resources was essential in order to deliver the datasets well in time to the analysts, and to schedule the update of all the conditions and calibrations needed at the analysis level. The newly reprocessed data have undergone detailed scrutiny by the Dataset Certification team allowing to recover some of the data for analysis usage and further improving the certification efficiency, which is now at 91% of the recorded luminosity. With the aim of delivering a consistent dataset for 2011 and 2012, both in terms of conditions and release (53X), the PPD team is now working to set up a data re-reconstruction and a new MC pro...

  20. RARD: The Related-Article Recommendation Dataset

    OpenAIRE

    Beel, Joeran; Carevic, Zeljko; Schaible, Johann; Neusch, Gabor

    2017-01-01

    Recommender-system datasets are used for recommender-system evaluations, training machine-learning algorithms, and exploring user behavior. While there are many datasets for recommender systems in the domains of movies, books, and music, there are rather few datasets from research-paper recommender systems. In this paper, we introduce RARD, the Related-Article Recommendation Dataset, from the digital library Sowiport and the recommendation-as-a-service provider Mr. DLib. The dataset contains ...

  1. Development of a SPARK Training Dataset

    Energy Technology Data Exchange (ETDEWEB)

    Sayre, Amanda M. [Pacific Northwest National Lab. (PNNL), Richland, WA (United States); Olson, Jarrod R. [Pacific Northwest National Lab. (PNNL), Richland, WA (United States)

    2015-03-01

    In its first five years, the National Nuclear Security Administration’s (NNSA) Next Generation Safeguards Initiative (NGSI) sponsored more than 400 undergraduate, graduate, and post-doctoral students in internships and research positions (Wyse 2012). In the past seven years, the NGSI program has produced, and continues to produce, a large body of scientific, technical, and policy work in targeted core safeguards capabilities and human capital development activities. Not only does the NGSI program carry out activities across multiple disciplines, but also across all U.S. Department of Energy (DOE)/NNSA locations in the United States. However, products are not readily shared among disciplines and across locations, nor are they archived in a comprehensive library. Rather, knowledge of NGSI-produced literature is localized to the researchers, clients, and internal laboratory/facility publication systems such as the Electronic Records and Information Capture Architecture (ERICA) at the Pacific Northwest National Laboratory (PNNL). There is also no incorporated way of analyzing existing NGSI literature to determine whether the larger NGSI program is achieving its core safeguards capabilities and activities. A complete library of NGSI literature could prove beneficial to a cohesive, sustainable, and more economical NGSI program. The Safeguards Platform for Automated Retrieval of Knowledge (SPARK) has been developed as a knowledge storage, retrieval, and analysis capability to capture safeguards knowledge that will exist beyond the lifespan of NGSI. During the development process, it was necessary to build a SPARK training dataset (a corpus of documents) for initial entry into the system and for demonstration purposes. We manipulated these data to gain new information about the breadth of NGSI publications, and evaluated the science-policy interface at PNNL as a practical demonstration of SPARK’s intended analysis capability. The analysis demonstration sought to answer the

  2. Development of a SPARK Training Dataset

    International Nuclear Information System (INIS)

    Sayre, Amanda M.; Olson, Jarrod R.

    2015-01-01

    In its first five years, the National Nuclear Security Administration's (NNSA) Next Generation Safeguards Initiative (NGSI) sponsored more than 400 undergraduate, graduate, and post-doctoral students in internships and research positions (Wyse 2012). In the past seven years, the NGSI program has produced, and continues to produce, a large body of scientific, technical, and policy work in targeted core safeguards capabilities and human capital development activities. Not only does the NGSI program carry out activities across multiple disciplines, but also across all U.S. Department of Energy (DOE)/NNSA locations in the United States. However, products are not readily shared among disciplines and across locations, nor are they archived in a comprehensive library. Rather, knowledge of NGSI-produced literature is localized to the researchers, clients, and internal laboratory/facility publication systems such as the Electronic Records and Information Capture Architecture (ERICA) at the Pacific Northwest National Laboratory (PNNL). There is also no incorporated way of analyzing existing NGSI literature to determine whether the larger NGSI program is achieving its core safeguards capabilities and activities. A complete library of NGSI literature could prove beneficial to a cohesive, sustainable, and more economical NGSI program. The Safeguards Platform for Automated Retrieval of Knowledge (SPARK) has been developed as a knowledge storage, retrieval, and analysis capability to capture safeguards knowledge that will exist beyond the lifespan of NGSI. During the development process, it was necessary to build a SPARK training dataset (a corpus of documents) for initial entry into the system and for demonstration purposes. We manipulated these data to gain new information about the breadth of NGSI publications, and evaluated the science-policy interface at PNNL as a practical demonstration of SPARK's intended analysis capability. The analysis demonstration sought to answer

  3. Passive Containment DataSet

    Science.gov (United States)

    This data is for Figures 6 and 7 in the journal article. The data also includes the two EPANET input files used for the analysis described in the paper, one for the looped system and one for the block system.This dataset is associated with the following publication:Grayman, W., R. Murray , and D. Savic. Redesign of Water Distribution Systems for Passive Containment of Contamination. JOURNAL OF THE AMERICAN WATER WORKS ASSOCIATION. American Water Works Association, Denver, CO, USA, 108(7): 381-391, (2016).

  4. Toward computational cumulative biology by combining models of biological datasets.

    Science.gov (United States)

    Faisal, Ali; Peltonen, Jaakko; Georgii, Elisabeth; Rung, Johan; Kaski, Samuel

    2014-01-01

    A main challenge of data-driven sciences is how to make maximal use of the progressively expanding databases of experimental datasets in order to keep research cumulative. We introduce the idea of a modeling-based dataset retrieval engine designed for relating a researcher's experimental dataset to earlier work in the field. The search is (i) data-driven to enable new findings, going beyond the state of the art of keyword searches in annotations, (ii) modeling-driven, to include both biological knowledge and insights learned from data, and (iii) scalable, as it is accomplished without building one unified grand model of all data. Assuming each dataset has been modeled beforehand, by the researchers or automatically by database managers, we apply a rapidly computable and optimizable combination model to decompose a new dataset into contributions from earlier relevant models. By using the data-driven decomposition, we identify a network of interrelated datasets from a large annotated human gene expression atlas. While tissue type and disease were major driving forces for determining relevant datasets, the found relationships were richer, and the model-based search was more accurate than the keyword search; moreover, it recovered biologically meaningful relationships that are not straightforwardly visible from annotations-for instance, between cells in different developmental stages such as thymocytes and T-cells. Data-driven links and citations matched to a large extent; the data-driven links even uncovered corrections to the publication data, as two of the most linked datasets were not highly cited and turned out to have wrong publication entries in the database.
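
    The decomposition step described above can be sketched as a non-negative least-squares problem: express a new dataset as a weighted combination of earlier model "signatures". The signatures below are random stand-ins, not the learned models from the paper.

    ```python
    import numpy as np
    from scipy.optimize import nnls

    rng = np.random.default_rng(0)
    models = rng.random((100, 5))            # columns: signatures of 5 earlier models
    true_w = np.array([0.7, 0.0, 0.3, 0.0, 0.0])
    new_dataset = models @ true_w + 0.01 * rng.standard_normal(100)

    weights, _ = nnls(models, new_dataset)   # non-negative contribution weights
    print(np.round(weights, 2))              # large weights mark related earlier models
    ```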

  5. The CMS dataset bookkeeping service

    Science.gov (United States)

    Afaq, A.; Dolgert, A.; Guo, Y.; Jones, C.; Kosyakov, S.; Kuznetsov, V.; Lueking, L.; Riley, D.; Sekhri, V.

    2008-07-01

    The CMS Dataset Bookkeeping Service (DBS) has been developed to catalog all CMS event data from Monte Carlo and Detector sources. It provides the ability to identify MC or trigger source, track data provenance, construct datasets for analysis, and discover interesting data. CMS requires processing and analysis activities at various service levels and the DBS system provides support for localized processing or private analysis, as well as global access for CMS users at large. Catalog entries can be moved among the various service levels with a simple set of migration tools, thus forming a loose federation of databases. DBS is available to CMS users via Python API, command-line, and Discovery web page interfaces. The system is built as a multi-tier web application with Java servlets running under Tomcat, with connections via JDBC to Oracle or MySQL database backends. Clients connect to the service through HTTP or HTTPS with authentication provided by GRID certificates and authorization through VOMS. DBS is an integral part of the overall CMS Data Management and Workflow Management systems.
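
    Client access of the kind described (HTTPS plus grid-certificate authentication) can be exercised with a generic HTTP client; in the sketch below the endpoint, parameters, and certificate paths are placeholders rather than actual DBS URLs or API calls:

    ```python
    import requests

    # Placeholder endpoint and credentials; real DBS service URLs and
    # grid-certificate paths are deployment-specific.
    url = "https://cmsdbs.example.org/dbs"
    cert = ("/path/to/usercert.pem", "/path/to/userkey.pem")

    resp = requests.get(url, params={"api": "listDatasets", "pattern": "/Higgs*"},
                        cert=cert, timeout=30)
    print(resp.status_code)
    ```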

  6. The CMS dataset bookkeeping service

    Energy Technology Data Exchange (ETDEWEB)

    Afaq, A; Guo, Y; Kosyakov, S; Lueking, L; Sekhri, V [Fermilab, Batavia, Illinois 60510 (United States); Dolgert, A; Jones, C; Kuznetsov, V; Riley, D [Cornell University, Ithaca, New York 14850 (United States)

    2008-07-15

    The CMS Dataset Bookkeeping Service (DBS) has been developed to catalog all CMS event data from Monte Carlo and Detector sources. It provides the ability to identify MC or trigger source, track data provenance, construct datasets for analysis, and discover interesting data. CMS requires processing and analysis activities at various service levels and the DBS system provides support for localized processing or private analysis, as well as global access for CMS users at large. Catalog entries can be moved among the various service levels with a simple set of migration tools, thus forming a loose federation of databases. DBS is available to CMS users via Python API, command-line, and Discovery web page interfaces. The system is built as a multi-tier web application with Java servlets running under Tomcat, with connections via JDBC to Oracle or MySQL database backends. Clients connect to the service through HTTP or HTTPS with authentication provided by GRID certificates and authorization through VOMS. DBS is an integral part of the overall CMS Data Management and Workflow Management systems.

  7. The CMS dataset bookkeeping service

    International Nuclear Information System (INIS)

    Afaq, A; Guo, Y; Kosyakov, S; Lueking, L; Sekhri, V; Dolgert, A; Jones, C; Kuznetsov, V; Riley, D

    2008-01-01

    The CMS Dataset Bookkeeping Service (DBS) has been developed to catalog all CMS event data from Monte Carlo and Detector sources. It provides the ability to identify MC or trigger source, track data provenance, construct datasets for analysis, and discover interesting data. CMS requires processing and analysis activities at various service levels and the DBS system provides support for localized processing or private analysis, as well as global access for CMS users at large. Catalog entries can be moved among the various service levels with a simple set of migration tools, thus forming a loose federation of databases. DBS is available to CMS users via Python API, command-line, and Discovery web page interfaces. The system is built as a multi-tier web application with Java servlets running under Tomcat, with connections via JDBC to Oracle or MySQL database backends. Clients connect to the service through HTTP or HTTPS with authentication provided by GRID certificates and authorization through VOMS. DBS is an integral part of the overall CMS Data Management and Workflow Management systems.

  8. The CMS dataset bookkeeping service

    International Nuclear Information System (INIS)

    Afaq, Anzar; Dolgert, Andrew; Guo, Yuyi; Jones, Chris; Kosyakov, Sergey; Kuznetsov, Valentin; Lueking, Lee; Riley, Dan; Sekhri, Vijay

    2007-01-01

    The CMS Dataset Bookkeeping Service (DBS) has been developed to catalog all CMS event data from Monte Carlo and Detector sources. It provides the ability to identify MC or trigger source, track data provenance, construct datasets for analysis, and discover interesting data. CMS requires processing and analysis activities at various service levels and the DBS system provides support for localized processing or private analysis, as well as global access for CMS users at large. Catalog entries can be moved among the various service levels with a simple set of migration tools, thus forming a loose federation of databases. DBS is available to CMS users via Python API, command-line, and Discovery web page interfaces. The system is built as a multi-tier web application with Java servlets running under Tomcat, with connections via JDBC to Oracle or MySQL database backends. Clients connect to the service through HTTP or HTTPS with authentication provided by GRID certificates and authorization through VOMS. DBS is an integral part of the overall CMS Data Management and Workflow Management systems.

  9. 2008 TIGER/Line Nationwide Dataset

    Data.gov (United States)

    California Natural Resource Agency — This dataset contains a nationwide build of the 2008 TIGER/Line datasets from the US Census Bureau downloaded in April 2009. The TIGER/Line Shapefiles are an extract...

  10. Satellite-Based Precipitation Datasets

    Science.gov (United States)

    Munchak, S. J.; Huffman, G. J.

    2017-12-01

    Of the possible sources of precipitation data, those based on satellites provide the greatest spatial coverage. There is a wide selection of datasets, algorithms, and versions from which to choose, which can be confusing to non-specialists wishing to use the data. The International Precipitation Working Group (IPWG) maintains tables of the major publicly available, long-term, quasi-global precipitation data sets (http://www.isac.cnr.it/ipwg/data/datasets.html), and this talk briefly reviews the various categories. As examples, NASA provides two sets of quasi-global precipitation data sets: the older Tropical Rainfall Measuring Mission (TRMM) Multi-satellite Precipitation Analysis (TMPA) and current Integrated Multi-satellitE Retrievals for Global Precipitation Measurement (GPM) mission (IMERG). Both provide near-real-time and post-real-time products that are uniformly gridded in space and time. The TMPA products are 3-hourly 0.25°x0.25° on the latitude band 50°N-S for about 16 years, while the IMERG products are half-hourly 0.1°x0.1° on 60°N-S for over 3 years (with plans to go to 16+ years in Spring 2018). In addition to the precipitation estimates, each data set provides fields of other variables, such as the satellite sensor providing estimates and estimated random error. The discussion concludes with advice about determining suitability for use, the necessity of being clear about product names and versions, and the need for continued support for satellite- and surface-based observation.
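
    Gridded products such as TMPA and IMERG are distributed as netCDF/HDF5 files that can be subset in space and time; a minimal sketch with xarray, where the file and variable names are assumptions that vary by product and version:

    ```python
    import xarray as xr

    # File and variable names are hypothetical; consult the product documentation.
    ds = xr.open_dataset("gpm_imerg_monthly.nc")
    precip = ds["precipitation"]

    # Subset a tropical box (assumes ascending lat/lon coordinates) and average in time
    box = precip.sel(lat=slice(-10, 10), lon=slice(90, 150))
    print(box.mean(dim="time"))
    ```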

  11. The OXL format for the exchange of integrated datasets

    Directory of Open Access Journals (Sweden)

    Taubert Jan

    2007-12-01

    A prerequisite for systems biology is the integration and analysis of heterogeneous experimental data stored in hundreds of life-science databases and millions of scientific publications. Several standardised formats for the exchange of specific kinds of biological information exist. Such exchange languages facilitate the integration process; however, they are not designed to transport integrated datasets. A format for exchanging integrated datasets needs to (i) cover data from a broad range of application domains, (ii) be flexible and extensible to combine many different complex data structures, (iii) include metadata and semantic definitions, (iv) include inferred information, (v) identify the original data source for integrated entities and (vi) transport large integrated datasets. Unfortunately, none of the exchange formats from the biological domain (e.g. BioPAX, MAGE-ML, PSI-MI, SBML) or the generic approaches (RDF, OWL) fulfil these requirements in a systematic way.

  12. An integrated pan-tropical biomass map using multiple reference datasets

    NARCIS (Netherlands)

    Avitabile, V.; Herold, M.; Heuvelink, G.B.M.; Lewis, S.L.; Phillips, O.L.; Asner, G.P.; Armston, J.; Asthon, P.; Banin, L.F.; Bayol, N.; Berry, N.; Boeckx, P.; Jong, De B.; Devries, B.; Girardin, C.; Kearsley, E.; Lindsell, J.A.; Lopez-gonzalez, G.; Lucas, R.; Malhi, Y.; Morel, A.; Mitchard, E.; Nagy, L.; Qie, L.; Quinones, M.; Ryan, C.M.; Slik, F.; Sunderland, T.; Vaglio Laurin, G.; Valentini, R.; Verbeeck, H.; Wijaya, A.; Willcock, S.

    2016-01-01

    We combined two existing datasets of vegetation aboveground biomass (AGB) (Proceedings of the National Academy of Sciences of the United States of America, 108, 2011, 9899; Nature Climate Change, 2, 2012, 182) into a pan-tropical AGB map at 1-km resolution using an independent reference dataset of

  13. Wind and wave dataset for Matara, Sri Lanka

    Science.gov (United States)

    Luo, Yao; Wang, Dongxiao; Priyadarshana Gamage, Tilak; Zhou, Fenghua; Madusanka Widanage, Charith; Liu, Taiwei

    2018-01-01

    We present a continuous in situ hydro-meteorology observational dataset from a set of instruments first deployed in December 2012 in the south of Sri Lanka, facing toward the north Indian Ocean. In these waters, simultaneous records of wind and wave data are sparse due to difficulties in deploying measurement instruments, although the area hosts one of the busiest shipping lanes in the world. This study describes the survey, deployment, and measurements of wind and waves, with the aim of offering future users of the dataset the most comprehensive information possible. This dataset advances our understanding of the nearshore hydrodynamic processes and wave climate, including sea waves and swells, in the north Indian Ocean. Moreover, it is a valuable resource for ocean model parameterization and validation. The archived dataset (Table 1) is examined in detail, including wave data at two locations with water depths of 20 and 10 m comprising synchronous time series of wind, ocean astronomical tide, air pressure, etc. In addition, we use these wave observations to evaluate the ERA-Interim reanalysis product. Based on Buoy 2 data, the swells are the main component of waves year-round, although monsoons can markedly alter the proportion between swell and wind sea. The dataset (Luo et al., 2017) is publicly available from Science Data Bank (https://doi.org/10.11922/sciencedb.447).

  14. Wind and wave dataset for Matara, Sri Lanka

    Directory of Open Access Journals (Sweden)

    Y. Luo

    2018-01-01

    We present a continuous in situ hydro-meteorology observational dataset from a set of instruments first deployed in December 2012 in the south of Sri Lanka, facing toward the north Indian Ocean. In these waters, simultaneous records of wind and wave data are sparse due to difficulties in deploying measurement instruments, although the area hosts one of the busiest shipping lanes in the world. This study describes the survey, deployment, and measurements of wind and waves, with the aim of offering future users of the dataset the most comprehensive information possible. This dataset advances our understanding of the nearshore hydrodynamic processes and wave climate, including sea waves and swells, in the north Indian Ocean. Moreover, it is a valuable resource for ocean model parameterization and validation. The archived dataset (Table 1) is examined in detail, including wave data at two locations with water depths of 20 and 10 m comprising synchronous time series of wind, ocean astronomical tide, air pressure, etc. In addition, we use these wave observations to evaluate the ERA-Interim reanalysis product. Based on Buoy 2 data, the swells are the main component of waves year-round, although monsoons can markedly alter the proportion between swell and wind sea. The dataset (Luo et al., 2017) is publicly available from Science Data Bank (https://doi.org/10.11922/sciencedb.447).

  15. Map Coordinate Referencing and the use of GPS Datasets in Ghana ...

    African Journals Online (AJOL)

    Map Coordinate Referencing and the use of GPS Datasets in Ghana. ... Journal of Science and Technology (Ghana) ... systems used in Ghana (the Ghana War Office system and also the Clarke 1880 system) using the Bursa-Wolf model.
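
    The Bursa-Wolf (seven-parameter) model relates Cartesian coordinates in two datums through three translations, three small rotations, and a scale factor; a minimal small-angle sketch with illustrative parameters (not the actual Ghana War Office to WGS84 values):

    ```python
    import numpy as np

    def bursa_wolf(xyz, tx, ty, tz, rx, ry, rz, s_ppm):
        """Seven-parameter datum transformation, small-angle form.
        Translations in metres, rotations in radians, scale in parts per million.
        Note: rotation sign conventions differ between published variants."""
        R = np.array([[1.0, -rz,  ry],
                      [ rz, 1.0, -rx],
                      [-ry,  rx, 1.0]])
        return np.array([tx, ty, tz]) + (1.0 + s_ppm * 1e-6) * (R @ xyz)

    p = np.array([6378137.0, 0.0, 0.0])   # a point on the equator, for illustration
    print(bursa_wolf(p, tx=-130.0, ty=29.0, tz=364.0, rx=0.0, ry=0.0, rz=0.0, s_ppm=1.0))
    ```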

  16. PHYSICS PERFORMANCE AND DATASET (PPD)

    CERN Multimedia

    L. Silvestris

    2012-01-01

      Introduction The first part of the year presented an important test for the new Physics Performance and Dataset (PPD) group (cf. its mandate: http://cern.ch/go/8f77). The activity was focused on the validation of the new releases meant for the Monte Carlo (MC) production and the data-processing in 2012 (CMSSW 50X and 52X), and on the preparation of the 2012 operations. In view of the Chamonix meeting, the PPD and physics groups worked to understand the impact of the higher pile-up scenario on some of the flagship Higgs analyses to better quantify the impact of the high luminosity on the CMS physics potential. A task force is working on the optimisation of the reconstruction algorithms and on the code to cope with the performance requirements imposed by the higher event occupancy as foreseen for 2012. Concerning the preparation for the analysis of the new data, a new MC production has been prepared. The new samples, simulated at 8 TeV, are already being produced and the digitisation and recons...

  17. Pattern Analysis On Banking Dataset

    Directory of Open Access Journals (Sweden)

    Amritpal Singh

    2015-06-01

    The everyday refinement and development of technology has increased competition among technology companies, making data mining a strategically and security-wise important area for many business organizations, including the banking sector. Data mining allows the analysis of important information in the data warehouse and assists banks in looking for obscure patterns within a group and discovering unknown relationships in the data. Banking systems need to process ample amounts of data on a daily basis related to customer information, credit card details, limit and collateral details, transaction details, risk profiles, Anti-Money-Laundering information, and trade finance data. Thousands of decisions based on these data are taken in a bank daily. This paper analyzes a banking dataset in the Weka environment to detect interesting patterns, with applications in customer acquisition, customer retention, management and marketing, and the management of risk and detection of fraud.

  18. PHYSICS PERFORMANCE AND DATASET (PPD)

    CERN Multimedia

    L. Silvestris

    2013-01-01

    The PPD activities, in the first part of 2013, have been focused mostly on the final physics validation and preparation for the data reprocessing of the full 8 TeV datasets with the latest calibrations. These samples will be the basis for the preliminary results for summer 2013 but most importantly for the final publications on the 8 TeV Run 1 data. The reprocessing involves also the reconstruction of a significant fraction of “parked data” that will allow CMS to perform a whole new set of precision analyses and searches. In this way the CMSSW release 53X is becoming the legacy release for the 8 TeV Run 1 data. The regular operation activities have included taking care of the prolonged proton-proton data taking and the run with proton-lead collisions that ended in February. The DQM and Data Certification team has deployed a continuous effort to promptly certify the quality of the data. The luminosity-weighted certification efficiency (requiring all sub-detectors to be certified as usab...

  19. Internationally coordinated glacier monitoring: strategy and datasets

    Science.gov (United States)

    Hoelzle, Martin; Armstrong, Richard; Fetterer, Florence; Gärtner-Roer, Isabelle; Haeberli, Wilfried; Kääb, Andreas; Kargel, Jeff; Nussbaumer, Samuel; Paul, Frank; Raup, Bruce; Zemp, Michael

    2014-05-01

    Internationally coordinated monitoring of long-term glacier changes provides key indicator data about global climate change and began in the year 1894 as an internationally coordinated effort to establish standardized observations. Today, world-wide monitoring of glaciers and ice caps is embedded within the Global Climate Observing System (GCOS) in support of the United Nations Framework Convention on Climate Change (UNFCCC) as an important Essential Climate Variable (ECV). The Global Terrestrial Network for Glaciers (GTN-G) was established in 1999 with the task of coordinating measurements and ensuring the continuous development and adaptation of the international strategies to the long-term needs of users in science and policy. The basic monitoring principles must be relevant, feasible, comprehensive and understandable to a wider scientific community as well as to policy makers and the general public. Data access has to be free and unrestricted, the quality of the standardized and calibrated data must be high, and a combination of detailed process studies at selected field sites with global coverage by satellite remote sensing is envisaged. Recently a GTN-G Steering Committee was established to guide and advise the operational bodies responsible for international glacier monitoring, which are the World Glacier Monitoring Service (WGMS), the US National Snow and Ice Data Center (NSIDC), and the Global Land Ice Measurements from Space (GLIMS) initiative. Several online databases containing a wealth of diverse data types having different levels of detail and global coverage provide fast access to continuously updated information on glacier fluctuation and inventory data. For world-wide inventories, data are now available through (a) the World Glacier Inventory containing tabular information of about 130,000 glaciers covering an area of around 240,000 km2, (b) the GLIMS-database containing digital outlines of around 118,000 glaciers with different time stamps and

  20. The Geometry of Finite Equilibrium Datasets

    DEFF Research Database (Denmark)

    Balasko, Yves; Tvede, Mich

    We investigate the geometry of finite datasets defined by equilibrium prices, income distributions, and total resources. We show that the equilibrium condition imposes no restrictions if total resources are collinear, a property that is robust to small perturbations. We also show that the set of equilibrium datasets is path-connected when the equilibrium condition does impose restrictions on datasets, as for example when total resources are widely non-collinear.

  1. IPCC Socio-Economic Baseline Dataset

    Data.gov (United States)

    National Aeronautics and Space Administration — The Intergovernmental Panel on Climate Change (IPCC) Socio-Economic Baseline Dataset consists of population, human development, economic, water resources, land...

  2. Veterans Affairs Suicide Prevention Synthetic Dataset

    Data.gov (United States)

    Department of Veterans Affairs — The VA's Veteran Health Administration, in support of the Open Data Initiative, is providing the Veterans Affairs Suicide Prevention Synthetic Dataset (VASPSD). The...

  3. Nanoparticle-organic pollutant interaction dataset

    Data.gov (United States)

    U.S. Environmental Protection Agency — Dataset presents concentrations of organic pollutants, such as polyaromatic hydrocarbon compounds, in water samples. Water samples of known volume and concentration...

  4. An Annotated Dataset of 14 Meat Images

    DEFF Research Database (Denmark)

    Stegmann, Mikkel Bille

    2002-01-01

    This note describes a dataset consisting of 14 annotated images of meat. Points of correspondence are placed on each image. As such, the dataset can be readily used for building statistical models of shape. Further, format specifications and terms of use are given.

  5. Would the ‘real’ observed dataset stand up? A critical examination of eight observed gridded climate datasets for China

    International Nuclear Information System (INIS)

    Sun, Qiaohong; Miao, Chiyuan; Duan, Qingyun; Kong, Dongxian; Ye, Aizhong; Di, Zhenhua; Gong, Wei

    2014-01-01

    This research compared and evaluated the spatio-temporal similarities and differences of eight widely used gridded datasets. The datasets include daily precipitation over East Asia (EA), the Climatic Research Unit (CRU) product, the Global Precipitation Climatology Centre (GPCC) product, the University of Delaware (UDEL) product, Precipitation Reconstruction over Land (PREC/L), the Asian Precipitation Highly Resolved Observational (APHRO) product, the Institute of Atmospheric Physics (IAP) dataset from the Chinese Academy of Sciences, and the National Meteorological Information Center dataset from the China Meteorological Administration (CN05). The meteorological variables focus on surface air temperature (SAT) or precipitation (PR) in China. All datasets presented general agreement on the whole spatio-temporal scale, but some differences appeared for specific periods and regions. On a temporal scale, EA shows the highest amount of PR, while APHRO shows the lowest. CRU and UDEL show higher SAT than IAP or CN05. On a spatial scale, the most significant differences occur in western China for PR and SAT. For PR, the difference between EA and CRU is the largest. When compared with CN05, CRU shows higher SAT in the central and southern Northwest river drainage basin, UDEL exhibits higher SAT over the Southwest river drainage system, and IAP has lower SAT in the Tibetan Plateau. The differences in annual mean PR and SAT primarily come from summer and winter, respectively. Finally, potential factors impacting agreement among gridded climate datasets are discussed, including raw data sources, quality control (QC) schemes, orographic correction, and interpolation techniques. The implications and challenges of these results for climate research are also briefly addressed. (paper)

  6. SIMADL: Simulated Activities of Daily Living Dataset

    Directory of Open Access Journals (Sweden)

    Talal Alshammari

    2018-04-01

    With the realisation of the Internet of Things (IoT) paradigm, the analysis of the Activities of Daily Living (ADLs) in a smart home environment is becoming an active research domain. The existence of representative datasets is a key requirement to advance the research in smart home design. Such datasets are an integral part of the visualisation of new smart home concepts as well as the validation and evaluation of emerging machine learning models. Machine learning techniques that can learn ADLs from sensor readings are used to classify, predict and detect anomalous patterns. Such techniques require data that represent relevant smart home scenarios, for training, testing and validation. However, the development of such machine learning techniques is limited by the lack of real smart home datasets, due to the excessive cost of building real smart homes. This paper provides two datasets for classification and anomaly detection. The datasets are generated using OpenSHS (Open Smart Home Simulator), which is a simulation software for dataset generation. OpenSHS records the daily activities of a participant within a virtual environment. Seven participants simulated their ADLs for different contexts, e.g., weekdays, weekends, mornings and evenings. Eighty-four files in total were generated, representing approximately 63 days' worth of activities. Forty-two files of classification of ADLs were simulated for the classification dataset and the other forty-two files are for anomaly detection problems, in which anomalous patterns were simulated and injected into the anomaly detection dataset.
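
    Datasets of this kind are typically consumed as per-timestep logs of binary sensor states plus an activity label; a minimal loading sketch in which the file name and column names are assumptions, not the actual SIMADL schema:

    ```python
    import pandas as pd

    # Hypothetical file and column names; consult the SIMADL documentation.
    df = pd.read_csv("participant1_weekdays.csv")
    X = df.drop(columns=["activity"])   # binary sensor readings per time step
    y = df["activity"]                  # ADL label for classification
    print(X.shape, y.value_counts().head())
    ```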

  7. ASSISTments Dataset from Multiple Randomized Controlled Experiments

    Science.gov (United States)

    Selent, Douglas; Patikorn, Thanaporn; Heffernan, Neil

    2016-01-01

    In this paper, we present a dataset consisting of data generated from 22 previously and currently running randomized controlled experiments inside the ASSISTments online learning platform. This dataset provides data mining opportunities for researchers to analyze ASSISTments data in a convenient format across multiple experiments at the same time.…

  8. Synthetic and Empirical Capsicum Annuum Image Dataset

    NARCIS (Netherlands)

    Barth, R.

    2016-01-01

    This dataset consists of per-pixel annotated synthetic (10500) and empirical images (50) of Capsicum annuum, also known as sweet or bell pepper, situated in a commercial greenhouse. Furthermore, the source models to generate the synthetic images are included. The aim of the datasets are to

  9. Design of an audio advertisement dataset

    Science.gov (United States)

    Fu, Yutao; Liu, Jihong; Zhang, Qi; Geng, Yuting

    2015-12-01

    Since more and more advertisements swarm into radio broadcasts, it is necessary to establish an audio advertisement dataset which could be used to analyze and classify advertisements. A method for establishing a complete audio advertisement dataset is presented in this paper. The dataset is divided into four different kinds of advertisements. Each advertisement sample is given in *.wav file format and annotated with a txt file which contains its file name, sampling frequency, channel number, broadcasting time and its class. The soundness of the classification scheme used in this dataset is demonstrated by clustering the different advertisements based on Principal Component Analysis (PCA). The experimental results show that this audio advertisement dataset offers a reliable set of samples for related audio advertisement experimental studies.
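
    The PCA-based clustering check described above can be reproduced in outline: extract a feature vector from each .wav file, reduce with PCA, and cluster. The sketch below uses crude spectral features and hypothetical file names; it is not the authors' exact pipeline.

    ```python
    import numpy as np
    from scipy.io import wavfile
    from sklearn.decomposition import PCA
    from sklearn.cluster import KMeans

    def spectral_features(path, n_bins=64):
        """Crude feature vector: mean magnitude spectrum folded into n_bins."""
        rate, data = wavfile.read(path)
        if data.ndim > 1:
            data = data.mean(axis=1)  # mix down to mono
        spectrum = np.abs(np.fft.rfft(data))
        return np.array([chunk.mean() for chunk in np.array_split(spectrum, n_bins)])

    files = ["ad1.wav", "ad2.wav", "ad3.wav", "ad4.wav"]  # hypothetical samples
    X = np.vstack([spectral_features(f) for f in files])
    X2 = PCA(n_components=2).fit_transform(X)
    print(KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X2))
    ```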

  10. science

    International Development Research Centre (IDRC) Digital Library (Canada)

    David Spurgeon

    Give us the tools: science and technology for development. Ottawa.

  11. The Kinetics Human Action Video Dataset

    OpenAIRE

    Kay, Will; Carreira, Joao; Simonyan, Karen; Zhang, Brian; Hillier, Chloe; Vijayanarasimhan, Sudheendra; Viola, Fabio; Green, Tim; Back, Trevor; Natsev, Paul; Suleyman, Mustafa; Zisserman, Andrew

    2017-01-01

    We describe the DeepMind Kinetics human action video dataset. The dataset contains 400 human action classes, with at least 400 video clips for each action. Each clip lasts around 10s and is taken from a different YouTube video. The actions are human focussed and cover a broad range of classes including human-object interactions such as playing instruments, as well as human-human interactions such as shaking hands. We describe the statistics of the dataset, how it was collected, and give some ...

  12. BASE MAP DATASET, LOS ANGELES COUNTY, CALIFORNIA

    Data.gov (United States)

    Federal Emergency Management Agency, Department of Homeland Security — FEMA Framework Basemap datasets comprise six of the seven FGDC themes of geospatial data that are used by most GIS applications (Note: the seventh framework theme,...

  13. BASE MAP DATASET, CHEROKEE COUNTY, SOUTH CAROLINA

    Data.gov (United States)

    Federal Emergency Management Agency, Department of Homeland Security — FEMA Framework Basemap datasets comprise six of the seven FGDC themes of geospatial data that are used by most GIS applications (Note: the seventh framework theme,...

  14. SIAM 2007 Text Mining Competition dataset

    Data.gov (United States)

    National Aeronautics and Space Administration — Subject Area: Text Mining Description: This is the dataset used for the SIAM 2007 Text Mining competition. This competition focused on developing text mining...

  15. Harvard Aging Brain Study : Dataset and accessibility

    NARCIS (Netherlands)

    Dagley, Alexander; LaPoint, Molly; Huijbers, Willem; Hedden, Trey; McLaren, Donald G.; Chatwal, Jasmeer P.; Papp, Kathryn V.; Amariglio, Rebecca E.; Blacker, Deborah; Rentz, Dorene M.; Johnson, Keith A.; Sperling, Reisa A.; Schultz, Aaron P.

    2017-01-01

    The Harvard Aging Brain Study is sharing its data with the global research community. The longitudinal dataset consists of a 284-subject cohort with the following modalities acquired: demographics, clinical assessment, comprehensive neuropsychological testing, clinical biomarkers, and neuroimaging.

  16. BASE MAP DATASET, HONOLULU COUNTY, HAWAII, USA

    Data.gov (United States)

    Federal Emergency Management Agency, Department of Homeland Security — FEMA Framework Basemap datasets comprise six of the seven FGDC themes of geospatial data that are used by most GIS applications (Note: the seventh framework theme,...

  17. BASE MAP DATASET, EDGEFIELD COUNTY, SOUTH CAROLINA

    Data.gov (United States)

    Federal Emergency Management Agency, Department of Homeland Security — FEMA Framework Basemap datasets comprise six of the seven FGDC themes of geospatial data that are used by most GIS applications (Note: the seventh framework theme,...

  18. Simulation of Smart Home Activity Datasets

    Directory of Open Access Journals (Sweden)

    Jonathan Synnott

    2015-06-01

    A globally ageing population is resulting in an increased prevalence of chronic conditions which affect older adults. Such conditions require long-term care and management to maximize quality of life, placing an increasing strain on healthcare resources. Intelligent environments such as smart homes facilitate long-term monitoring of activities in the home through the use of sensor technology. Access to sensor datasets is necessary for the development of novel activity monitoring and recognition approaches. Access to such datasets is limited due to issues such as sensor cost, availability and deployment time. The use of simulated environments and sensors may address these issues and facilitate the generation of comprehensive datasets. This paper provides a review of existing approaches for the generation of simulated smart home activity datasets, including model-based approaches and interactive approaches which implement virtual sensors, environments and avatars. The paper also provides recommendations for future work in intelligent environment simulation.

  19. Simulation of Smart Home Activity Datasets.

    Science.gov (United States)

    Synnott, Jonathan; Nugent, Chris; Jeffers, Paul

    2015-06-16

    A globally ageing population is resulting in an increased prevalence of chronic conditions which affect older adults. Such conditions require long-term care and management to maximize quality of life, placing an increasing strain on healthcare resources. Intelligent environments such as smart homes facilitate long-term monitoring of activities in the home through the use of sensor technology. Access to sensor datasets is necessary for the development of novel activity monitoring and recognition approaches. Access to such datasets is limited due to issues such as sensor cost, availability and deployment time. The use of simulated environments and sensors may address these issues and facilitate the generation of comprehensive datasets. This paper provides a review of existing approaches for the generation of simulated smart home activity datasets, including model-based approaches and interactive approaches which implement virtual sensors, environments and avatars. The paper also provides recommendations for future work in intelligent environment simulation.

  20. Environmental Dataset Gateway (EDG) REST Interface

    Data.gov (United States)

    U.S. Environmental Protection Agency — Use the Environmental Dataset Gateway (EDG) to find and access EPA's environmental resources. Many options are available for easily reusing EDG content in other...

  1. BASE MAP DATASET, INYO COUNTY, OKLAHOMA

    Data.gov (United States)

    Federal Emergency Management Agency, Department of Homeland Security — FEMA Framework Basemap datasets comprise six of the seven FGDC themes of geospatial data that are used by most GIS applications (Note: the seventh framework theme,...

  2. BASE MAP DATASET, JACKSON COUNTY, OKLAHOMA

    Data.gov (United States)

    Federal Emergency Management Agency, Department of Homeland Security — FEMA Framework Basemap datasets comprise six of the seven FGDC themes of geospatial data that are used by most GIS applications (Note: the seventh framework theme,...

  3. BASE MAP DATASET, SANTA CRUZ COUNTY, CALIFORNIA

    Data.gov (United States)

    Federal Emergency Management Agency, Department of Homeland Security — FEMA Framework Basemap datasets comprise six of the seven FGDC themes of geospatial data that are used by most GIS applications (Note: the seventh framework theme,...

  4. Climate Prediction Center IR 4km Dataset

    Data.gov (United States)

    National Oceanic and Atmospheric Administration, Department of Commerce — CPC IR 4km dataset was created from all available individual geostationary satellite data which have been merged to form nearly seamless global (60N-60S) IR...

  5. BASE MAP DATASET, MAYES COUNTY, OKLAHOMA, USA

    Data.gov (United States)

    Federal Emergency Management Agency, Department of Homeland Security — FEMA Framework Basemap datasets comprise six of the seven FGDC themes of geospatial data that are used by most GIS applications: cadastral, geodetic control,...

  6. BASE MAP DATASET, KINGFISHER COUNTY, OKLAHOMA

    Data.gov (United States)

    Federal Emergency Management Agency, Department of Homeland Security — FEMA Framework Basemap datasets comprise six of the seven FGDC themes of geospatial data that are used by most GIS applications (Note: the seventh framework theme,...

  7. BDML Datasets - SSBD | LSDB Archive [Life Science Database Archive metadata]

    Lifescience Database Archive (English)

    BDML file; basedOn: specifies whether the data was gathered from an in vivo measurement or from a computer simulation. Contributor: names of contributors who produced/collected the data.

  8. Comparison of recent SnIa datasets

    International Nuclear Information System (INIS)

    Sanchez, J.C. Bueno; Perivolaropoulos, L.; Nesseris, S.

    2009-01-01

    We rank the six latest Type Ia supernova (SnIa) datasets (Constitution (C), Union (U), ESSENCE (Davis) (E), Gold06 (G), SNLS 1yr (S) and SDSS-II (D)) in the context of the Chevallier-Polarski-Linder (CPL) parametrization w(a) = w0 + w1(1 − a), according to their Figure of Merit (FoM), their consistency with the cosmological constant (ΛCDM), their consistency with standard rulers (Cosmic Microwave Background (CMB) and Baryon Acoustic Oscillations (BAO)) and their mutual consistency. We find a significant improvement of the FoM (defined as the inverse area of the 95.4% parameter contour) with the number of SnIa of these datasets ((C) highest FoM, (U), (G), (D), (E), (S) lowest FoM). Standard rulers (CMB+BAO) have a better FoM by about a factor of 3, compared to the highest FoM SnIa dataset (C). We also find that the ranking sequence based on consistency with ΛCDM is identical with the corresponding ranking based on consistency with standard rulers ((S) most consistent, (D), (C), (E), (U), (G) least consistent). The ranking sequence of the datasets however changes when we consider the consistency with an expansion history corresponding to evolving dark energy (w0, w1) = (−1.4, 2) crossing the phantom divide line w = −1 (it is practically reversed to (G), (U), (E), (S), (D), (C)). The SALT2 and MLCS2k2 fitters are also compared and some peculiar features of the SDSS-II dataset when standardized with the MLCS2k2 fitter are pointed out. Finally, we construct a statistic to estimate the internal consistency of a collection of SnIa datasets. We find that even though there is good consistency among most samples taken from the above datasets, this consistency decreases significantly when the Gold06 (G) dataset is included in the sample.
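
    The CPL parametrization used for the ranking is straightforward to evaluate; a minimal sketch of w(a), including the phantom-crossing example quoted above:

    ```python
    import numpy as np

    def w_cpl(a, w0, w1):
        """Chevallier-Polarski-Linder equation of state: w(a) = w0 + w1 * (1 - a)."""
        return w0 + w1 * (1.0 - a)

    z = np.linspace(0.0, 2.0, 5)
    a = 1.0 / (1.0 + z)               # scale factor from redshift
    print(w_cpl(a, w0=-1.4, w1=2.0))  # crosses the phantom divide w = -1 near z = 0.25
    ```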

  9. An integrated pan-tropical biomass map using multiple reference datasets

    OpenAIRE

    Avitabile, V.; Herold, M.; Heuvelink, G. B. M.; Lewis, S. L.; Phillips, O. L.; Asner, G. P.; Armston, J.; Ashton, P. S.; Banin, L.; Bayol, N.; Berry, N. J.; Boeckx, P.; de Jong, B. H. J.; DeVries, B.; Girardin, C. A. J.

    2016-01-01

    We combined two existing datasets of vegetation aboveground biomass (AGB) (Proceedings of the National Academy of Sciences of the United States of America, 108, 2011, 9899; Nature Climate Change, 2, 2012, 182) into a pan-tropical AGB map at 1-km resolution using an independent reference dataset of field observations and locally calibrated high-resolution biomass maps, harmonized and upscaled to 14 477 1-km AGB estimates. Our data fusion approach uses bias removal and weighted linear averaging...

  10. Public Availability to ECS Collected Datasets

    Science.gov (United States)

    Henderson, J. F.; Warnken, R.; McLean, S. J.; Lim, E.; Varner, J. D.

    2013-12-01

    Coastal nations have spent considerable resources exploring the limits of their extended continental shelf (ECS) beyond 200 nm. Although these studies are funded to fulfill requirements of the UN Convention on the Law of the Sea, the investments are producing new data sets in frontier areas of Earth's oceans that will be used to understand, explore, and manage the seafloor and sub-seafloor for decades to come. Although many of these datasets are considered proprietary until a nation's potential ECS has become 'final and binding', an increasing amount of data is being released and utilized by the public. Data sets include multibeam, seismic reflection/refraction, bottom sampling, and geophysical data. The U.S. ECS Project, a multi-agency collaboration whose mission is to establish the full extent of the continental shelf of the United States consistent with international law, relies heavily on data and accurate, standard metadata. The United States has made it a priority to make available to the public all data collected with ECS-funding as quickly as possible. The National Oceanic and Atmospheric Administration's (NOAA) National Geophysical Data Center (NGDC) supports this objective by partnering with academia and other federal government mapping agencies to archive, inventory, and deliver marine mapping data in a coordinated, consistent manner. This includes ensuring quality, standard metadata and developing and maintaining data delivery capabilities built on modern digital data archives. Other countries, such as Ireland, have submitted their ECS data for public availability and many others have made pledges to participate in the future. The data services provided by NGDC support the U.S. ECS effort as well as many developing nations' ECS efforts through the U.N. Environmental Program. Modern discovery, visualization, and delivery of scientific data and derived products that span national and international sources of data ensure the greatest re-use of data and

  11. Comparison of Shallow Survey 2012 Multibeam Datasets

    Science.gov (United States)

    Ramirez, T. M.

    2012-12-01

    The purpose of the Shallow Survey common dataset is a comparison of the different technologies utilized for data acquisition in the shallow survey marine environment. The common dataset consists of a series of surveys conducted over a common area of seabed using a variety of systems. It provides equipment manufacturers the opportunity to showcase their latest systems while giving hydrographic researchers and scientists a chance to test their latest algorithms on the dataset so that rigorous comparisons can be made. Five companies collected data for the Common Dataset in the Wellington Harbor area in New Zealand between May 2010 and May 2011, including Kongsberg, Reson, R2Sonic, GeoAcoustics, and Applied Acoustics. The Wellington harbor and surrounding coastal area was selected since it has a number of well-defined features, including the HMNZS South Seas and HMNZS Wellington wrecks, an armored seawall constructed of Tetrapods and Akmons, aquifers, wharves and marinas. The seabed inside the harbor basin is largely fine-grained sediment, with gravel and reefs around the coast. The area outside the harbor on the southern coast is an active environment, with moving sand and exposed reefs. A marine reserve is also in this area. For consistency between datasets, the coastal research vessel R/V Ikatere and crew were used for all surveys conducted for the common dataset. Using Triton's Perspective processing software, multibeam datasets collected for the Shallow Survey were processed for detailed analysis. Datasets from each sonar manufacturer were processed using the CUBE algorithm developed by the Center for Coastal and Ocean Mapping/Joint Hydrographic Center (CCOM/JHC). Each dataset was gridded at 0.5 and 1.0 meter resolutions for cross comparison and compliance with International Hydrographic Organization (IHO) requirements. Detailed comparisons were made of equipment specifications (transmit frequency, number of beams, beam width), data density, total uncertainty, and

  12. Scientific Datasets: Discovery and Aggregation for Semantic Interpretation.

    Science.gov (United States)

    Lopez, L. A.; Scott, S.; Khalsa, S. J. S.; Duerr, R.

    2015-12-01

    One of the biggest challenges that interdisciplinary researchers face is finding suitable datasets in order to advance their science; this problem remains consistent across multiple disciplines. A surprising number of scientists, when asked what tool they use for data discovery, reply "Google", which is an acceptable solution in some cases, but not even Google can find, or cares to compile, all the data that is relevant for science, particularly the geosciences. If a dataset is not discoverable through a well-known search provider, it will remain dark data to the scientific world. For the past year, BCube, an EarthCube Building Block project, has been developing, testing and deploying a technology stack capable of data discovery at web-scale using the ultimate dataset: the Internet. This stack has two principal components, a web-scale crawling infrastructure and a semantic aggregator. The web crawler is a modified version of Apache Nutch (the originator of Hadoop and other big data technologies) that has been improved and tailored for data and data-service discovery. The second component is semantic aggregation, carried out by a Python-based workflow that extracts valuable metadata and stores it in the form of triples through the use of semantic technologies. While implementing the BCube stack we have run into several challenges, such as a) scaling the project to cover big portions of the Internet at a reasonable cost, b) making sense of very diverse and non-homogeneous data, and lastly, c) extracting facts about these datasets using semantic technologies in order to make them usable for the geosciences community. Despite all these challenges we have proven that we can discover and characterize data that otherwise would have remained in the dark corners of the Internet. Having all this data indexed and 'triplelized' will enable scientists to access a trove of information relevant to their work in a more natural way. An important characteristic of the BCube stack is that all
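
    The "store it in the form of triples" step can be pictured with a minimal sketch using rdflib; the DCAT/Dublin Core vocabulary and the metadata fields below are illustrative assumptions, not the actual BCube workflow or schema.

        from rdflib import Graph, Literal, Namespace, URIRef
        from rdflib.namespace import DCTERMS, RDF

        DCAT = Namespace("http://www.w3.org/ns/dcat#")

        def triplify(dataset_url, title, keywords):
            """Store crawled dataset metadata as RDF triples."""
            g = Graph()
            ds = URIRef(dataset_url)
            g.add((ds, RDF.type, DCAT.Dataset))
            g.add((ds, DCTERMS.title, Literal(title)))
            for kw in keywords:
                g.add((ds, DCAT.keyword, Literal(kw)))
            return g

        # Hypothetical dataset URL and metadata, for illustration only
        g = triplify("http://example.org/data/seaice.nc",
                     "Daily sea ice extent", ["cryosphere", "sea ice"])
        print(g.serialize(format="turtle"))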

  13. Data Mining for Imbalanced Datasets: An Overview

    Science.gov (United States)

    Chawla, Nitesh V.

    A dataset is imbalanced if the classification categories are not approximately equally represented. Recent years brought increased interest in applying machine learning techniques to difficult "real-world" problems, many of which are characterized by imbalanced data. Additionally, the distribution of the testing data may differ from that of the training data, and the true misclassification costs may be unknown at learning time. Predictive accuracy, a popular choice for evaluating the performance of a classifier, might not be appropriate when the data is imbalanced and/or the costs of different errors vary markedly. In this chapter, we discuss some of the sampling techniques used for balancing datasets, and the performance measures more appropriate for mining imbalanced datasets.
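
    As a minimal illustration of one such sampling technique, the sketch below balances a binary dataset by randomly oversampling the minority class; more sophisticated methods such as SMOTE generate synthetic minority examples instead. X and y are assumed to be NumPy arrays; this is not code from the chapter itself.

        import numpy as np

        def random_oversample(X, y, minority_label, seed=0):
            """Balance a binary dataset by resampling the minority class with replacement."""
            rng = np.random.default_rng(seed)
            minority = np.where(y == minority_label)[0]
            majority = np.where(y != minority_label)[0]
            extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
            idx = np.concatenate([majority, minority, extra])
            return X[idx], y[idx]  # classes now equally represented

    With balanced data, evaluation would then use measures such as precision, recall, or ROC curves rather than raw accuracy, in line with the chapter's argument.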

  14. Genomics dataset of unidentified disclosed isolates

    Directory of Open Access Journals (Sweden)

    Bhagwan N. Rekadwad

    2016-09-01

    Full Text Available Analysis of DNA sequences is necessary for the higher hierarchical classification of organisms. It gives clues about the characteristics of organisms and their taxonomic position. This dataset was chosen to find complexities in the unidentified DNA in the disclosed patents. A total of 17 unidentified DNA sequences were thoroughly analyzed. Quick response (QR) codes were generated, and an analysis of the AT/GC content of the DNA sequences was carried out. The QR codes are helpful for quick identification of isolates, while the AT/GC content is helpful for studying their stability at different temperatures. Additionally, a dataset of cleavage codes and enzyme codes from the restriction digestion study, which is helpful for performing studies using short DNA sequences, was reported. The dataset disclosed here is new revelatory data for the exploration of unique DNA sequences for evaluation, identification, comparison and analysis. Keywords: BioLABs, Blunt ends, Genomics, NEB cutter, Restriction digestion, Short DNA sequences, Sticky ends
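
    The AT/GC content analysis described above amounts to simple base counting. A minimal sketch (not the authors' code):

        def at_gc_content(seq):
            """Return the AT and GC fractions of a DNA sequence (case-insensitive)."""
            seq = seq.upper()
            n = len(seq)
            at = (seq.count("A") + seq.count("T")) / n
            gc = (seq.count("G") + seq.count("C")) / n
            return at, gc

        print(at_gc_content("ATGCGCGTAT"))  # (0.5, 0.5)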

  15. Harvard Aging Brain Study: Dataset and accessibility.

    Science.gov (United States)

    Dagley, Alexander; LaPoint, Molly; Huijbers, Willem; Hedden, Trey; McLaren, Donald G; Chatwal, Jasmeer P; Papp, Kathryn V; Amariglio, Rebecca E; Blacker, Deborah; Rentz, Dorene M; Johnson, Keith A; Sperling, Reisa A; Schultz, Aaron P

    2017-01-01

    The Harvard Aging Brain Study is sharing its data with the global research community. The longitudinal dataset consists of a 284-subject cohort with the following modalities acquired: demographics, clinical assessment, comprehensive neuropsychological testing, clinical biomarkers, and neuroimaging. To promote more extensive analyses, imaging data was designed to be compatible with other publicly available datasets. A cloud-based system enables access to interested researchers with blinded data available contingent upon completion of a data usage agreement and administrative approval. Data collection is ongoing and currently in its fifth year. Copyright © 2015 Elsevier Inc. All rights reserved.

  16. Random Coefficient Logit Model for Large Datasets

    NARCIS (Netherlands)

    C. Hernández-Mireles (Carlos); D. Fok (Dennis)

    2010-01-01

    We present an approach for analyzing market shares and product price elasticities based on large datasets containing aggregate sales data for many products, several markets and relatively long time periods. We consider the recently proposed Bayesian approach of Jiang et al [Jiang,

  17. Thesaurus Dataset of Educational Technology in Chinese

    Science.gov (United States)

    Wu, Linjing; Liu, Qingtang; Zhao, Gang; Huang, Huan; Huang, Tao

    2015-01-01

    The thesaurus dataset of educational technology is a knowledge description of educational technology in Chinese. The aims of this thesaurus were to collect the subject terms in the domain of educational technology, facilitate the standardization of terminology and promote the communication between Chinese researchers and scholars from various…

  18. The Role of Datasets on Scientific Influence within Conflict Research

    Science.gov (United States)

    Van Holt, Tracy; Johnson, Jeffery C.; Moates, Shiloh; Carley, Kathleen M.

    2016-01-01

    We inductively tested if a coherent field of inquiry in human conflict research emerged in an analysis of published research involving “conflict” in the Web of Science (WoS) over a 66-year period (1945–2011). We created a citation network that linked the 62,504 WoS records and their cited literature. We performed a critical path analysis (CPA), a specialized social network analysis, on this citation network (~1.5 million works) to highlight the main contributions in conflict research and to test if research on conflict has in fact evolved to represent a coherent field of inquiry. Out of this vast dataset, 49 academic works were highlighted by the CPA, suggesting a coherent field of inquiry; this means that researchers in the field acknowledge seminal contributions and share a common knowledge base. Other conflict concepts that were also analyzed, such as interpersonal conflict or conflict among pharmaceuticals, did not form their own CP. A single path formed, meaning that there was a cohesive set of ideas that built upon previous research. This is in contrast to a main path analysis of conflict from 1957–1971, where ideas didn't persist, in that multiple paths existed and died or emerged, reflecting a lack of scientific coherence (Carley, Hummon, and Harty, 1993). The critical path consisted of a number of key features: 1) Concepts that built throughout include the notion that resource availability drives conflict, which emerged in the 1960s–1990s and continued on until 2011. More recent intrastate studies that focused on inequalities emerged from earlier interstate studies on the democratic peace. 2) Recent research on the path focused on forecasting conflict, which depends on well-developed metrics and theories to model. 3) We used keyword analysis to independently show how the CP was topically linked (i.e., through democracy, modeling, resources, and geography). Publicly available conflict datasets developed early on helped

  1. Sharing Video Datasets in Design Research

    DEFF Research Database (Denmark)

    Christensen, Bo; Abildgaard, Sille Julie Jøhnk

    2017-01-01

    This paper examines how design researchers, design practitioners and design education can benefit from sharing a dataset. We present the Design Thinking Research Symposium 11 (DTRS11) as an exemplary project that implied sharing video data of design processes and design activity in natural settings... with a large group of fellow academics from the international community of Design Thinking Research, for the purpose of facilitating research collaboration and communication within the field of Design and Design Thinking. This approach emphasizes the social and collaborative aspects of design research, where... a multitude of appropriate perspectives and methods may be utilized in analyzing and discussing the singular dataset. The shared data is, from this perspective, understood as a design object in itself, which facilitates new ways of working, collaborating, studying, learning and educating within the expanding...

  2. Automatic processing of multimodal tomography datasets.

    Science.gov (United States)

    Parsons, Aaron D; Price, Stephen W T; Wadeson, Nicola; Basham, Mark; Beale, Andrew M; Ashton, Alun W; Mosselmans, J Frederick W; Quinn, Paul D

    2017-01-01

    With the development of fourth-generation high-brightness synchrotrons on the horizon, the already large volume of data collected on imaging and mapping beamlines is set to increase by orders of magnitude. As such, an easy and accessible way of dealing with such large datasets as quickly as possible is required in order to be able to address the core scientific problems during the experimental data collection. Savu is an accessible and flexible big data processing framework that is able to deal with both the variety and the volume of multimodal and multidimensional scientific datasets, such as those output by chemical tomography experiments on the I18 microfocus scanning beamline at Diamond Light Source.

  3. Interpolation of diffusion weighted imaging datasets

    DEFF Research Database (Denmark)

    Dyrby, Tim B; Lundell, Henrik; Burke, Mark W

    2014-01-01

    Diffusion weighted imaging (DWI) is used to study white-matter fibre organisation, orientation and structural connectivity by means of fibre reconstruction algorithms and tractography. For clinical settings, limited scan time compromises the possibilities to achieve the high image resolution needed for finer anatomical details and the signal-to-noise ratio needed for reliable fibre reconstruction. We assessed the potential benefits of interpolating DWI datasets to a higher image resolution before fibre reconstruction using a diffusion tensor model. Simulations of straight and curved crossing tracts smaller than or equal... interpolation methods fail to disentangle fine anatomical details if PVE is too pronounced in the original data. As for validation, we used ex-vivo DWI datasets acquired at various image resolutions as well as Nissl-stained sections. Increasing the image resolution by a factor of eight yielded finer geometrical...

  4. Data assimilation and model evaluation experiment datasets

    Science.gov (United States)

    Lai, Chung-Cheng A.; Qian, Wen; Glenn, Scott M.

    1994-01-01

    The Institute for Naval Oceanography, in cooperation with Naval Research Laboratories and universities, executed the Data Assimilation and Model Evaluation Experiment (DAMEE) for the Gulf Stream region during fiscal years 1991-1993. Enormous effort has gone into the preparation of several high-quality and consistent datasets for model initialization and verification. This paper describes the preparation process, the temporal and spatial scopes, the contents, the structure, etc., of these datasets. The goal of DAMEE and the need for data in the four phases of the experiment are briefly stated. The preparation of the DAMEE datasets consisted of a series of processes: (1) collection of observational data; (2) analysis and interpretation; (3) interpolation using the Optimum Thermal Interpolation System package; (4) quality control and re-analysis; and (5) data archiving and software documentation. The data products from these processes included a time series of 3D fields of temperature and salinity, 2D fields of surface dynamic height and mixed-layer depth, an analysis of the Gulf Stream and rings system, and bathythermograph profiles. To date, these are the most detailed and high-quality data for mesoscale ocean modeling, data assimilation, and forecasting research. Feedback from ocean modeling groups who tested the data was incorporated into its refinement. Suggested DAMEE data usages include (1) ocean modeling and data assimilation studies, (2) diagnostic and theoretical studies, and (3) comparisons with locally detailed observations.

  5. A hybrid organic-inorganic perovskite dataset

    Science.gov (United States)

    Kim, Chiho; Huan, Tran Doan; Krishnan, Sridevi; Ramprasad, Rampi

    2017-05-01

    Hybrid organic-inorganic perovskites (HOIPs) have been attracting a great deal of attention due to their versatility of electronic properties and fabrication methods. We prepare a dataset of 1,346 HOIPs, which features 16 organic cations, 3 group-IV cations and 4 halide anions. Using a combination of an atomic structure search method and density functional theory calculations, the optimized structures, the bandgap, the dielectric constant, and the relative energies of the HOIPs are uniformly prepared and validated by comparing with relevant experimental and/or theoretical data. We make the dataset available at Dryad Digital Repository, NoMaD Repository, and Khazana Repository (http://khazana.uconn.edu/), hoping that it could be useful for future data-mining efforts that can explore possible structure-property relationships and phenomenological models. Progressive extension of the dataset is expected as new organic cations become appropriate within the HOIP framework, and as additional properties are calculated for the new compounds found.

  6. Quantifying uncertainty in observational rainfall datasets

    Science.gov (United States)

    Lennard, Chris; Dosio, Alessandro; Nikulin, Grigory; Pinto, Izidine; Seid, Hussen

    2015-04-01

    The CO-ordinated Regional Downscaling Experiment (CORDEX) has to date seen the publication of at least ten journal papers that examine the African domain during 2012 and 2013. Five of these papers consider Africa generally (Nikulin et al. 2012, Kim et al. 2013, Hernandes-Dias et al. 2013, Laprise et al. 2013, Panitz et al. 2013) and five have regional foci: Tramblay et al. (2013) on Northern Africa, Mariotti et al. (2014) and Gbobaniyi et al. (2013) on West Africa, Endris et al. (2013) on East Africa and Kalagnoumou et al. (2013) on southern Africa. There are also a further three papers that the authors know about under review. These papers all use observed rainfall and/or temperature data to evaluate/validate the regional model output and often proceed to assess projected changes in these variables due to climate change in the context of these observations. The most popular reference rainfall data used are the CRU, GPCP, GPCC, TRMM and UDEL datasets. However, as Kalagnoumou et al. (2013) point out, there are many other rainfall datasets available for consideration, for example, CMORPH, FEWS, TAMSAT & RIANNAA, TAMORA and the WATCH & WATCH-DEI data. They, with others (Nikulin et al. 2012, Sylla et al. 2012), show that the observed datasets can have a very wide spread at a particular space-time coordinate. As more ground-, space- and reanalysis-based rainfall products become available, all of which use different methods to produce precipitation data, the selection of reference data is becoming an important factor in model evaluation. A number of factors can contribute to uncertainty in terms of the reliability and validity of the datasets, such as radiance conversion algorithms, the quantity and quality of available station data, interpolation techniques and the blending methods used to combine satellite and gauge based products. However, to date no comprehensive study has been performed to evaluate the uncertainty in these observational datasets. We assess 18 gridded

  7. Developing a Data-Set for Stereopsis

    Directory of Open Access Journals (Sweden)

    D.W Hunter

    2014-08-01

    Full Text Available Current research on binocular stereopsis in humans and non-human primates has been limited by a lack of available data-sets. Current data-sets fall into two categories: stereo-image sets with vergence but no ranging information (Hibbard, 2008, Vision Research, 48(12), 1427-1439) or combinations of depth information with binocular images and video taken from cameras in fixed fronto-parallel configurations exhibiting neither vergence nor focus effects (Hirschmuller & Scharstein, 2007, IEEE Conf. Computer Vision and Pattern Recognition). The techniques for generating depth information are also imperfect. Depth information is normally inaccurate or simply missing near edges and on partially occluded surfaces. For many areas of vision research these are the most interesting parts of the image (Goutcher, Hunter, Hibbard, 2013, i-Perception, 4(7), 484; Scarfe & Hibbard, 2013, Vision Research). Using state-of-the-art open-source ray-tracing software (PBRT) as a back-end, our intention is to release a set of tools that will allow researchers in this field to generate artificial binocular stereoscopic data-sets. Although not as realistic as photographs, computer-generated images have significant advantages in terms of control over the final output, and ground-truth information about scene depth is easily calculated at all points in the scene, even partially occluded areas. While individual researchers have been developing similar stimuli by hand for many decades, we hope that our software will greatly reduce the time and difficulty of creating naturalistic binocular stimuli. Our intention in making this presentation is to elicit feedback from the vision community about what sort of features would be desirable in such software.
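
    One benefit of rendered data-sets is that ground-truth depth converts directly to disparity. A minimal sketch for the simple fronto-parallel pinhole case (the relation no longer holds once the cameras verge, which is precisely the effect this data-set is designed to capture); the focal length and baseline values are illustrative:

        import numpy as np

        def depth_to_disparity(depth, focal_px, baseline):
            """Horizontal disparity (pixels) for a fronto-parallel pinhole stereo pair."""
            return focal_px * baseline / np.maximum(depth, 1e-9)

        depth = np.array([[1.0, 2.0], [4.0, 8.0]])  # metres, per pixel
        print(depth_to_disparity(depth, focal_px=800.0, baseline=0.065))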

  8. Quality Controlling CMIP datasets at GFDL

    Science.gov (United States)

    Horowitz, L. W.; Radhakrishnan, A.; Balaji, V.; Adcroft, A.; Krasting, J. P.; Nikonov, S.; Mason, E. E.; Schweitzer, R.; Nadeau, D.

    2017-12-01

    As GFDL makes the switch from model development to production in light of the Coupled Model Intercomparison Project (CMIP), GFDL's efforts have shifted to testing and, more importantly, establishing guidelines and protocols for quality control and semi-automated data publishing. Every CMIP cycle introduces key challenges, and the upcoming CMIP6 is no exception. The new CMIP experimental design comprises multiple MIPs facilitating research in different focus areas. This paradigm has implications not only for the groups that develop the models and conduct the runs, but also for the groups that monitor, analyze and quality control the datasets before data publishing, i.e., before their knowledge makes its way into reports like the IPCC (Intergovernmental Panel on Climate Change) Assessment Reports. In this talk, we discuss some of the paths taken at GFDL to quality control the CMIP-ready datasets, including Jupyter notebooks, PrePARE, and a LAMP (Linux, Apache, MySQL, PHP/Python/Perl) technology-driven tracker system to monitor the status of experiments qualitatively and quantitatively and to provide additional metadata and analysis services, along with some built-in controlled-vocabulary validations in the workflow. In addition to this, we also discuss the integration of community-based model evaluation software (ESMValTool, PCMDI Metrics Package, and ILAMB) as part of our CMIP6 workflow.
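
    A controlled-vocabulary validation of the kind mentioned above can be pictured with a minimal sketch. The attribute names and allowed values below are illustrative stand-ins, not the official CMIP6 CV, and the real workflow uses dedicated tools such as PrePARE rather than code like this.

        import netCDF4

        # Hypothetical controlled vocabulary; the real CMIP6 CV is published by WCRP/PCMDI.
        CMIP6_CV = {
            "frequency": {"mon", "day", "fx"},
            "experiment_id": {"historical", "piControl", "ssp585"},
        }

        def check_global_attributes(path):
            """Flag required global attributes that are missing or outside the CV."""
            problems = []
            with netCDF4.Dataset(path) as nc:
                for attr, allowed in CMIP6_CV.items():
                    value = getattr(nc, attr, None)
                    if value is None:
                        problems.append(f"missing attribute: {attr}")
                    elif value not in allowed:
                        problems.append(f"{attr}={value!r} not in controlled vocabulary")
            return problems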

  9. Integrated remotely sensed datasets for disaster management

    Science.gov (United States)

    McCarthy, Timothy; Farrell, Ronan; Curtis, Andrew; Fotheringham, A. Stewart

    2008-10-01

    Video imagery can be acquired from aerial, terrestrial and marine based platforms and has been exploited for a range of remote sensing applications over the past two decades. Examples include coastal surveys using aerial video, route-corridor infrastructure surveys using vehicle-mounted video cameras, aerial surveys over forestry and agriculture, underwater habitat mapping and disaster management. Many of these video systems are based on interlaced television standards such as North America's NTSC and the European SECAM and PAL television systems, which are then recorded using various video formats. This technology has recently been employed as a front-line remote sensing technology for damage assessment post-disaster. This paper traces the development of spatial video as a remote sensing tool from the early 1980s to the present day. The background to a new spatial-video research initiative based at the National University of Ireland, Maynooth (NUIM), is described. New improvements are proposed, including low-cost encoders, easy-to-use software decoders, timing issues and interoperability. These developments will enable specialists and non-specialists to collect, process and integrate these datasets with minimal support. This integrated approach will enable decision makers to access relevant remotely sensed datasets quickly and so carry out rapid damage assessment during and post-disaster.

  10. Strontium removal jar test dataset for all figures and tables.

    Data.gov (United States)

    U.S. Environmental Protection Agency — The datasets were used to generate data to demonstrate strontium removal under various water quality and treatment conditions. This dataset is associated with the...

  11. Predicting dataset popularity for the CMS experiment

    CERN Document Server

    INSPIRE-00005122; Li, Ting; Giommi, Luca; Bonacorsi, Daniele; Wildish, Tony

    2016-01-01

    The CMS experiment at the LHC accelerator at CERN relies on its computing infrastructure to stay at the frontier of High Energy Physics, searching for new phenomena and making discoveries. Even though computing plays a significant role in physics analysis, we rarely use its data to predict the behavior of the system itself. Basic information about computing resources, user activities and site utilization can be very useful for improving the throughput of the system and its management. In this paper, we discuss a first CMS analysis of dataset popularity based on CMS meta-data, which can be used as a model for dynamic data placement and provides the foundation of a data-driven approach to the CMS computing infrastructure.
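
    As an illustration of the idea, popularity can be approximated by aggregating dataset access counts from monitoring records. The sketch below uses a naive count over recent weeks with made-up dataset names; it is not the model described in the paper.

        from collections import Counter

        def popularity(access_log, since_week=40):
            """Rank datasets by recent access counts from (dataset, week) records."""
            counts = Counter(ds for ds, week in access_log if week >= since_week)
            return counts.most_common()

        # Hypothetical access log entries: (dataset name, ISO week of access)
        log = [("/ZMM/Run2012", 39), ("/ZMM/Run2012", 41),
               ("/JetHT/Run2012", 41), ("/JetHT/Run2012", 42)]
        print(popularity(log))  # [('/JetHT/Run2012', 2), ('/ZMM/Run2012', 1)]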

  13. MIPS bacterial genomes functional annotation benchmark dataset.

    Science.gov (United States)

    Tetko, Igor V; Brauner, Barbara; Dunger-Kaltenbach, Irmtraud; Frishman, Goar; Montrone, Corinna; Fobo, Gisela; Ruepp, Andreas; Antonov, Alexey V; Surmeli, Dimitrij; Mewes, Hans-Werner

    2005-05-15

    Any development of new methods for the automatic functional annotation of proteins according to their sequences requires high-quality data (as a benchmark) as well as tedious preparatory work to generate the sequence parameters required as input data for the machine learning methods. Different program settings and incompatible protocols make a comparison of the analyzed methods difficult. The MIPS Bacterial Functional Annotation Benchmark dataset (MIPS-BFAB) is a new, high-quality resource comprising four bacterial genomes manually annotated according to the MIPS functional catalogue (FunCat). These resources include precalculated sequence parameters, such as sequence similarity scores, InterPro domain composition and other parameters that could be used to develop and benchmark methods for the functional annotation of bacterial protein sequences. These data are provided in XML format and can be used by scientists who are not necessarily experts in genome annotation. BFAB is available at http://mips.gsf.de/proj/bfab

  14. 2006 Fynmeet sea clutter measurement trial: Datasets

    CSIR Research Space (South Africa)

    Herselman, PLR

    2007-09-06

    Full Text Available [The record excerpt consists of figure residue: plots of RCS [dBm2] versus time and absolute range at f1 = 9.000 GHz for datasets CAD14-001 and CAD14-002.]

  15. Utilizing the Antarctic Master Directory to find orphan datasets

    Science.gov (United States)

    Bonczkowski, J.; Carbotte, S. M.; Arko, R. A.; Grebas, S. K.

    2011-12-01

    While most Antarctic data are housed at an established disciplinary-specific data repository, there are data types for which no suitable repository exists. In some cases, these "orphan" data, without an appropriate national archive, are served from local servers by the principal investigators who produced the data. There are many pitfalls with data served privately, including the frequent lack of adequate documentation to ensure the data can be understood by others for re-use, and the impermanence of personal web sites. For example, if an investigator leaves an institution and the data moves, the published link is no longer accessible. To ensure continued availability of data, submission to long-term national data repositories is needed. As stated in the National Science Foundation Office of Polar Programs (NSF/OPP) Guidelines and Award Conditions for Scientific Data, investigators are obligated to submit their data for curation and long-term preservation; this includes the registration of a dataset description into the Antarctic Master Directory (AMD), http://gcmd.nasa.gov/Data/portals/amd/. The AMD is a Web-based, searchable directory of thousands of dataset descriptions, known as DIF records, submitted by scientists from over 20 countries. It serves as a node of the International Directory Network/Global Change Master Directory (IDN/GCMD). The US Antarctic Program Data Coordination Center (USAP-DCC), http://www.usap-data.org/, funded through NSF/OPP, was established in 2007 to help streamline the process of data submission and DIF record creation. When data do not quite fit within any existing disciplinary repository, they can be registered within the USAP-DCC as the fallback data repository. Within the scope of the USAP-DCC we undertook the challenge of discovering and "rescuing" orphan datasets currently registered within the AMD. In order to find which DIF records led to data served privately, all records relating to US data within the AMD were parsed. After

  16. A new bed elevation dataset for Greenland

    Directory of Open Access Journals (Sweden)

    J. L. Bamber

    2013-03-01

    Full Text Available We present a new bed elevation dataset for Greenland derived from a combination of multiple airborne ice thickness surveys undertaken between the 1970s and 2012. Around 420 000 line kilometres of airborne data were used, with roughly 70% of this having been collected since the year 2000, when the last comprehensive compilation was undertaken. The airborne data were combined with satellite-derived elevations for non-glaciated terrain to produce a consistent bed digital elevation model (DEM) over the entire island, including across the glaciated–ice-free boundary. The DEM was extended to the continental margin with the aid of bathymetric data, primarily from a compilation for the Arctic. Ice thickness was determined where an ice shelf exists from a combination of surface elevation and radar soundings. The across-track spacing between flight lines warranted interpolation at 1 km postings for significant sectors of the ice sheet. Grids of ice surface elevation, error estimates for the DEM, ice thickness and data sampling density were also produced alongside a mask of land/ocean/grounded ice/floating ice. Errors in bed elevation range from a minimum of ±10 m to about ±300 m, as a function of distance from an observation and local topographic variability. A comparison with the compilation published in 2001 highlights the improvement in resolution afforded by the new datasets, particularly along the ice sheet margin, where ice velocity is highest and changes in ice dynamics most marked. We estimate that the volume of ice included in our land-ice mask would raise mean sea level by 7.36 m, excluding any solid earth effects that would take place during ice sheet decay.

  17. fCCAC: functional canonical correlation analysis to evaluate covariance between nucleic acid sequencing datasets.

    Science.gov (United States)

    Madrigal, Pedro

    2017-03-01

    Computational evaluation of variability across DNA or RNA sequencing datasets is a crucial step in genomic science, as it allows both to evaluate reproducibility of biological or technical replicates, and to compare different datasets to identify their potential correlations. Here we present fCCAC, an application of functional canonical correlation analysis to assess covariance of nucleic acid sequencing datasets such as chromatin immunoprecipitation followed by deep sequencing (ChIP-seq). We show how this method differs from other measures of correlation, and exemplify how it can reveal shared covariance between histone modifications and DNA binding proteins, such as the relationship between the H3K4me3 chromatin mark and its epigenetic writers and readers. An R/Bioconductor package is available at http://bioconductor.org/packages/fCCAC/ . pmb59@cam.ac.uk. Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press.

  18. Wind Integration National Dataset Toolkit | Grid Modernization | NREL

    Science.gov (United States)

    The Wind Integration National Dataset (WIND) Toolkit is an update and expansion of the Eastern Wind Integration Data Set and Western Wind Integration Data Set. It supports the next generation of wind integration studies.

  19. Solar Integration National Dataset Toolkit | Grid Modernization | NREL

    Science.gov (United States)

    NREL is working on a Solar Integration National Dataset (SIND) Toolkit to enable researchers to perform U.S. regional solar generation integration studies. It will provide modeled, coherent subhourly solar power data.

  20. Technical note: An inorganic water chemistry dataset (1972–2011 ...

    African Journals Online (AJOL)

    A national dataset of inorganic chemical data of surface waters (rivers, lakes, and dams) in South Africa is presented and made freely available. The dataset comprises more than 500 000 complete water analyses from 1972 up to 2011, collected from more than 2 000 sample monitoring stations in South Africa. The dataset ...

  1. QSAR ligand dataset for modelling mutagenicity, genotoxicity, and rodent carcinogenicity

    Directory of Open Access Journals (Sweden)

    Davy Guan

    2018-04-01

    Full Text Available Five datasets were constructed from ligand and bioassay result data from the literature. These datasets include bioassay results from the Ames mutagenicity assay, the GreenScreen GADD-45a-GFP assay, the Syrian Hamster Embryo (SHE) assay, and 2-year rat carcinogenicity assay results. These datasets provide information about chemical mutagenicity, genotoxicity and carcinogenicity.

  2. Accuracy assessment of seven global land cover datasets over China

    Science.gov (United States)

    Yang, Yongke; Xiao, Pengfeng; Feng, Xuezhi; Li, Haixing

    2017-03-01

    Land cover (LC) is a vital foundation of Earth science. Up to now, several global LC datasets have arisen through the efforts of many scientific communities. To provide guidelines for data usage over China, nine LC maps from seven global LC datasets (IGBP DISCover, UMD, GLC, MCD12Q1, GLCNMO, CCI-LC, and GlobeLand30) were evaluated in this study. First, we compared their similarities and discrepancies in both area and spatial patterns, and analysed their inherent relations to data sources and classification schemes and methods. Next, five sets of validation sample units (VSUs) were collected to calculate their accuracy quantitatively. Further, we built a spatial analysis model and depicted their spatial variation in accuracy based on the five sets of VSUs. The results show that there are evident discrepancies among these LC maps in both area and spatial patterns. For LC maps produced by different institutes, GLC 2000 and CCI-LC 2000 have the highest overall spatial agreement (53.8%). For LC maps produced by the same institutes, the overall spatial agreement of CCI-LC 2000 and 2010, and of MCD12Q1 2001 and 2010, reaches up to 99.8% and 73.2%, respectively; however, more efforts are still needed if we hope to use these LC maps as time-series data for model input, since both CCI-LC and MCD12Q1 fail to represent the rapidly changing trends of several key LC classes in the early 21st century, in particular urban and built-up, snow and ice, water bodies, and permanent wetlands. With the highest spatial resolution, the overall accuracy of GlobeLand30 2010 is 82.39%. For the other six LC datasets with coarse resolution, CCI-LC 2010/2000 has the highest overall accuracy, followed by MCD12Q1 2010/2001, GLC 2000, GLCNMO 2008, IGBP DISCover, and UMD in turn. While all maps exhibit high accuracy in homogeneous regions, local accuracies in other regions are quite different, particularly in the Farming-Pastoral Zone of North China, the mountains of Northeast China, and the Southeast Hills. Special

  3. Medical Image Data and Datasets in the Era of Machine Learning-Whitepaper from the 2016 C-MIMI Meeting Dataset Session.

    Science.gov (United States)

    Kohli, Marc D; Summers, Ronald M; Geis, J Raymond

    2017-08-01

    At the first annual Conference on Machine Intelligence in Medical Imaging (C-MIMI), held in September 2016, a conference session on medical image data and datasets for machine learning identified multiple issues. The common theme from attendees was that everyone participating in medical image evaluation with machine learning is data-starved. There is an urgent need to find better ways to collect, annotate, and reuse medical imaging data. Unique domain issues with medical image datasets require further study, development, and dissemination of best practices and standards, and a coordinated effort among medical imaging domain experts, medical imaging informaticists, government and industry data scientists, and interested commercial, academic, and government entities. High-level attributes of reusable medical image datasets suitable to train, test, validate, verify, and regulate ML products should be better described. NIH and other government agencies should promote and, where applicable, enforce access to medical image datasets. We should improve communication among medical imaging domain experts, medical imaging informaticists, academic clinical and basic science researchers, government and industry data scientists, and interested commercial entities.

  4. BDML Datasets: 7 [SSBD Archive]

    Lifescience Database Archive (English)

    Full Text Available Kobayashi, T.J.; Bashar, Md. Khayrul (The University of Tokyo, Institute of Industrial Science, Laboratory for Quantitative Biology). See details in Bashar et al. (2012) PLoS ONE 7, e35550. CC BY-NC-SA.

  5. WHO guidelines on good agricultural and collection practices (GACP) for medicinal plants

    National Research Council Canada - National Science Library

    Simon, James E; Fong, Harry H.S; Regalado, Jacinto

    2003-01-01

    ... Consultation on Good Agricultural and Field Collection Practices for Medicinal Plants, held in Geneva, Switzerland in July 2003 to review the draft guidelines (see Annex 6), and to the experts who participated in the WHO Working Group Meeting held in Geneva, Switzerland in October 2003, to review and revise the draft guidelines. Acknowledg...

  6. Statistical segmentation of multidimensional brain datasets

    Science.gov (United States)

    Desco, Manuel; Gispert, Juan D.; Reig, Santiago; Santos, Andres; Pascau, Javier; Malpica, Norberto; Garcia-Barreno, Pedro

    2001-07-01

    This paper presents an automatic segmentation procedure for MRI neuroimages that overcomes some of the problems involved in multidimensional clustering techniques, such as partial volume effects (PVE), processing speed and the difficulty of incorporating a priori knowledge. The method is a three-stage procedure: 1) Exclusion of background and skull voxels using threshold-based region growing techniques with fully automated seed selection. 2) Expectation Maximization algorithms are used to estimate the probability density function (PDF) of the remaining pixels, which are assumed to be mixtures of Gaussians. These pixels can then be classified into cerebrospinal fluid (CSF), white matter and grey matter. Using this procedure, our method takes advantage of using the full covariance matrix (instead of the diagonal) for the joint PDF estimation; on the other hand, logistic discrimination techniques are more robust against violations of the multi-Gaussian assumption. 3) A priori knowledge is added using Markov Random Field techniques. The algorithm has been tested with a dataset of 30 brain MRI studies (co-registered T1 and T2 MRI). Our method was compared with clustering techniques and with template-based statistical segmentation, using manual segmentation as a gold standard. Our results were more robust and closer to the gold standard.
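
    The second stage, EM estimation of a full-covariance Gaussian mixture over joint T1/T2 intensities, can be sketched with scikit-learn as below. This is an illustration, not the authors' implementation, and it omits the logistic discrimination and MRF stages; component order is arbitrary and would be mapped to tissue types afterwards (e.g., by mean intensity).

        import numpy as np
        from sklearn.mixture import GaussianMixture

        def classify_tissues(t1, t2, brain_mask):
            """EM fit of a 3-class full-covariance GMM to joint (T1, T2) intensities."""
            features = np.column_stack([t1[brain_mask], t2[brain_mask]])
            gmm = GaussianMixture(n_components=3, covariance_type="full",
                                  random_state=0).fit(features)
            labels = np.full(t1.shape, -1)              # -1 = background/skull
            labels[brain_mask] = gmm.predict(features)  # 0/1/2: unordered tissue classes
            return labels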

  7. ASSESSING SMALL SAMPLE WAR-GAMING DATASETS

    Directory of Open Access Journals (Sweden)

    W. J. HURLEY

    2013-10-01

    Full Text Available One of the fundamental problems faced by military planners is the assessment of changes to force structure. An example is whether to replace an existing capability with an enhanced system. This can be done directly with a comparison of measures such as accuracy, lethality, survivability, etc. However, this approach does not allow an assessment of the force-multiplier effects of the proposed change. To gauge these effects, planners often turn to war-gaming. For many war-gaming experiments it is expensive, both in terms of time and dollars, to generate a large number of sample observations. This puts a premium on the statistical methodology used to examine these small datasets. In this paper we compare the power of three tests to assess population differences: the Wald-Wolfowitz test, the Mann-Whitney U test, and resampling. We employ a series of Monte Carlo simulation experiments. Not unexpectedly, we find that the Mann-Whitney test performs better than the Wald-Wolfowitz test. Resampling is judged to perform slightly better than the Mann-Whitney test.
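
    Power comparisons of this kind are straightforward to reproduce in outline. The sketch below estimates, by Monte Carlo, the power of the Mann-Whitney U test to detect a one-standard-deviation location shift with small samples; the normal distributions and sample size are illustrative choices, not those of the paper.

        import numpy as np
        from scipy.stats import mannwhitneyu

        def mw_power(n=8, shift=1.0, trials=2000, alpha=0.05, seed=0):
            """Monte Carlo power of Mann-Whitney U for a location shift, small samples."""
            rng = np.random.default_rng(seed)
            rejections = 0
            for _ in range(trials):
                a = rng.normal(0.0, 1.0, n)
                b = rng.normal(shift, 1.0, n)
                _, p = mannwhitneyu(a, b, alternative="two-sided")
                rejections += p < alpha
            return rejections / trials

        print(mw_power())  # roughly 0.4-0.5 for n=8 per group and a one-SD shift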

  8. Inquiring with Geoscience Datasets: Instruction and Assessment

    Science.gov (United States)

    Zalles, D.; Quellmalz, E.; Gobert, J.

    2005-12-01

    This session will describe a new NSF-funded project in geoscience education, Inquiring with Geoscience Data Sets. The goals of the project are to (1) Study the impacts on student learning of Web-based supplementary curriculum modules that engage secondary-level students in inquiry projects addressing important geoscience problems using an Earth System Science approach. Students will use technologies to access real data sets in the geosciences and to interpret, analyze, and communicate findings based on the data sets. The standards addressed will include geoscience concepts, inquiry abilities in the NSES and Benchmarks for Science Literacy, data literacy, NCTM standards, and 21st-century skills and technology proficiencies (NETTS/ISTE). (2) Develop design principles, specification templates, and prototype exemplars for technology-based performance assessments that provide evidence of students' geoscientific knowledge and inquiry skills (including data literacy skills) and students' ability to access, use, analyze, and interpret technology-based geoscience data sets. (3) Develop scenarios based on the specification templates that describe curriculum modules and performance assessments that could be developed for other Earth Science standards and curriculum programs. Also to be described in the session are the project's efforts to differentiate among the dimensions of data literacy and scientific inquiry that are relevant for the geoscience disciplines, and how recognition and awareness of the differences can be effectively channelled for the betterment of geoscience education.

  9. On the visualization of water-related big data: extracting insights from drought proxies' datasets

    Science.gov (United States)

    Diaz, Vitali; Corzo, Gerald; van Lanen, Henny A. J.; Solomatine, Dimitri

    2017-04-01

    Big data is a growing area of science where hydroinformatics can benefit largely. There have been a number of important developments in the area of data science aimed at the analysis of large datasets. Such datasets related to water include measurements, simulations, reanalyses, scenario analyses and proxies. By convention, information contained in these databases is referenced to a specific time and a space (i.e., longitude/latitude). This work is motivated by the need to extract insights from large water-related datasets, i.e., transforming large amounts of data into useful information that helps to better understand water-related phenomena, particularly drought. In this context, data visualization, part of data science, involves techniques to create and to communicate data by encoding it as visual graphical objects. They may help to better understand data and detect trends. Based on existing methods of data analysis and visualization, this work aims to develop tools for visualizing large water-related datasets. These tools were developed on top of existing data-visualization libraries as a group of graphs that includes both polar area diagrams (PADs) and radar charts (RDs). In both graphs, time steps are represented by the polar angles and the percentages of area in drought by the radii. For illustration, three large datasets of drought proxies were chosen to identify trends, prone areas and the spatio-temporal variability of drought in a set of case studies. The datasets are (1) SPI-TS2p1 (1901-2002, 11.7 GB), (2) SPI-PRECL0p5 (1948-2016, 7.91 GB) and (3) SPEI-baseV2.3 (1901-2013, 15.3 GB). All of them are on a monthly basis and with a spatial resolution of 0.5 degrees. The first two were retrieved from the repository of the International Research Institute for Climate and Society (IRI). They are included in the Analyses Standardized Precipitation Index (SPI) project (iridl.ldeo.columbia.edu/SOURCES/.IRI/.Analyses/.SPI/). The third dataset was
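
    A polar area diagram of this kind assigns one sector per time step and encodes the percent of area in drought as the radius. A minimal matplotlib sketch with illustrative numbers (not values from the proxy datasets):

        import numpy as np
        import matplotlib.pyplot as plt

        months = np.arange(12)
        area_pct = np.array([5, 8, 12, 20, 35, 40, 38, 30, 22, 15, 10, 6])  # illustrative

        theta = months * 2 * np.pi / 12  # one sector per month
        ax = plt.subplot(projection="polar")
        ax.bar(theta, area_pct, width=2 * np.pi / 12, bottom=0.0, alpha=0.6)
        ax.set_xticks(theta)
        ax.set_xticklabels(["J", "F", "M", "A", "M", "J", "J", "A", "S", "O", "N", "D"])
        ax.set_title("Percent area in drought (illustrative)")
        plt.show()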

  10. Validation of Long-Term Global Aerosol Climatology Project Optical Thickness Retrievals Using AERONET and MODIS Data

    Science.gov (United States)

    Geogdzhayev, Igor V.; Mishchenko, Michael I.

    2015-01-01

    A comprehensive set of monthly mean aerosol optical thickness (AOT) data from coastal and island AErosol RObotic NETwork (AERONET) stations is used to evaluate Global Aerosol Climatology Project (GACP) retrievals for the period 1995-2009 during which contemporaneous GACP and AERONET data were available. To put the GACP performance in broader perspective, we also compare AERONET and MODerate resolution Imaging Spectroradiometer (MODIS) Aqua level-2 data for 2003-2009 using the same methodology. We find that a large mismatch in geographic coverage exists between the satellite and ground-based datasets, with very limited AERONET coverage of open-ocean areas. This is especially true of GACP because of the smaller number of AERONET stations at the early stages of the network development. Monthly mean AOTs from the two over-the-ocean satellite datasets are well-correlated with the ground-based values, the correlation coefficients being 0.81-0.85 for GACP and 0.74-0.79 for MODIS. Regression analyses demonstrate that the GACP mean AOTs are approximately 17%-27% lower than the AERONET values on average, while the MODIS mean AOTs are 5%-25% higher. The regression coefficients are highly dependent on the weighting assumptions (e.g., on the measure of aerosol variability) as well as on the set of AERONET stations used for comparison. Comparison of over-the-land and over-the-ocean MODIS monthly mean AOTs in the vicinity of coastal AERONET stations reveals a significant bias. This may indicate that aerosol amounts in coastal locations can differ significantly from those in adjacent open-ocean areas. Furthermore, the color of coastal waters and peculiarities of coastline meteorological conditions may introduce biases in the GACP AOT retrievals. We conclude that the GACP and MODIS over-the-ocean retrieval algorithms show similar ranges of discrepancy when compared to available coastal and island AERONET stations. The factors mentioned above may limit the performance of the

  11. A global dataset of sub-daily rainfall indices

    Science.gov (United States)

    Fowler, H. J.; Lewis, E.; Blenkinsop, S.; Guerreiro, S.; Li, X.; Barbero, R.; Chan, S.; Lenderink, G.; Westra, S.

    2017-12-01

    It is still uncertain how hydrological extremes will change with global warming as we do not fully understand the processes that cause extreme precipitation under current climate variability. The INTENSE project is using a novel and fully-integrated data-modelling approach to provide a step-change in our understanding of the nature and drivers of global precipitation extremes and change on societally relevant timescales, leading to improved high-resolution climate model representation of extreme rainfall processes. The INTENSE project is in conjunction with the World Climate Research Programme (WCRP)'s Grand Challenge on 'Understanding and Predicting Weather and Climate Extremes' and the Global Water and Energy Exchanges Project (GEWEX) Science questions. A new global sub-daily precipitation dataset has been constructed (data collection is ongoing). Metadata for each station has been calculated, detailing record lengths, missing data, station locations. A set of global hydroclimatic indices have been produced based upon stakeholder recommendations including indices that describe maximum rainfall totals and timing, the intensity, duration and frequency of storms, frequency of storms above specific thresholds and information about the diurnal cycle. This will provide a unique global data resource on sub-daily precipitation whose derived indices will be freely available to the wider scientific community.
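
    Indices of this kind are simple aggregations over a sub-daily series. The sketch below computes an annual maximum one-hour total, a count of hours above a threshold, and a mean diurnal cycle from a synthetic hourly series with pandas; it follows the spirit, not the exact definitions, of the INTENSE indices.

        import numpy as np
        import pandas as pd

        rng = np.random.default_rng(1)
        hours = pd.date_range("2000-01-01", "2001-12-31 23:00", freq="h")
        rain = pd.Series(rng.gamma(0.05, 2.0, len(hours)), index=hours)  # synthetic mm/h

        annual_max_1h = rain.groupby(rain.index.year).max()          # Rx1hour per year
        hours_over_10 = (rain > 10.0).groupby(rain.index.year).sum() # threshold exceedances
        diurnal_cycle = rain.groupby(rain.index.hour).mean()         # mean intensity by hour
        print(annual_max_1h, hours_over_10, sep="\n")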

  12. Challenges and Experiences of Building Multidisciplinary Datasets across Cultures

    Science.gov (United States)

    Jamiyansharav, K.; Laituri, M.; Fernandez-Gimenez, M.; Fassnacht, S. R.; Venable, N. B. H.; Allegretti, A. M.; Reid, R.; Baival, B.; Jamsranjav, C.; Ulambayar, T.; Linn, S.; Angerer, J.

    2017-12-01

    Efficient data sharing and management are key challenges to multidisciplinary scientific research. These challenges are further complicated by adding a multicultural component. We address the construction of a complex database for social-ecological analysis in Mongolia. Funded by the National Science Foundation (NSF) Dynamics of Coupled Natural and Human (CNH) Systems program, the Mongolian Rangelands and Resilience (MOR2) project focuses on the vulnerability of Mongolian pastoral systems to climate change and their adaptive capacity. The MOR2 study spans over three years of fieldwork in 36 paired districts (Soum) from 18 provinces (Aimag) of Mongolia, covering the steppe, mountain forest steppe, desert steppe and eastern steppe ecological zones. Our project team is composed of hydrologists, social scientists, geographers, and ecologists. The MOR2 database includes multiple ecological, social, meteorological, geospatial and hydrological datasets, as well as archives of original data and surveys in multiple formats. Managing this complex database requires significant organizational skill, attention to detail and the ability to communicate among team members from diverse disciplines and across multiple institutions in the US and Mongolia. We describe the database's rich content, organization, structure and complexity. We discuss lessons learned, best practices and recommendations for complex database management, sharing, and archiving in creating a cross-cultural and multi-disciplinary database.

  13. The Amateurs' Love Affair with Large Datasets

    Science.gov (United States)

    Price, Aaron; Jacoby, S. H.; Henden, A.

    2006-12-01

    Amateur astronomers are professionals in other areas. They bring expertise from such varied and technical careers as computer science, mathematics, engineering, and marketing. These skills, coupled with an enthusiasm for astronomy, can be used to help manage the large data sets coming online in the next decade. We will show specific examples where teams of amateurs have been involved in mining large, online data sets and have authored and published their own papers in peer-reviewed astronomical journals. Using the proposed LSST database as an example, we will outline a framework for involving amateurs in data analysis and education with large astronomical surveys.

  14. The Dataset of Countries at Risk of Electoral Violence

    OpenAIRE

    Birch, Sarah; Muchlinski, David

    2017-01-01

    Electoral violence is increasingly affecting elections around the world, yet researchers have been limited by a paucity of granular data on this phenomenon. This paper introduces and describes a new dataset of electoral violence – the Dataset of Countries at Risk of Electoral Violence (CREV) – that provides measures of 10 different types of electoral violence across 642 elections held around the globe between 1995 and 2013. The paper provides a detailed account of how and why the dataset was ...

  15. Norwegian Hydrological Reference Dataset for Climate Change Studies

    Energy Technology Data Exchange (ETDEWEB)

    Magnussen, Inger Helene; Killingland, Magnus; Spilde, Dag

    2012-07-01

    Based on the Norwegian hydrological measurement network, NVE has selected a Hydrological Reference Dataset for studies of hydrological change. The dataset meets international standards with high data quality. It is suitable for monitoring and studying the effects of climate change on the hydrosphere and cryosphere in Norway. The dataset includes streamflow, groundwater, snow, glacier mass balance and length change, lake ice and water temperature in rivers and lakes.(Author)

  16. BIA Indian Lands Dataset (Indian Lands of the United States)

    Data.gov (United States)

    Federal Geographic Data Committee — The American Indian Reservations / Federally Recognized Tribal Entities dataset depicts feature location, selected demographics and other associated data for the 561...

  17. Socioeconomic Data and Applications Center (SEDAC) Treaty Status Dataset

    Data.gov (United States)

    National Aeronautics and Space Administration — The Socioeconomic Data and Application Center (SEDAC) Treaty Status Dataset contains comprehensive treaty information for multilateral environmental agreements,...

  18. Transition to Operations Plans for GPM Datasets

    Science.gov (United States)

    Zavodsky, Bradley; Jedlovec, Gary; Case, Jonathan; Leroy, Anita; Molthan, Andrew; Bell, Jordan; Fuell, Kevin; Stano, Geoffrey

    2013-01-01

    SPoRT was founded in 2002 at the National Space Science Technology Center at Marshall Space Flight Center in Huntsville, AL. It is focused on transitioning unique NASA and NOAA observations and research capabilities to the operational weather community to improve short-term weather forecasts on regional and local scales, with funding directed by NASA and NOAA funding from the Proving Grounds (PG). SPoRT demonstrates experimental capabilities and products for weather applications and societal benefit, preparing forecasters for the use of data from the next generation of operational satellites. The objective of this poster is to highlight SPoRT's research-to-operations (R2O) paradigm and provide examples of work done by the team with legacy instruments relevant to GPM, in order to promote collaborations with groups developing GPM products.

  19. The French Muséum national d'histoire naturelle vascular plant herbarium collection dataset

    Science.gov (United States)

    Le Bras, Gwenaël; Pignal, Marc; Jeanson, Marc L.; Muller, Serge; Aupic, Cécile; Carré, Benoît; Flament, Grégoire; Gaudeul, Myriam; Gonçalves, Claudia; Invernón, Vanessa R.; Jabbour, Florian; Lerat, Elodie; Lowry, Porter P.; Offroy, Bérangère; Pimparé, Eva Pérez; Poncy, Odile; Rouhan, Germinal; Haevermans, Thomas

    2017-02-01

    We provide a quantitative description of the French national herbarium vascular plant collection dataset. Held at the Muséum national d'histoire naturelle, Paris, it currently comprises records for 5,400,000 specimens, representing 90% of the estimated total. Ninety-nine percent of the specimen entries are linked to one or more images, and 16% have field-collecting information available. This major botanical collection represents the results of over three centuries of exploration and study. The sources of the collection are global, with strong representation of France, including its overseas territories, and of former French colonies. The compilation of this dataset was made possible through numerous national and international projects, the most important of which was linked to the renovation of the herbarium building. The vascular plant collection is actively expanding today, hence the continuous growth exhibited by the dataset, which can be fully accessed through the GBIF portal or the MNHN database portal (available at: https://science.mnhn.fr/institution/mnhn/collection/p/item/search/form). This dataset is a major source of data for systematics, global plant macroecological studies, and conservation assessments.

  1. Associating uncertainty with datasets using Linked Data and allowing propagation via provenance chains

    Science.gov (United States)

    Car, Nicholas; Cox, Simon; Fitch, Peter

    2015-04-01

    With earth-science datasets increasingly being published to enable re-use in projects disassociated from the original data acquisition or generation, there is an urgent need for associated metadata to be connected, in order to guide their application. In particular, provenance traces should support the evaluation of data quality and reliability. However, while standards for describing provenance are emerging (e.g. PROV-O), these do not include the necessary statistical descriptors and confidence assessments. UncertML has a mature conceptual model that may be used to record uncertainty metadata. However, by itself UncertML does not support the representation of uncertainty of multi-part datasets, and provides no direct way of associating the uncertainty information - metadata in relation to a dataset - with dataset objects. We present a method to address both these issues by combining UncertML with PROV-O, and delivering the resulting uncertainty-enriched provenance traces through the Linked Data API. UncertProv extends the PROV-O provenance ontology with an RDF formulation of the UncertML conceptual model elements, adds further elements to support uncertainty representation without a conceptual model, and integrates UncertML through links to documents. The Linked Data API provides a systematic way of navigating from dataset objects to their UncertProv metadata and back again, and its 'views' capability enables access to UncertML and non-UncertML uncertainty metadata representations for a dataset. With this approach, it is possible to access and navigate the uncertainty metadata associated with a published dataset using standard semantic web tools, such as SPARQL queries. Where the uncertainty data follows the UncertML model it can be automatically interpreted and may also support automatic uncertainty propagation. Repositories wishing to enable uncertainty propagation for all datasets must ensure that all elements that are associated with uncertainty
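
    As a concrete illustration of the navigation described above, the sketch below uses rdflib to query a provenance graph for uncertainty metadata attached to a dataset. It is hypothetical: the up: namespace, the property names, and the input file are invented for illustration and are not the actual UncertProv vocabulary.

      import rdflib  # pip install rdflib

      g = rdflib.Graph()
      g.parse("provenance_trace.ttl", format="turtle")  # hypothetical input file

      # Hypothetical vocabulary: navigate from a dataset (a prov:Entity)
      # to an attached uncertainty description and its statistic values.
      query = """
      PREFIX prov: <http://www.w3.org/ns/prov#>
      PREFIX up:   <http://example.org/uncertprov#>

      SELECT ?dataset ?statistic ?value
      WHERE {
          ?dataset a prov:Entity ;
                   up:hasUncertainty ?u .
          ?u       up:statistic ?statistic ;
                   up:value     ?value .
      }
      """

      for row in g.query(query):
          print(row.dataset, row.statistic, row.value)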

  2. An Analysis of the GTZAN Music Genre Dataset

    DEFF Research Database (Denmark)

    Sturm, Bob L.

    2012-01-01

    Most research in automatic music genre recognition has used the dataset assembled by Tzanetakis et al. in 2001. The composition and integrity of this dataset, however, has never been formally analyzed. For the first time, we provide an analysis of its composition, and create a machine...

  3. Really big data: Processing and analysis of large datasets

    Science.gov (United States)

    Modern animal breeding datasets are large and getting larger, due in part to the recent availability of DNA data for many animals. Computational methods for efficiently storing and analyzing those data are under development. The amount of storage space required for such datasets is increasing rapidl...

  4. An Annotated Dataset of 14 Cardiac MR Images

    DEFF Research Database (Denmark)

    Stegmann, Mikkel Bille

    2002-01-01

    This note describes a dataset consisting of 14 annotated cardiac MR images. Points of correspondence are placed on each image at the left ventricle (LV). As such, the dataset can be readily used for building statistical models of shape. Further, format specifications and terms of use are given.

  5. A New Outlier Detection Method for Multidimensional Datasets

    KAUST Repository

    Abdel Messih, Mario A.

    2012-07-01

    This study develops a novel hybrid method for outlier detection (HMOD) that combines the ideas of distance-based and density-based methods. The proposed method has two main advantages over most other outlier detection methods. First, it works well on both dense and sparse datasets. Second, unlike most other outlier detection methods, which require careful parameter setting and prior knowledge of the data, HMOD is not very sensitive to small changes in parameter values within certain ranges; the only parameter that must be set is the number of nearest neighbors. In addition, we developed a fully parallelized implementation of HMOD that makes it very efficient in applications. Moreover, we propose a new way of using outlier detection for redundancy reduction in datasets, in which users can specify a confidence level evaluating how accurately the reduced dataset represents the original one. HMOD is evaluated on synthetic datasets (dense and mixed dense-and-sparse) and on a bioinformatics problem: redundancy reduction of a dataset of position weight matrices (PWMs) of transcription factor binding sites. In the process of assessing the performance of our redundancy reduction method, we also developed a simple tool for evaluating the confidence level with which a reduced dataset represents the original. The evaluation of the results shows that our method can be used in a wide range of problems.
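
    The abstract does not spell out HMOD's formulation, so the following is only a generic sketch of a hybrid distance/density outlier score built from k-nearest-neighbour distances, with the neighbour count k as the single tunable parameter; the combination rule is an assumption for illustration, not the published algorithm.

      import numpy as np

      def hybrid_outlier_scores(X: np.ndarray, k: int = 10) -> np.ndarray:
          """Score each row of X; higher means more outlying.

          Combines a distance signal (mean distance to the k nearest
          neighbours) with a density signal (ratio of a point's k-NN
          distance to that of its neighbours, as in LOF-style methods).
          """
          d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
          np.fill_diagonal(d, np.inf)
          idx = np.argsort(d, axis=1)[:, :k]           # k nearest neighbours
          knn_d = np.take_along_axis(d, idx, axis=1)   # their distances

          dist_score = knn_d.mean(axis=1)              # distance-based signal
          local = dist_score[idx].mean(axis=1)         # neighbours' mean score
          density_score = dist_score / (local + 1e-12) # density-based signal
          return dist_score * density_score            # assumed combination rule

      # Example: 200 inliers plus 5 obvious outliers.
      rng = np.random.default_rng(0)
      X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(8, 1, (5, 2))])
      print(np.argsort(hybrid_outlier_scores(X, k=10))[-5:])  # top-5 outlier indices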

  6. ATLAS File and Dataset Metadata Collection and Use

    CERN Document Server

    Albrand, S; The ATLAS collaboration; Lambert, F; Gallas, E J

    2012-01-01

    The ATLAS Metadata Interface (“AMI”) was designed as a generic cataloguing system, and as such it has found many uses in the experiment, including software release management, tracking of reconstructed event sizes, and control of dataset nomenclature. The primary use of AMI is to provide a catalogue of datasets (file collections) which is searchable using physics criteria. In this paper we discuss the various mechanisms used for filling the AMI dataset and file catalogues. By correlating information from different sources we can derive aggregate information which is important for physics analysis; for example, the total number of events contained in a dataset, and possible reasons for missing events such as a lost file. Finally, we describe some specialized interfaces which were developed for the Data Preparation and reprocessing coordinators. These interfaces manipulate information from both the dataset domain held in AMI, and the run-indexed information held in the ATLAS COMA application (Conditions and ...

  7. A dataset on tail risk of commodities markets.

    Science.gov (United States)

    Powell, Robert J; Vo, Duc H; Pham, Thach N; Singh, Abhay K

    2017-12-01

    This article contains the datasets related to the research article "The long and short of commodity tails and their relationship to Asian equity markets" (Powell et al., 2017) [1]. The datasets contain the daily prices (and price movements) of 24 different commodities decomposed from the S&P GSCI index and the daily prices (and price movements) of three share market indices (World, Asia, and South East Asia) for the period 2004-2015. The data are then divided into annual periods, showing the worst 5% of price movements for each year. The datasets are convenient for examining the tail risk of different commodities, as measured by Conditional Value at Risk (CVaR), and its changes over time. The datasets can also be used to investigate the association between commodity markets and share markets.
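
    For readers unfamiliar with the measure, Conditional Value at Risk at the 5% level is the mean of the worst 5% of price movements. A minimal sketch, assuming daily returns in a NumPy array (the numbers below are simulated, not taken from the dataset):

      import numpy as np

      def cvar(returns: np.ndarray, alpha: float = 0.05) -> float:
          """Conditional Value at Risk: mean of the worst alpha-fraction of returns."""
          var_threshold = np.quantile(returns, alpha)   # the alpha-quantile (VaR)
          tail = returns[returns <= var_threshold]      # worst alpha of observations
          return tail.mean()

      rng = np.random.default_rng(1)
      daily_returns = rng.normal(0.0003, 0.02, 252)     # one illustrative trading year
      print(f"5% CVaR: {cvar(daily_returns):.4f}")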

  8. Lunar Meteorites: A Global Geochemical Dataset

    Science.gov (United States)

    Zeigler, R. A.; Joy, K. H.; Arai, T.; Gross, J.; Korotev, R. L.; McCubbin, F. M.

    2017-01-01

    The bulk of the chapter will use examples from the lunar meteorite suite to examine important recent advances in lunar science, including (but not limited to) the following: (1) Understanding the global compositional diversity of the lunar surface; (2) Understanding the formation of the ancient lunar primary crust; (3) Understanding the diversity and timing of mantle melting and secondary crust formation; (4) Comparing KREEPy lunar meteorites to KREEPy Apollo samples as evidence of variability within the PKT; and (5) A better understanding of the South Pole-Aitken Basin through lunar meteorites whose provenance is within that Terrane.

  9. Discovery and Reuse of Open Datasets: An Exploratory Study

    Directory of Open Access Journals (Sweden)

    Sara

    2016-07-01

    Objective: This article analyzes twenty cited or downloaded datasets and the repositories that house them, in order to produce insights that can be used by academic libraries to encourage discovery and reuse of research data in institutional repositories. Methods: Using Thomson Reuters’ Data Citation Index and repository download statistics, we identified twenty cited/downloaded datasets. We documented the characteristics of the cited/downloaded datasets and their corresponding repositories in a self-designed rubric. The rubric includes six major categories: basic information; funding agency and journal information; linking and sharing; factors to encourage reuse; repository characteristics; and data description. Results: Our small-scale study suggests that cited/downloaded datasets generally comply with basic recommendations for facilitating reuse: data are documented well; formatted for use with a variety of software; and shared in established, open access repositories. Three significant factors also appear to contribute to dataset discovery: publishing in discipline-specific repositories; indexing in more than one location on the web; and using persistent identifiers. The cited/downloaded datasets in our analysis came from a few specific disciplines, and tended to be funded by agencies with data publication mandates. Conclusions: The results of this exploratory research provide insights that can inform academic librarians as they work to encourage discovery and reuse of institutional datasets. Our analysis also suggests areas in which academic librarians can target open data advocacy in their communities in order to begin to build open data success stories that will fuel future advocacy efforts.

  10. Viability of Controlling Prosthetic Hand Utilizing Electroencephalograph (EEG) Dataset Signal

    Science.gov (United States)

    Miskon, Azizi; A/L Thanakodi, Suresh; Raihan Mazlan, Mohd; Mohd Haziq Azhar, Satria; Nooraya Mohd Tawil, Siti

    2016-11-01

    This project presents the development of an artificial hand controlled by Electroencephalograph (EEG) signal datasets for prosthetic applications. The EEG signal datasets were used to improve the way the prosthetic hand is controlled, as compared with the Electromyograph (EMG). EMG is disadvantageous for a person who has not used the relevant muscles for a long time, and also for people with degenerative issues due to age; thus, EEG datasets were found to be an alternative to EMG. The datasets used in this work were taken from a Brain Computer Interface (BCI) project and were already classified for open, close, and combined movement operations. They served as input to control the prosthetic hand through an interface between Microsoft Visual Studio and Arduino. The obtained results reveal the prosthetic hand to be more efficient and faster in responding to the EEG datasets with an additional LiPo (lithium polymer) battery attached to the prosthetic. Some limitations were also identified in terms of hand movements and the weight of the prosthetic, and suggestions for improvement are given. Overall, the objective of this paper was achieved: the prosthetic hand was found to be feasible in operation utilizing the EEG datasets.

  11. Sparse Group Penalized Integrative Analysis of Multiple Cancer Prognosis Datasets

    Science.gov (United States)

    Liu, Jin; Huang, Jian; Xie, Yang; Ma, Shuangge

    2014-01-01

    In cancer research, high-throughput profiling studies have been extensively conducted, searching for markers associated with prognosis. Because of the “large d, small n” characteristic, results generated from the analysis of a single dataset can be unsatisfactory. Recent studies have shown that integrative analysis, which simultaneously analyzes multiple datasets, can be more effective than single-dataset analysis and classic meta-analysis. Most existing integrative analyses assume the homogeneity model, which postulates that different datasets share the same set of markers, and several approaches have been designed to reinforce this assumption. In practice, different datasets may differ in terms of patient selection criteria, profiling techniques, and many other aspects; such differences may make the homogeneity model too restrictive. In this study, we assume the heterogeneity model, under which different datasets are allowed to have different sets of markers. With multiple cancer prognosis datasets, we adopt the AFT (accelerated failure time) model to describe survival; this model may have the lowest computational cost among popular semiparametric survival models. For marker selection, we adopt a sparse group MCP (minimax concave penalty) approach, which has an intuitive formulation and can be computed using an effective group coordinate descent algorithm. A simulation study shows that it outperforms existing approaches under both the homogeneity and heterogeneity models. Data analysis further demonstrates the merit of the heterogeneity model and the proposed approach. PMID:23938111
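
    As background for the penalty mentioned above, the minimax concave penalty (MCP) tapers a lasso-style penalty to zero slope for large coefficients, reducing the bias a pure L1 penalty puts on strong signals. A minimal sketch of the standard (non-group) MCP function; the sparse group variant in the paper composes such penalties over groups of coefficients and is not reproduced here.

      import numpy as np

      def mcp(t: np.ndarray, lam: float, gamma: float) -> np.ndarray:
          """Minimax concave penalty, elementwise.

          p(t) = lam*|t| - t^2/(2*gamma)   for |t| <= gamma*lam
               = gamma*lam^2/2             otherwise (flat: no further shrinkage)
          """
          a = np.abs(t)
          return np.where(a <= gamma * lam,
                          lam * a - a**2 / (2.0 * gamma),
                          gamma * lam**2 / 2.0)

      coefs = np.linspace(-3, 3, 7)
      print(mcp(coefs, lam=1.0, gamma=2.0))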

  12. Uncertainty Assessment of the NASA Earth Exchange Global Daily Downscaled Climate Projections (NEX-GDDP) Dataset

    Science.gov (United States)

    Wang, Weile; Nemani, Ramakrishna R.; Michaelis, Andrew; Hashimoto, Hirofumi; Dungan, Jennifer L.; Thrasher, Bridget L.; Dixon, Keith W.

    2016-01-01

    The NASA Earth Exchange Global Daily Downscaled Projections (NEX-GDDP) dataset comprises downscaled climate projections derived from 21 General Circulation Model (GCM) runs conducted under the Coupled Model Intercomparison Project Phase 5 (CMIP5) and across two of the four greenhouse gas emissions scenarios (RCP4.5 and RCP8.5). Each of the climate projections includes daily maximum temperature, minimum temperature, and precipitation for the period from 1950 through 2100, at a spatial resolution of 0.25 degrees (approximately 25 km x 25 km). The GDDP dataset has been warmly received by the science community for conducting studies of climate change impacts at local to regional scales, but a comprehensive evaluation of its uncertainties is still missing. In this study, we apply the Perfect Model Experiment framework (Dixon et al. 2016) to quantify the key sources of uncertainty from the observational baseline dataset, the downscaling algorithm, and some intrinsic assumptions (e.g., the stationarity assumption) inherent to the statistical downscaling techniques. We developed a set of metrics to evaluate downscaling errors resulting from bias correction ("quantile mapping"), spatial disaggregation, and the temporal-spatial non-stationarity of climate variability. Our results highlight the spatial disaggregation (or interpolation) errors, which dominate the overall uncertainties of the GDDP dataset, especially over heterogeneous and complex terrain (e.g., mountains and coastal areas). In comparison, the temporal errors in the GDDP dataset tend to be more constrained. Our results also indicate that the downscaled daily precipitation has relatively larger uncertainties than the temperature fields, reflecting the rather stochastic nature of precipitation in space. These results provide insights for improving statistical downscaling algorithms and products in the future.
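
    Quantile mapping, the bias-correction step named above, replaces each model value with the observed value at the same empirical quantile. A minimal sketch under simplifying assumptions (empirical CDFs, a single variable, no seasonal stratification):

      import numpy as np

      def quantile_map(model_hist, obs_hist, model_future):
          """Bias-correct model_future using the historical model-to-obs quantile map."""
          # Empirical quantile of each future value within the historical model data.
          q = np.searchsorted(np.sort(model_hist), model_future) / len(model_hist)
          q = np.clip(q, 0.0, 1.0)
          # Map that quantile onto the observed historical distribution.
          return np.quantile(obs_hist, q)

      rng = np.random.default_rng(2)
      model_hist = rng.normal(14.0, 3.0, 5000)   # model runs cold...
      obs_hist = rng.normal(16.0, 2.5, 5000)     # ...relative to observations
      model_future = rng.normal(15.0, 3.0, 10)
      print(quantile_map(model_hist, obs_hist, model_future))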

  13. PROVIDING GEOGRAPHIC DATASETS AS LINKED DATA IN SDI

    Directory of Open Access Journals (Sweden)

    E. Hietanen

    2016-06-01

    In this study, a prototype service to provide data from a Web Feature Service (WFS) as linked data is implemented. First, persistent and unique Uniform Resource Identifiers (URIs) are created for all spatial objects in the dataset. The objects are available from those URIs in the Resource Description Framework (RDF) data format. Next, a Web Ontology Language (OWL) ontology is created to describe the dataset's information content using the Open Geospatial Consortium's (OGC) GeoSPARQL vocabulary. The existing data model is modified to take the linked data principles into account. The implemented service produces an HTTP response dynamically: the data for the response is first fetched from the existing WFS, and the Geographic Markup Language (GML) output of the WFS is then transformed on the fly to the RDF format. Content negotiation is used to serve the data in different RDF serialization formats. This solution facilitates the use of a dataset in different applications without replicating the whole dataset. In addition, individual spatial objects in the dataset can be referred to with URIs, and the needed information content of the objects can easily be extracted from the RDF serializations available from those URIs. A solution for linking data objects to the dataset URI is also introduced, using the Vocabulary of Interlinked Datasets (VoID): the dataset is divided into subsets, and each subset is given its own persistent and unique URI. This enables the whole dataset to be explored with a web browser and all individual objects to be indexed by search engines.

  14. Homogenised Australian climate datasets used for climate change monitoring

    International Nuclear Information System (INIS)

    Trewin, Blair; Jones, David; Collins, Dean; Jovanovic, Branislava; Braganza, Karl

    2007-01-01

    The Australian Bureau of Meteorology has developed a number of datasets for use in climate change monitoring. These datasets typically cover 50-200 stations distributed as evenly as possible over the Australian continent, and have been subject to detailed quality control and homogenisation. The time period over which data are available for each element is largely determined by the availability of data in digital form. Whilst nearly all Australian monthly and daily precipitation data have been digitised, a significant quantity of pre-1957 data (for temperature and evaporation) or pre-1987 data (for some other elements) remains to be digitised, and is not currently available for use in the climate change monitoring datasets. In the case of temperature and evaporation, the start date of the datasets is also determined by major changes in instruments or observing practices for which no adjustment is feasible at the present time. The datasets currently available cover: Monthly and daily precipitation (most stations commence 1915 or earlier, with many extending back to the late 19th century, and a few to the mid-19th century); Annual temperature (commences 1910); Daily temperature (commences 1910, with limited station coverage pre-1957); Twice-daily dewpoint/relative humidity (commences 1957); Monthly pan evaporation (commences 1970); Cloud amount (commences 1957) (Jovanovic et al. 2007). As well as the station-based datasets listed above, an additional dataset being developed for use in climate change monitoring (and other applications) covers tropical cyclones in the Australian region. This is described in more detail in Trewin (2007). The datasets already developed are used in analyses of observed climate change, which are available through the Australian Bureau of Meteorology website (http://www.bom.gov.au/silo/products/cli_chg/). They are also used as a basis for routine climate monitoring, and in the datasets used for the development of seasonal

  15. Tension in the recent Type Ia supernovae datasets

    International Nuclear Information System (INIS)

    Wei, Hao

    2010-01-01

    In the present work, we investigate the tension in the recent Type Ia supernovae (SNIa) datasets Constitution and Union. We show that they are in tension not only with the observations of the cosmic microwave background (CMB) anisotropy and the baryon acoustic oscillations (BAO), but also with other SNIa datasets such as Davis and SNLS. Then, we find the main sources responsible for the tension. Further, we make this more robust by employing the method of random truncation. Based on the results of this work, we suggest two truncated versions of the Union and Constitution datasets, namely the UnionT and ConstitutionT SNIa samples, whose behaviors are more regular.

  16. Background qualitative analysis of the European reference life cycle database (ELCD) energy datasets - part II: electricity datasets.

    Science.gov (United States)

    Garraín, Daniel; Fazio, Simone; de la Rúa, Cristina; Recchioni, Marco; Lechón, Yolanda; Mathieux, Fabrice

    2015-01-01

    The aim of this paper is to identify areas of potential improvement in the European Reference Life Cycle Database (ELCD) electricity datasets. The revision is based on the data quality indicators described by the International Life Cycle Data system (ILCD) Handbook, applied on a sectorial basis. These indicators evaluate the technological, geographical and time-related representativeness of the dataset and its appropriateness in terms of completeness, precision and methodology. Results show that the ELCD electricity datasets are of very good quality in general terms; nevertheless, some findings and recommendations for improving the quality of the Life-Cycle Inventories have been derived. Moreover, these results confirm the quality of the electricity-related datasets for any LCA practitioner, and provide insights into the limitations and assumptions underlying the datasets' modelling. Given this information, the LCA practitioner will be able to decide whether the use of the ELCD electricity datasets is appropriate to the goal and scope of the analysis to be conducted. The methodological approach would also be useful for dataset developers and reviewers, in order to improve the overall Data Quality Requirements of databases.

  17. Dataset definition for CMS operations and physics analyses

    Science.gov (United States)

    Franzoni, Giovanni; Compact Muon Solenoid Collaboration

    2016-04-01

    Data recorded at the CMS experiment are funnelled into streams, integrated in the HLT menu, and further organised in a hierarchical structure of primary datasets and secondary datasets/dedicated skims. Datasets are defined according to the final-state particles reconstructed by the high level trigger, the data format and the use case (physics analysis, alignment and calibration, performance studies). During the first LHC run, new workflows were added to this canonical scheme to exploit at best the flexibility of the CMS trigger and data acquisition systems. The concepts of data parking and data scouting were introduced to extend the physics reach of CMS, offering the opportunity of defining physics triggers with extremely loose selections (e.g. a dijet resonance trigger collecting data at 1 kHz). In this presentation, we review the evolution of the dataset definition during LHC run I, and we discuss the plans for run II.

  18. U.S. Climate Divisional Dataset (Version Superseded)

    Data.gov (United States)

    National Oceanic and Atmospheric Administration, Department of Commerce — This data has been superseded by a newer version of the dataset. Please refer to NOAA's Climate Divisional Database for more information. The U.S. Climate Divisional...

  19. Karna Particle Size Dataset for Tables and Figures

    Data.gov (United States)

    U.S. Environmental Protection Agency — This dataset contains 1) table of bulk Pb-XAS LCF results, 2) table of bulk As-XAS LCF results, 3) figure data of particle size distribution, and 4) figure data for...

  20. NOAA Global Surface Temperature Dataset, Version 4.0

    Data.gov (United States)

    National Oceanic and Atmospheric Administration, Department of Commerce — The NOAA Global Surface Temperature Dataset (NOAAGlobalTemp) is derived from two independent analyses: the Extended Reconstructed Sea Surface Temperature (ERSST)...

  1. National Hydrography Dataset (NHD) - USGS National Map Downloadable Data Collection

    Data.gov (United States)

    U.S. Geological Survey, Department of the Interior — The USGS National Hydrography Dataset (NHD) Downloadable Data Collection from The National Map (TNM) is a comprehensive set of digital spatial data that encodes...

  2. Watershed Boundary Dataset (WBD) - USGS National Map Downloadable Data Collection

    Data.gov (United States)

    U.S. Geological Survey, Department of the Interior — The Watershed Boundary Dataset (WBD) from The National Map (TNM) defines the perimeter of drainage areas formed by the terrain and other landscape characteristics....

  3. BASE MAP DATASET, LE FLORE COUNTY, OKLAHOMA, USA

    Data.gov (United States)

    Federal Emergency Management Agency, Department of Homeland Security — Basemap datasets comprise six of the seven FGDC themes of geospatial data that are used by most GIS applications (Note: the seventh framework theme, orthographic...

  4. USGS National Hydrography Dataset from The National Map

    Data.gov (United States)

    U.S. Geological Survey, Department of the Interior — USGS The National Map - National Hydrography Dataset (NHD) is a comprehensive set of digital spatial data that encodes information about naturally occurring and...

  5. A robust dataset-agnostic heart disease classifier from Phonocardiogram.

    Science.gov (United States)

    Banerjee, Rohan; Dutta Choudhury, Anirban; Deshpande, Parijat; Bhattacharya, Sakyajit; Pal, Arpan; Mandana, K M

    2017-07-01

    Automatic classification of normal and abnormal heart sounds is a popular area of research. However, building a robust algorithm unaffected by signal quality and patient demographics is a challenge. In this paper we analyse a wide range of Phonocardiogram (PCG) features in the time and frequency domains, along with morphological and statistical features, to construct a robust and discriminative feature set for dataset-agnostic classification of normal and cardiac patients. The large, open-access database made available in the PhysioNet 2016 challenge was used for feature selection, internal validation, and creation of training models. A second dataset of 41 PCG segments, collected using our in-house smartphone-based digital stethoscope at an Indian hospital, was used for performance evaluation. Our proposed methodology yielded sensitivity and specificity scores of 0.76 and 0.75 respectively on the test dataset in classifying cardiovascular diseases. The methodology also outperformed three popular prior-art approaches when applied to the same dataset.

  6. AFSC/REFM: Seabird Necropsy dataset of North Pacific

    Data.gov (United States)

    National Oceanic and Atmospheric Administration, Department of Commerce — The seabird necropsy dataset contains information on seabird specimens that were collected under salvage and scientific collection permits primarily by...

  7. Dataset definition for CMS operations and physics analyses

    CERN Document Server

    AUTHOR|(CDS)2051291

    2016-01-01

    Data recorded at the CMS experiment are funnelled into streams, integrated in the HLT menu, and further organised in a hierarchical structure of primary datasets, secondary datasets, and dedicated skims. Datasets are defined according to the final-state particles reconstructed by the high level trigger, the data format and the use case (physics analysis, alignment and calibration, performance studies). During the first LHC run, new workflows were added to this canonical scheme to exploit at best the flexibility of the CMS trigger and data acquisition systems. The concepts of data parking and data scouting have been introduced to extend the physics reach of CMS, offering the opportunity of defining physics triggers with extremely loose selections (e.g. a dijet resonance trigger collecting data at 1 kHz). In this presentation, we review the evolution of the dataset definition during the first run, and we discuss the plans for the second LHC run.

  8. USGS National Boundary Dataset (NBD) Downloadable Data Collection

    Data.gov (United States)

    U.S. Geological Survey, Department of the Interior — The USGS Governmental Unit Boundaries dataset from The National Map (TNM) represents major civil areas for the Nation, including States or Territories, counties (or...

  9. Environmental Dataset Gateway (EDG) CS-W Interface

    Data.gov (United States)

    U.S. Environmental Protection Agency — Use the Environmental Dataset Gateway (EDG) to find and access EPA's environmental resources. Many options are available for easily reusing EDG content in other...

  10. Global Man-made Impervious Surface (GMIS) Dataset From Landsat

    Data.gov (United States)

    National Aeronautics and Space Administration — The Global Man-made Impervious Surface (GMIS) Dataset From Landsat consists of global estimates of fractional impervious cover derived from the Global Land Survey...

  11. A Comparative Analysis of Classification Algorithms on Diverse Datasets

    Directory of Open Access Journals (Sweden)

    M. Alghobiri

    2018-04-01

    Data mining involves computational processes for finding patterns in large data sets. Classification, one of the main domains of data mining, involves generalizing a known structure to apply to a new dataset and predict its class. Various classification algorithms are used to classify different data sets; they are based on methods such as probability, decision trees, neural networks, nearest neighbours, Boolean and fuzzy logic, and kernel-based approaches. In this paper, we apply three diverse classification algorithms to ten datasets, selected based on their size and/or the number and nature of their attributes. Results are discussed using performance evaluation measures such as precision, accuracy, F-measure, Kappa statistics, mean absolute error, relative absolute error, and ROC area. A comparative analysis has been carried out using accuracy, precision, and F-measure, and we specify the features and limitations of the classification algorithms for datasets of diverse nature.
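
    A minimal sketch of this kind of comparison using scikit-learn; the three classifiers and the synthetic dataset are stand-ins chosen for illustration, not the algorithms and ten datasets evaluated in the paper.

      import numpy as np
      from sklearn.datasets import make_classification
      from sklearn.model_selection import cross_val_score
      from sklearn.naive_bayes import GaussianNB
      from sklearn.neighbors import KNeighborsClassifier
      from sklearn.tree import DecisionTreeClassifier

      # Stand-in dataset; the paper evaluates ten datasets of varying size/attributes.
      X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

      classifiers = {
          "naive_bayes": GaussianNB(),
          "knn": KNeighborsClassifier(n_neighbors=5),
          "decision_tree": DecisionTreeClassifier(random_state=0),
      }

      for name, clf in classifiers.items():
          # 10-fold cross-validated scores for two of the paper's measures.
          acc = cross_val_score(clf, X, y, cv=10, scoring="accuracy").mean()
          f1 = cross_val_score(clf, X, y, cv=10, scoring="f1").mean()
          print(f"{name:>14}: accuracy={acc:.3f}  F-measure={f1:.3f}")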

  12. Newton SSANTA Dr Water using POU filters dataset

    Data.gov (United States)

    U.S. Environmental Protection Agency — This dataset contains information about all the features extracted from the raw data files, the formulas that were assigned to some of these features, and the...

  13. Estimating parameters for probabilistic linkage of privacy-preserved datasets.

    Science.gov (United States)

    Brown, Adrian P; Randall, Sean M; Ferrante, Anna M; Semmens, James B; Boyd, James H

    2017-07-10

    Probabilistic record linkage is a process used to bring together person-based records from within the same dataset (de-duplication) or from disparate datasets using pairwise comparisons and matching probabilities. The linkage strategy and associated match probabilities are often estimated through investigations into data quality and manual inspection. However, as privacy-preserved datasets comprise encrypted data, such methods are not possible. In this paper, we present a method for estimating the probabilities and threshold values for probabilistic privacy-preserved record linkage using Bloom filters. Our method was tested through a simulation study using synthetic data, followed by an application using real-world administrative data. Synthetic datasets were generated with error rates from zero to 20%. Our method was used to estimate parameters (probabilities and thresholds) for de-duplication linkages, and linkage quality was determined by F-measure. Each dataset was privacy-preserved using separate Bloom filters for each field. Match probabilities were estimated using the expectation-maximisation (EM) algorithm on the privacy-preserved data, and threshold cut-off values were determined by an extension to the EM algorithm that allows linkage quality to be estimated for each possible threshold. De-duplication linkages of each privacy-preserved dataset were performed using both estimated and calculated probabilities, and linkage quality using the F-measure at the estimated threshold values was compared to the highest F-measure. Three large administrative datasets were used to demonstrate the applicability of the probability and threshold estimation technique on real-world data. Linkage of the synthetic datasets using the estimated probabilities produced an F-measure comparable to that obtained using calculated probabilities, even with up to 20% error. Linkage of the administrative datasets using estimated probabilities produced an F-measure that was higher
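
    To make the privacy-preserving step concrete: in Bloom-filter linkage a field value is commonly split into character bigrams, and each bigram sets a few positions in a bit array via hashing, so similar values yield similar bit patterns without exposing raw text. A minimal sketch under those common assumptions (the parameter choices are illustrative, not the paper's):

      import hashlib

      def bloom_encode(value: str, size: int = 256, num_hashes: int = 4) -> set[int]:
          """Encode a field value as the set of Bloom-filter bit positions."""
          padded = f"_{value.lower()}_"                      # mark word boundaries
          bigrams = [padded[i:i + 2] for i in range(len(padded) - 1)]
          bits = set()
          for gram in bigrams:
              for seed in range(num_hashes):
                  digest = hashlib.sha256(f"{seed}:{gram}".encode()).hexdigest()
                  bits.add(int(digest, 16) % size)
          return bits

      def dice_similarity(a: set[int], b: set[int]) -> float:
          """Dice coefficient on set bits; the usual comparator in Bloom linkage."""
          return 2 * len(a & b) / (len(a) + len(b))

      print(dice_similarity(bloom_encode("Johnson"), bloom_encode("Jonson")))   # high
      print(dice_similarity(bloom_encode("Johnson"), bloom_encode("Smith")))    # low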

  14. Testing the Neutral Theory of Biodiversity with Human Microbiome Datasets

    OpenAIRE

    Li, Lianwei; Ma, Zhanshan (Sam)

    2016-01-01

    The human microbiome project (HMP) has made it possible to test important ecological theories for arguably the most important ecosystem to human health: the human microbiome. The limited number of existing studies has reported conflicting evidence in the case of the neutral theory; the present study aims to comprehensively test the neutral theory with extensive HMP datasets covering all five major body sites inhabited by the human microbiome. Utilizing 7437 datasets of bacterial community samples...

  15. General Purpose Multimedia Dataset - GarageBand 2008

    DEFF Research Database (Denmark)

    Meng, Anders

    This document describes a general-purpose multimedia dataset to be used in cross-media machine learning problems. In more detail, we describe the genre taxonomy applied at http://www.garageband.com, from where the dataset was collected, and how that taxonomy has been fused into a more human-understandable taxonomy. Finally, a description of various features extracted from both the audio and text is presented.

  16. Artificial intelligence (AI) systems for interpreting complex medical datasets.

    Science.gov (United States)

    Altman, R B

    2017-05-01

    Advances in machine intelligence have created powerful capabilities in algorithms that find hidden patterns in data, classify objects based on their measured characteristics, and associate similar patients/diseases/drugs based on common features. However, artificial intelligence (AI) applications to medical data face several technical challenges: complex, heterogeneous, and noisy datasets, and the difficulty of explaining their output to users. There are also social challenges related to intellectual property, data provenance, regulatory issues, economics, and liability. © 2017 ASCPT.

  17. Literature search for intermittent rivers research using ISI Web of Science

    Data.gov (United States)

    U.S. Environmental Protection Agency — The dataset is the bibliometric information included in the ISI Web of Science database of scientific literature. Table S2 accessible from the dataset link provides...

  18. Bulk Data Movement for Climate Dataset: Efficient Data Transfer Management with Dynamic Transfer Adjustment

    International Nuclear Information System (INIS)

    Sim, Alexander; Balman, Mehmet; Williams, Dean; Shoshani, Arie; Natarajan, Vijaya

    2010-01-01

    Many scientific applications and experiments, such as high energy and nuclear physics, astrophysics, climate observation and modeling, combustion, nano-scale material sciences, and computational biology, generate extreme volumes of data spread across large numbers of files. These data sources are distributed among national and international data repositories and are shared by large numbers of geographically distributed scientists. A large portion of the data is frequently accessed, and large volumes are moved from one place to another for analysis and storage. One challenging issue in such efforts is the limited network capacity available for moving the large datasets that must be explored and managed. The Bulk Data Mover (BDM), a data transfer management tool in the Earth System Grid (ESG) community, has managed massive dataset transfers efficiently using pre-configured transfer properties in environments where network bandwidth is limited. Dynamic transfer adjustment was studied to enhance the BDM to handle significant end-to-end performance changes in dynamic network environments, as well as to steer data transfers toward a desired transfer performance. We describe the results from BDM transfer management for climate datasets, along with the transfer estimation model and results from the dynamic transfer adjustment.
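
    The abstract does not give BDM's adjustment rule, so the following is only an assumed illustration of the general idea: measure end-to-end throughput per interval and grow or shrink the number of concurrent transfer streams toward a target rate.

      def adjust_streams(current_streams: int,
                         measured_mbps: float,
                         target_mbps: float,
                         max_streams: int = 32) -> int:
          """Hypothetical feedback rule: one more stream when short of target,
          one fewer when comfortably above it (10% hysteresis band)."""
          if measured_mbps < 0.9 * target_mbps and current_streams < max_streams:
              return current_streams + 1
          if measured_mbps > 1.1 * target_mbps and current_streams > 1:
              return current_streams - 1
          return current_streams

      streams = 4
      for mbps in [500.0, 620.0, 950.0, 1300.0]:   # simulated per-interval throughput
          streams = adjust_streams(streams, mbps, target_mbps=1000.0)
          print(f"measured={mbps:7.1f} Mb/s -> streams={streams}")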

  19. EEG datasets for motor imagery brain-computer interface.

    Science.gov (United States)

    Cho, Hohyun; Ahn, Minkyu; Ahn, Sangtae; Kwon, Moonyoung; Jun, Sung Chan

    2017-07-01

    Most investigators of brain-computer interface (BCI) research believe that BCI can be achieved through induced neuronal activity from the cortex, but not by evoked neuronal activity. Motor imagery (MI)-based BCI is one of the standard concepts of BCI, in that the user can generate induced activity by imagining motor movements. However, variations in performance over sessions and subjects are too severe to overcome easily; therefore, a basic understanding and investigation of BCI performance variation is necessary to find critical evidence of performance variation. Here we present not only EEG datasets for MI BCI from 52 subjects, but also the results of a psychological and physiological questionnaire, EMG datasets, the locations of 3D EEG electrodes, and EEGs for non-task-related states. We validated our EEG datasets by using the percentage of bad trials, event-related desynchronization/synchronization (ERD/ERS) analysis, and classification analysis. After conventional rejection of bad trials, we showed contralateral ERD and ipsilateral ERS in the somatosensory area, which are well-known patterns of MI. Finally, we showed that 73.08% of datasets (38 subjects) included reasonably discriminative information. Our EEG datasets included the information necessary to determine statistical significance; they consisted of well-discriminated datasets (38 subjects) and less-discriminative datasets. These may provide researchers with opportunities to investigate human factors related to MI BCI performance variation, and may also achieve subject-to-subject transfer by using metadata, including a questionnaire, EEG coordinates, and EEGs for non-task-related states. © The Authors 2017. Published by Oxford University Press.
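
    ERD/ERS, used above to validate the datasets, is conventionally the percentage change of band power during the task relative to a pre-cue baseline. A minimal sketch under that convention; the band edges, window placement, sampling rate, and synthetic signal are illustrative only.

      import numpy as np
      from scipy.signal import butter, filtfilt

      def erd_ers(trial: np.ndarray, fs: float, band=(8.0, 13.0),
                  baseline=(0.0, 2.0), task=(3.0, 5.0)) -> float:
          """Percent ERD/ERS: (task band power - baseline band power) / baseline * 100.
          Negative values indicate ERD (desynchronization), positive values ERS."""
          b, a = butter(4, [band[0] / (fs / 2), band[1] / (fs / 2)], btype="band")
          power = filtfilt(b, a, trial) ** 2
          ref = power[int(baseline[0] * fs):int(baseline[1] * fs)].mean()
          act = power[int(task[0] * fs):int(task[1] * fs)].mean()
          return 100.0 * (act - ref) / ref

      fs = 512.0                                   # illustrative sampling rate
      t = np.arange(0, 7, 1 / fs)
      rng = np.random.default_rng(3)
      alpha = np.sin(2 * np.pi * 10 * t)           # 10 Hz mu-band component
      alpha[int(3 * fs):int(5 * fs)] *= 0.5        # attenuate during "imagery"
      trial = alpha + 0.5 * rng.standard_normal(t.size)
      print(f"ERD/ERS: {erd_ers(trial, fs):.1f}%")  # expect clearly negative (ERD)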

  20. Comparison of CORA and EN4 in-situ datasets validation methods, toward a better quality merged dataset.

    Science.gov (United States)

    Szekely, Tanguy; Killick, Rachel; Gourrion, Jerome; Reverdin, Gilles

    2017-04-01

    CORA and EN4 are both global delayed time mode validated in-situ ocean temperature and salinity datasets distributed by the Met Office (http://www.metoffice.gov.uk/) and Copernicus (www.marine.copernicus.eu). A large part of the profiles distributed by CORA and EN4 in recent years are Argo profiles from the ARGO DAC, but profiles are also extracted from the World Ocean Database and TESAC profiles from GTSPP. In the case of CORA, data coming from the EUROGOOS Regional operationnal oserving system( ROOS) operated by European institutes no managed by National Data Centres and other datasets of profiles povided by scientific sources can also be found (Sea mammals profiles from MEOP, XBT datasets from cruises ...). (EN4 also takes data from the ASBO dataset to supplement observations in the Arctic). First advantage of this new merge product is to enhance the space and time coverage at global and european scales for the period covering 1950 till a year before the current year. This product is updated once a year and T&S gridded fields are alos generated for the period 1990-year n-1. The enhancement compared to the revious CORA product will be presented Despite the fact that the profiles distributed by both datasets are mostly the same, the quality control procedures developed by the Met Office and Copernicus teams differ, sometimes leading to different quality control flags for the same profile. Started in 2016 a new study started that aims to compare both validation procedures to move towards a Copernicus Marine Service dataset with the best features of CORA and EN4 validation.A reference data set composed of the full set of in-situ temperature and salinity measurements collected by Coriolis during 2015 is used. These measurements have been made thanks to wide range of instruments (XBTs, CTDs, Argo floats, Instrumented sea mammals,...), covering the global ocean. The reference dataset has been validated simultaneously by both teams.An exhaustive comparison of the

  1. The LANDFIRE Refresh strategy: updating the national dataset

    Science.gov (United States)

    Nelson, Kurtis J.; Connot, Joel A.; Peterson, Birgit E.; Martin, Charley

    2013-01-01

    The LANDFIRE Program provides comprehensive vegetation and fuel datasets for the entire United States. As with many large-scale ecological datasets, vegetation and landscape conditions must be updated periodically to account for disturbances, growth, and natural succession. The LANDFIRE Refresh effort was the first attempt to consistently update these products nationwide. It incorporated a combination of specific systematic improvements to the original LANDFIRE National data, remote sensing based disturbance detection methods, field collected disturbance information, vegetation growth and succession modeling, and vegetation transition processes. This resulted in the creation of two complete datasets for all 50 states: LANDFIRE Refresh 2001, which includes the systematic improvements, and LANDFIRE Refresh 2008, which includes the disturbance and succession updates to the vegetation and fuel data. The new datasets are comparable for studying landscape changes in vegetation type and structure over a decadal period, and provide the most recent characterization of fuel conditions across the country. The applicability of the new layers is discussed and the effects of using the new fuel datasets are demonstrated through a fire behavior modeling exercise using the 2011 Wallow Fire in eastern Arizona as an example.

  2. Interactive visualization and analysis of multimodal datasets for surgical applications.

    Science.gov (United States)

    Kirmizibayrak, Can; Yim, Yeny; Wakid, Mike; Hahn, James

    2012-12-01

    Surgeons use information from multiple sources when making surgical decisions. These include volumetric datasets (such as CT, PET, MRI, and their variants), 2D datasets (such as endoscopic videos), and vector-valued datasets (such as computer simulations). Presenting all the information to the user in an effective manner is a challenging problem. In this paper, we present a visualization approach that displays the information from various sources in a single coherent view. The system allows the user to explore and manipulate volumetric datasets, display analysis of dataset values in local regions, combine 2D and 3D imaging modalities and display results of vector-based computer simulations. Several interaction methods are discussed: in addition to traditional interfaces including mouse and trackers, gesture-based natural interaction methods are shown to control these visualizations with real-time performance. An example of a medical application (medialization laryngoplasty) is presented to demonstrate how the combination of different modalities can be used in a surgical setting with our approach.

  3. Process mining in oncology using the MIMIC-III dataset

    Science.gov (United States)

    Prima Kurniati, Angelina; Hall, Geoff; Hogg, David; Johnson, Owen

    2018-03-01

    Process mining is a data analytics approach to discover and analyse process models based on the real activities captured in information systems. There is a growing body of literature on process mining in healthcare, including oncology, the study of cancer. In earlier work we found 37 peer-reviewed papers describing process mining research in oncology, with a regular complaint being the limited availability and accessibility of datasets with suitable information for process mining. Publicly available datasets are one option, and this paper describes the potential of MIMIC-III for process mining in oncology. MIMIC-III is a large open-access dataset of de-identified patient records. There are 134 publications listed as using the MIMIC dataset, but none of them has used process mining. The MIMIC-III dataset has 16 event tables which are potentially useful for process mining, and this paper demonstrates the opportunities to use MIMIC-III for process mining in oncology. Our research applied the L* lifecycle method to provide a worked example showing how process mining can be used to analyse cancer pathways. The results and data quality limitations are discussed, along with opportunities for further work and reflection on the value of MIMIC-III for reproducible process mining research.
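
    Process discovery starts from an event log of (case, activity, timestamp) rows; the simplest discovery artefact is the directly-follows graph, counting how often one activity immediately follows another within a case. A minimal pandas sketch on a made-up MIMIC-like extract (the table and column names are illustrative, not the actual MIMIC-III schema):

      import pandas as pd

      # Illustrative event log: one row per activity instance per patient stay.
      events = pd.DataFrame({
          "case_id":  [1, 1, 1, 2, 2, 2],
          "activity": ["admit", "chemo", "discharge", "admit", "surgery", "discharge"],
          "time": pd.to_datetime(["2019-01-01", "2019-01-03", "2019-01-10",
                                  "2019-02-01", "2019-02-02", "2019-02-15"]),
      })

      events = events.sort_values(["case_id", "time"])
      events["next_activity"] = events.groupby("case_id")["activity"].shift(-1)

      # Directly-follows counts: the edges of the simplest process model.
      dfg = (events.dropna(subset=["next_activity"])
                   .groupby(["activity", "next_activity"])
                   .size())
      print(dfg)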

  4. An integrated pan-tropical biomass map using multiple reference datasets.

    Science.gov (United States)

    Avitabile, Valerio; Herold, Martin; Heuvelink, Gerard B M; Lewis, Simon L; Phillips, Oliver L; Asner, Gregory P; Armston, John; Ashton, Peter S; Banin, Lindsay; Bayol, Nicolas; Berry, Nicholas J; Boeckx, Pascal; de Jong, Bernardus H J; DeVries, Ben; Girardin, Cecile A J; Kearsley, Elizabeth; Lindsell, Jeremy A; Lopez-Gonzalez, Gabriela; Lucas, Richard; Malhi, Yadvinder; Morel, Alexandra; Mitchard, Edward T A; Nagy, Laszlo; Qie, Lan; Quinones, Marcela J; Ryan, Casey M; Ferry, Slik J W; Sunderland, Terry; Laurin, Gaia Vaglio; Gatti, Roberto Cazzolla; Valentini, Riccardo; Verbeeck, Hans; Wijaya, Arief; Willcock, Simon

    2016-04-01

    We combined two existing datasets of vegetation aboveground biomass (AGB) (Proceedings of the National Academy of Sciences of the United States of America, 108, 2011, 9899; Nature Climate Change, 2, 2012, 182) into a pan-tropical AGB map at 1-km resolution using an independent reference dataset of field observations and locally calibrated high-resolution biomass maps, harmonized and upscaled to 14 477 1-km AGB estimates. Our data fusion approach uses bias removal and weighted linear averaging that incorporates and spatializes the biomass patterns indicated by the reference data. The method was applied independently in areas (strata) with homogeneous error patterns of the input (Saatchi and Baccini) maps, which were estimated from the reference data and additional covariates. Based on the fused map, we estimated AGB stock for the tropics (23.4 N-23.4 S) of 375 Pg dry mass, 9-18% lower than the Saatchi and Baccini estimates. The fused map also showed differing spatial patterns of AGB over large areas, with higher AGB density in the dense forest areas in the Congo basin, Eastern Amazon and South-East Asia, and lower values in Central America and in most dry vegetation areas of Africa than either of the input maps. The validation exercise, based on 2118 estimates from the reference dataset not used in the fusion process, showed that the fused map had a RMSE 15-21% lower than that of the input maps and, most importantly, nearly unbiased estimates (mean bias 5 Mg dry mass ha(-1) vs. 21 and 28 Mg ha(-1) for the input maps). The fusion method can be applied at any scale including the policy-relevant national level, where it can provide improved biomass estimates by integrating existing regional biomass maps as input maps and additional, country-specific reference datasets. © 2015 John Wiley & Sons Ltd.
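
    The fusion step described above (remove each input map's bias against reference data, then average with weights reflecting each map's error) can be illustrated in a few lines. The following is a schematic of bias removal plus inverse-variance weighting under simplifying assumptions, not the paper's stratified procedure.

      import numpy as np

      def fuse_maps(map_a, map_b, ref, mask):
          """Fuse two AGB maps: de-bias each against reference pixels (mask),
          then average with weights inversely proportional to residual variance."""
          a = map_a - (map_a[mask] - ref[mask]).mean()   # bias removal
          b = map_b - (map_b[mask] - ref[mask]).mean()
          w_a = 1.0 / np.var(a[mask] - ref[mask])        # inverse-variance weights
          w_b = 1.0 / np.var(b[mask] - ref[mask])
          return (w_a * a + w_b * b) / (w_a + w_b)

      rng = np.random.default_rng(4)
      truth = rng.uniform(50, 400, (100, 100))              # "true" AGB, Mg/ha
      map_a = truth + 28 + rng.normal(0, 40, truth.shape)   # biased, noisier input
      map_b = truth + 21 + rng.normal(0, 25, truth.shape)   # biased input
      mask = rng.random(truth.shape) < 0.01                 # sparse reference pixels
      fused = fuse_maps(map_a, map_b, truth, mask)
      print(abs(fused - truth).mean(), abs(map_a - truth).mean())  # fused error lower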

  5. Data Discovery of Big and Diverse Climate Change Datasets - Options, Practices and Challenges

    Science.gov (United States)

    Palanisamy, G.; Boden, T.; McCord, R. A.; Frame, M. T.

    2013-12-01

    Developing data search tools is a very common, but often confusing, task for most data-intensive scientific projects. These search interfaces need to be continually improved to handle the ever-increasing diversity and volume of data collections. Many aspects determine the type of search tool a project needs to provide to its user community, including the number of datasets, the amount and consistency of discovery metadata, ancillary information such as the availability of quality information and provenance, and the availability of similar datasets from other distributed sources. The Environmental Data Science and Systems (EDSS) group within the Environmental Science Division at the Oak Ridge National Laboratory has a long history of successfully managing diverse and big observational datasets for various scientific programs via data centers such as DOE's Atmospheric Radiation Measurement Program (ARM), DOE's Carbon Dioxide Information and Analysis Center (CDIAC), USGS's Core Science Analytics and Synthesis (CSAS) metadata Clearinghouse, and NASA's Distributed Active Archive Center (ORNL DAAC). This talk will showcase some of the recent developments for improving data discovery within these centers. The DOE ARM program recently developed a data discovery tool which allows users to search and discover over 4000 observational datasets. These datasets are key to research efforts related to global climate change. The ARM discovery tool features many new functions such as filtered and faceted search logic, multi-pass data selection, filtering of data based on data quality, graphical views of data quality and availability, direct access to data quality reports, and data plots. The ARM Archive also provides discovery metadata to broader metadata clearinghouses such as ESGF, IASOA, and GOS. In addition to the new interface, ARM is currently working on providing DOI metadata records to publishers such as Thomson Reuters and Elsevier. The ARM

  6. Publishing datasets with eSciDoc and panMetaDocs

    Science.gov (United States)

    Ulbricht, D.; Klump, J.; Bertelmann, R.

    2012-04-01

    publishing scientific datasets as electronic data supplements to research papers. Publication of research manuscripts already has a well-established workflow that shares junctures with other processes and involves several parties in the process of dataset publication. The activities of the author, the reviewer, the print publisher, and the data publisher have to be coordinated into a common data publication workflow. The case of data publication at GFZ Potsdam displays some specifics, e.g. the DOIDB webservice. The DOIDB is a proxy service at GFZ for DataCite [4] DOI registration and its metadata store, and it provides a local summary of the dataset DOIs registered through GFZ as a publication agent. An additional use case for the DOIDB is its ability to enrich the DataCite metadata with additional custom attributes, like a geographic reference in a DIF record. These attributes are at the moment not available in the DataCite metadata schema but would be valuable elements for the compilation of data catalogues in the earth sciences and for the dissemination of catalogue data via OAI-PMH. [1] http://www.escidoc.org , eSciDoc, FIZ Karlsruhe, Germany [2] http://panmetadocs.sf.net , panMetaDocs, GFZ Potsdam, Germany [3] http://metaworks.pangaea.de , panMetaWorks, Dr. R. Huber, MARUM, Univ. Bremen, Germany [4] http://www.datacite.org , DataCite

  8. Recent Development on the NOAA's Global Surface Temperature Dataset

    Science.gov (United States)

    Zhang, H. M.; Huang, B.; Boyer, T.; Lawrimore, J. H.; Menne, M. J.; Rennie, J.

    2016-12-01

    Global Surface Temperature (GST) is one of the most widely used indicators for climate trend and extreme analyses. A widely used GST dataset is the NOAA merged land-ocean surface temperature dataset known as NOAAGlobalTemp (formerly MLOST). NOAAGlobalTemp was recently updated from version 3.5.4 to version 4. The update includes a significant improvement in the ocean surface component (Extended Reconstructed Sea Surface Temperature, or ERSST, from version 3b to version 4), which resulted in increased temperature trends in recent decades. Since then, advancements in both the ocean component (ERSST) and the land component (GHCN-Monthly) have been made, including the inclusion of Argo float SSTs and expanded EOT modes in ERSST, and the use of the ISTI databank in GHCN-Monthly. In this presentation, we describe the impact of those improvements on the merged global temperature dataset, in terms of global trends and other aspects.

  9. Synthetic ALSPAC longitudinal datasets for the Big Data VR project.

    Science.gov (United States)

    Avraam, Demetris; Wilson, Rebecca C; Burton, Paul

    2017-01-01

    Three synthetic datasets - of observation sizes 15,000, 155,000 and 1,555,000 participants, respectively - were created by simulating eleven cardiac and anthropometric variables from nine collection ages of the ALSPAC birth cohort study. The synthetic datasets retain similar data properties to the ALSPAC study data from which they were simulated (covariance matrices, as well as the mean and variance values of the variables) without including the original data themselves or disclosing participant information. In this instance, the three synthetic datasets have been utilised in an academia-industry collaboration to build a prototype virtual reality data analysis software, but they could have a broader use in method and software development projects where sensitive data cannot be freely shared.
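    A minimal sketch of the underlying simulation idea: fit a multivariate normal to the (sensitive) originals and draw an arbitrary number of synthetic records, preserving means and covariances. The variables below are random stand-ins, not the actual ALSPAC measures, and the published datasets may have used a more elaborate procedure.

```python
# Sketch: synthetic data matching the mean and covariance of the originals,
# without releasing any real record. Variables are illustrative stand-ins.
import numpy as np

rng = np.random.default_rng(42)

# Pretend these are the real (sensitive) data: n participants x p variables.
real = rng.normal(size=(500, 3)) @ np.array([[1.0, 0.4, 0.0],
                                             [0.0, 1.0, 0.3],
                                             [0.0, 0.0, 1.0]])

mu, cov = real.mean(axis=0), np.cov(real, rowvar=False)

# Synthetic cohort of arbitrary size with matching first and second moments.
synthetic = rng.multivariate_normal(mu, cov, size=15000)

print(np.allclose(np.cov(synthetic, rowvar=False), cov, atol=0.05))
```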

  10. Dataset of transcriptional landscape of B cell early activation

    Directory of Open Access Journals (Sweden)

    Alexander S. Garruss

    2015-09-01

    Full Text Available Signaling via B cell receptors (BCRs) and Toll-like receptors (TLRs) results in activation of B cells with distinct physiological outcomes, but the transcriptional regulatory mechanisms that drive activation and distinguish these pathways remain unknown. At early time points after BCR and TLR ligand exposure, 0.5 and 2 h, RNA-seq was performed, allowing observations on rapid transcriptional changes. At 2 h, ChIP-seq was performed to allow observations on important regulatory mechanisms potentially driving transcriptional change. The dataset includes RNA-seq; ChIP-seq of control (input), RNA Pol II, H3K4me3 and H3K27me3; and a separate RNA-seq for miRNA expression, which can be found at Gene Expression Omnibus Dataset GSE61608. Here, we provide details on the experimental and analysis methods used to obtain and analyze this dataset and to examine the transcriptional landscape of B cell early activation.

  11. The Global Precipitation Climatology Project (GPCP) Combined Precipitation Dataset

    Science.gov (United States)

    Huffman, George J.; Adler, Robert F.; Arkin, Philip; Chang, Alfred; Ferraro, Ralph; Gruber, Arnold; Janowiak, John; McNab, Alan; Rudolf, Bruno; Schneider, Udo

    1997-01-01

    The Global Precipitation Climatology Project (GPCP) has released the GPCP Version 1 Combined Precipitation Data Set, a global, monthly precipitation dataset covering the period July 1987 through December 1995. The primary product in the dataset is a merged analysis incorporating precipitation estimates from low-orbit-satellite microwave data, geosynchronous-orbit-satellite infrared data, and rain gauge observations. The dataset also contains the individual input fields, a combination of the microwave and infrared satellite estimates, and error estimates for each field. The data are provided on 2.5 deg x 2.5 deg latitude-longitude global grids. Preliminary analyses show general agreement with prior studies of global precipitation and extend prior studies of El Nino-Southern Oscillation precipitation patterns. At the regional scale there are systematic differences with standard climatologies.
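    The general principle behind such merged analyses is to weight each input by its estimated error; the sketch below shows inverse-error-variance weighting for a single grid cell. This illustrates the principle only, not the actual GPCP merge algorithm, and the numbers are invented.

```python
# Sketch of inverse-error-variance weighting for combining precipitation
# estimates with differing error characteristics. Values are illustrative.
import numpy as np

# Monthly estimates (mm/day) and one-sigma errors for one 2.5 x 2.5 deg cell.
estimates = np.array([2.8, 3.4, 3.1])   # microwave, infrared, gauge
errors    = np.array([0.9, 1.2, 0.4])

weights = 1.0 / errors**2
merged = np.sum(weights * estimates) / np.sum(weights)
merged_error = np.sqrt(1.0 / np.sum(weights))

print(f"merged = {merged:.2f} +/- {merged_error:.2f} mm/day")
```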

  12. A high-resolution European dataset for hydrologic modeling

    Science.gov (United States)

    Ntegeka, Victor; Salamon, Peter; Gomes, Goncalo; Sint, Hadewij; Lorini, Valerio; Thielen, Jutta

    2013-04-01

    There is an increasing demand for large-scale hydrological models, not only for modeling the impact of climate change on water resources but also for disaster risk assessments and flood or drought early warning systems. These large-scale models need to be calibrated and verified against large amounts of observations in order to judge their capability to predict the future. However, the creation of large-scale datasets is challenging, for it requires collection, harmonization, and quality checking of large amounts of observations. For this reason, only a limited number of such datasets exist. In this work, we present a pan-European, high-resolution gridded dataset of meteorological observations (EFAS-Meteo) which was designed with the aim to drive a large-scale hydrological model. Similar European and global gridded datasets already exist, such as the HadGHCND (Caesar et al., 2006), the JRC MARS-STAT database (van der Goot and Orlandi, 2003) and the E-OBS gridded dataset (Haylock et al., 2008). However, none of those provide similarly high spatial resolution and/or a complete set of variables to force a hydrologic model. EFAS-Meteo contains daily maps of precipitation, surface temperature (mean, minimum and maximum), wind speed and vapour pressure at a spatial grid resolution of 5 x 5 km for the time period 1 January 1990 - 31 December 2011. It furthermore contains radiation, calculated with a staggered approach depending on the availability of sunshine duration, cloud cover and minimum and maximum temperature, as well as evapotranspiration (potential, bare-soil and open-water evapotranspiration). The potential evapotranspiration was calculated using the Penman-Monteith equation with the above-mentioned meteorological variables. The dataset was created as part of the development of the European Flood Awareness System (EFAS) and has been continuously updated throughout the last years. The dataset variables are used as
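    For reference, a compact sketch of daily reference evapotranspiration in the FAO-56 Penman-Monteith form, using the variable set the dataset provides. This is the standard textbook formulation, not necessarily the exact EFAS-Meteo implementation, and the sample inputs are illustrative.

```python
# FAO-56 Penman-Monteith reference evapotranspiration (mm/day), a standard
# form of the equation named above; inputs below are illustrative.
import math

def et0_fao56(t_mean, rn, u2, ea, pressure=101.3, g=0.0):
    """t_mean degC, rn net radiation MJ/m2/day, u2 wind speed m/s at 2 m,
    ea actual vapour pressure kPa, pressure kPa, g soil heat flux MJ/m2/day."""
    es = 0.6108 * math.exp(17.27 * t_mean / (t_mean + 237.3))  # sat. vapour pressure
    delta = 4098.0 * es / (t_mean + 237.3) ** 2                # slope of es curve
    gamma = 0.665e-3 * pressure                                # psychrometric constant
    num = 0.408 * delta * (rn - g) \
          + gamma * 900.0 / (t_mean + 273.0) * u2 * (es - ea)
    return num / (delta + gamma * (1.0 + 0.34 * u2))

print(round(et0_fao56(t_mean=18.0, rn=14.5, u2=2.1, ea=1.4), 2))  # ~4.3 mm/day
```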

  13. Visualization of conserved structures by fusing highly variable datasets.

    Science.gov (United States)

    Silverstein, Jonathan C; Chhadia, Ankur; Dech, Fred

    2002-01-01

    Skill, effort, and time are required to identify and visualize anatomic structures in three dimensions from radiological data. Fundamentally, automating these processes requires a technique that uses symbolic information not in the dynamic range of the voxel data. We were developing such a technique based on mutual information for automatic multi-modality image fusion (MIAMI Fuse, University of Michigan). This system previously demonstrated facility at fusing one voxel dataset with integrated symbolic structure information to a CT dataset (of different scale and resolution) from the same person. The next step of development of our technique was aimed at accommodating the variability of anatomy from patient to patient by using warping to fuse our standard dataset to arbitrary patient CT datasets. A standard symbolic information dataset was created from the full-color Visible Human Female by segmenting the liver parenchyma, portal veins, and hepatic veins and overwriting each set of voxels with a fixed color. Two arbitrarily selected patient CT scans of the abdomen were used as reference datasets. We used the warping functions in MIAMI Fuse to align the standard structure data to each patient scan. The key to successful fusion was the focused use of multiple warping control points that place themselves around the structure of interest automatically. The user assigns only a few initial control points to align the scans. Fusions 1 and 2 transformed the atlas with 27 points around the liver to CT1 and CT2 respectively. Fusion 3 transformed the atlas with 45 control points around the liver to CT1, and Fusion 4 transformed the atlas with 5 control points around the portal vein. The CT dataset is augmented with the transformed standard structure dataset, such that the warped structure masks are visualized in combination with the original patient dataset. This combined volume visualization is then rendered interactively in stereo on the ImmersaDesk in an immersive Virtual

  14. A cross-country Exchange Market Pressure (EMP) dataset.

    Science.gov (United States)

    Desai, Mohit; Patnaik, Ila; Felman, Joshua; Shah, Ajay

    2017-06-01

    The data presented in this article are related to the research article titled "An exchange market pressure measure for cross country analysis" (Patnaik et al. [1]). In this article, we present the dataset of Exchange Market Pressure (EMP) values for 139 countries along with their conversion factors, ρ (rho). Exchange Market Pressure, expressed as a percentage change in the exchange rate, measures the change in the exchange rate that would have taken place had the central bank not intervened. The conversion factor ρ can be interpreted as the change in the exchange rate associated with $1 billion of intervention. Estimates of the conversion factor ρ allow us to calculate a monthly time series of EMP for 139 countries. Additionally, the dataset contains the 68% confidence intervals (high and low values) for the point estimates of the ρ's. Using the standard errors of the estimates of the ρ's, we obtain one-sigma intervals around the mean estimates of the EMP values. These values are also reported in the dataset.
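    Under the interpretation given above, the published values combine as in the following sketch, where ρ converts $1 billion of intervention into an equivalent percentage change in the exchange rate. The sign conventions and numbers are illustrative assumptions, not taken from the dataset.

```python
# Sketch of how an EMP value combines exchange-rate change and intervention,
# given the interpretation of rho above. All numbers are illustrative.

def emp(pct_change_fx, intervention_bn, rho):
    """Exchange market pressure, in percent."""
    return pct_change_fx + rho * intervention_bn

# A month where the currency depreciated 1.2% while the central bank sold
# $2.5 bn of reserves defending it, with rho = 0.8 % per $1 bn:
print(emp(pct_change_fx=1.2, intervention_bn=2.5, rho=0.8))  # 3.2% pressure
```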

  15. Dataset of herbarium specimens of threatened vascular plants in Catalonia.

    Science.gov (United States)

    Nualart, Neus; Ibáñez, Neus; Luque, Pere; Pedrol, Joan; Vilar, Lluís; Guàrdia, Roser

    2017-01-01

    This data paper describes a dataset of specimens of Catalonian threatened vascular plants conserved in five public Catalonian herbaria (BC, BCN, HGI, HBIL and MTTE). Catalonia is an administrative region of Spain that harbours a large diversity of autochthonous plants and 199 taxa with IUCN threatened categories (EX, EW, RE, CR, EN and VU). This dataset includes 1,618 records collected from the 17th century to the present. For each specimen, the species name, locality indication, collection date, collector, ecology and revision label are recorded. More than 94% of the taxa are represented in the herbaria, which evidences the role of botanical collections as an essential source of occurrence data.

  16. A Large-Scale 3D Object Recognition dataset

    DEFF Research Database (Denmark)

    Sølund, Thomas; Glent Buch, Anders; Krüger, Norbert

    2016-01-01

    geometric groups; concave, convex, cylindrical and flat 3D object models. The object models have varying amount of local geometric features to challenge existing local shape feature descriptors in terms of descriptiveness and robustness. The dataset is validated in a benchmark which evaluates the matching...... performance of 7 different state-of-the-art local shape descriptors. Further, we validate the dataset in a 3D object recognition pipeline. Our benchmark shows as expected that local shape feature descriptors without any global point relation across the surface have a poor matching performance with flat...

  17. Traffic sign classification with dataset augmentation and convolutional neural network

    Science.gov (United States)

    Tang, Qing; Kurnianggoro, Laksono; Jo, Kang-Hyun

    2018-04-01

    This paper presents a method for traffic sign classification using a convolutional neural network (CNN). In this method, we first convert a color image to grayscale and then normalize it to the range (-1, 1) as the preprocessing step. To increase the robustness of the classification model, we apply a dataset augmentation algorithm and create new images to train the model. To avoid overfitting, we utilize a dropout module before the last fully connected layer. To assess the performance of the proposed method, the German traffic sign recognition benchmark (GTSRB) dataset is utilized. Experimental results show that the method is effective in classifying traffic signs.
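    A minimal sketch of the described preprocessing: grayscale conversion followed by normalization to (-1, 1). The luminance weights are the common ITU-R BT.601 choice, an assumption, since the paper does not specify its conversion.

```python
# Sketch of the preprocessing step: grayscale, then map [0, 255] to (-1, 1).
import numpy as np

def preprocess(rgb_image):
    """rgb_image: uint8 array (H, W, 3) -> float32 array (H, W) in [-1, 1]."""
    weights = np.array([0.299, 0.587, 0.114], dtype=np.float32)  # BT.601
    gray = rgb_image.astype(np.float32) @ weights
    return gray / 127.5 - 1.0

image = np.random.randint(0, 256, size=(32, 32, 3), dtype=np.uint8)
x = preprocess(image)
print(x.shape, float(x.min()) >= -1.0, float(x.max()) <= 1.0)
```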

  18. Towards interoperable and reproducible QSAR analyses: Exchange of datasets.

    Science.gov (United States)

    Spjuth, Ola; Willighagen, Egon L; Guha, Rajarshi; Eklund, Martin; Wikberg, Jarl Es

    2010-06-30

    QSAR is a widely used method to relate chemical structures to responses or properties based on experimental observations. Much effort has been made to evaluate and validate the statistical modeling in QSAR, but these analyses treat the dataset as fixed. An overlooked but highly important issue is the validation of the setup of the dataset, which comprises the addition of chemical structures as well as the selection of descriptors and software implementations prior to calculations. This process is hampered by the lack of standards and exchange formats in the field, making it virtually impossible to reproduce and validate analyses and drastically constraining collaborations and re-use of data. We present a step towards standardizing QSAR analyses by defining interoperable and reproducible QSAR datasets, consisting of an open XML format (QSAR-ML) which builds on an open and extensible descriptor ontology. The ontology provides an extensible way of uniquely defining descriptors for use in QSAR experiments, and the exchange format supports multiple versioned implementations of these descriptors. Hence, a dataset described by QSAR-ML makes its setup completely reproducible. We also provide a reference implementation as a set of plugins for Bioclipse which simplifies the setup of QSAR datasets, and allows for exporting in QSAR-ML as well as old-fashioned CSV formats. The implementation facilitates the addition of new descriptor implementations from locally installed software and remote Web services; the latter is demonstrated with REST and XMPP Web services. Standardized QSAR datasets open up new ways to store, query, and exchange data for subsequent analyses. QSAR-ML supports completely reproducible creation of datasets, solving the problems of defining which software components were used and their versions, and the descriptor ontology eliminates confusion regarding descriptors by defining them crisply. This makes it easy to join, extend, combine datasets and hence work collectively, but

  19. Towards interoperable and reproducible QSAR analyses: Exchange of datasets

    Directory of Open Access Journals (Sweden)

    Spjuth Ola

    2010-06-01

    Full Text Available Abstract Background QSAR is a widely used method to relate chemical structures to responses or properties based on experimental observations. Much effort has been made to evaluate and validate the statistical modeling in QSAR, but these analyses treat the dataset as fixed. An overlooked but highly important issue is the validation of the setup of the dataset, which comprises the addition of chemical structures as well as the selection of descriptors and software implementations prior to calculations. This process is hampered by the lack of standards and exchange formats in the field, making it virtually impossible to reproduce and validate analyses and drastically constraining collaborations and re-use of data. Results We present a step towards standardizing QSAR analyses by defining interoperable and reproducible QSAR datasets, consisting of an open XML format (QSAR-ML) which builds on an open and extensible descriptor ontology. The ontology provides an extensible way of uniquely defining descriptors for use in QSAR experiments, and the exchange format supports multiple versioned implementations of these descriptors. Hence, a dataset described by QSAR-ML makes its setup completely reproducible. We also provide a reference implementation as a set of plugins for Bioclipse which simplifies the setup of QSAR datasets, and allows for exporting in QSAR-ML as well as old-fashioned CSV formats. The implementation facilitates the addition of new descriptor implementations from locally installed software and remote Web services; the latter is demonstrated with REST and XMPP Web services. Conclusions Standardized QSAR datasets open up new ways to store, query, and exchange data for subsequent analyses. QSAR-ML supports completely reproducible creation of datasets, solving the problems of defining which software components were used and their versions, and the descriptor ontology eliminates confusion regarding descriptors by defining them crisply. This makes it easy to join

  20. The Wind Integration National Dataset (WIND) toolkit (Presentation)

    Energy Technology Data Exchange (ETDEWEB)

    Caroline Draxl: NREL

    2014-01-01

    Regional wind integration studies require detailed wind power output data at many locations to perform simulations of how the power system will operate under high-penetration scenarios. The wind datasets that serve as inputs into the study must realistically reflect the ramping characteristics, spatial and temporal correlations, and capacity factors of the simulated wind plants, and be time-synchronized with available load profiles. As described in this presentation, the WIND Toolkit fulfills these requirements by providing a state-of-the-art national (US) wind resource, power production and forecast dataset.

  1. Using Multiple Big Datasets and Machine Learning to Produce a New Global Particulate Dataset: A Technology Challenge Case Study

    Science.gov (United States)

    Lary, D. J.

    2013-12-01

    A BigData case study is described where multiple datasets from several satellites, high-resolution global meteorological data, social media and in-situ observations are combined using machine learning on a distributed cluster using an automated workflow. The global particulate dataset is relevant to global public health studies and would not be possible to produce without the use of the multiple big datasets, in-situ data and machine learning. To greatly reduce the development time and enhance the functionality, a high-level language capable of parallel processing has been used (Matlab). Key considerations for the system are high-speed access to the large data volumes, persistence of those volumes, and a precise process-time scheduling capability.

  2. Project Roadkill: Linking European Hare vehicle collisions with landscape-structure using datasets from citizen scientists and professionals

    Science.gov (United States)

    Stretz, Carina; Heigl, Florian; Steiner, Wolfgang; Bauer, Thomas; Suppan, Franz; Zaller, Johann G.

    2015-04-01

    Road networks can have many negative effects on wildlife. One of the most important indications of strong landscape fragmentation is roadkill, i.e. collisions between motorised vehicles and wild animals. A species that is often involved in roadkills is the European hare (Lepus europaeus). European hare populations have been in decline throughout Europe since the 1960s, and the species is classified as "potentially endangered" in the Red Data Book of Austria. It is therefore striking that in the hunting year 2013/14, 19,343 hares were killed on Austrian roads, translating to 53 hare roadkills each day, or about two per hour. We hypothesized that (I) hare-vehicle collisions occur as an aggregation of events (hotspots), (II) the surrounding landscape influences the number of roadkilled hares, and (III) roadkill data from citizen science projects and data from professionals (e.g. hunters, police) are convergent. Investigations of the landscape surrounding the scenes of accidents will be carried out using land cover data derived from Landsat satellite images. Information on roadkills is based on datasets from two different sources. One dataset stems from the citizen science project "Roadkill" (www.citizen-science.at/roadkill), where participants report roadkill findings via a web application. The second dataset is from a project where roadkill data were collected by the police and by hunters. Besides answering our research questions, the findings of this project also allow the identification of dangerous roadkill hotspots for animals and could be implemented in nature conservation actions.

  3. Using Real Datasets for Interdisciplinary Business/Economics Projects

    Science.gov (United States)

    Goel, Rajni; Straight, Ronald L.

    2005-01-01

    The workplace's global and dynamic nature allows and requires improved approaches for providing business and economics education. In this article, the authors explore ways of enhancing students' understanding of course material by using nontraditional, real-world datasets of particular interest to them. Teaching at a historically Black university,…

  4. Dataset-driven research for improving recommender systems for learning

    NARCIS (Netherlands)

    Verbert, Katrien; Drachsler, Hendrik; Manouselis, Nikos; Wolpers, Martin; Vuorikari, Riina; Duval, Erik

    2011-01-01

    Verbert, K., Drachsler, H., Manouselis, N., Wolpers, M., Vuorikari, R., & Duval, E. (2011). Dataset-driven research for improving recommender systems for learning. In Ph. Long, & G. Siemens (Eds.), Proceedings of 1st International Conference Learning Analytics & Knowledge (pp. 44-53). February,

  5. dataTEL - Datasets for Technology Enhanced Learning

    NARCIS (Netherlands)

    Drachsler, Hendrik; Verbert, Katrien; Sicilia, Miguel-Angel; Wolpers, Martin; Manouselis, Nikos; Vuorikari, Riina; Lindstaedt, Stefanie; Fischer, Frank

    2011-01-01

    Drachsler, H., Verbert, K., Sicilia, M. A., Wolpers, M., Manouselis, N., Vuorikari, R., Lindstaedt, S., & Fischer, F. (2011). dataTEL - Datasets for Technology Enhanced Learning. STELLAR Alpine Rendez-Vous White Paper. Alpine Rendez-Vous 2011 White paper collection, Nr. 13., France (2011)

  6. A dataset of forest biomass structure for Eurasia.

    Science.gov (United States)

    Schepaschenko, Dmitry; Shvidenko, Anatoly; Usoltsev, Vladimir; Lakyda, Petro; Luo, Yunjian; Vasylyshyn, Roman; Lakyda, Ivan; Myklush, Yuriy; See, Linda; McCallum, Ian; Fritz, Steffen; Kraxner, Florian; Obersteiner, Michael

    2017-05-16

    The most comprehensive dataset of in situ destructive sampling measurements of forest biomass in Eurasia has been compiled from a combination of experiments undertaken by the authors and from scientific publications. Biomass is reported as four components: live trees (stem, bark, branches, foliage, roots); understory (above- and below-ground); green forest floor (above- and below-ground); and coarse woody debris (snags, logs, dead branches of living trees and dead roots). It consists of 10,351 unique records of sample plots and 9,613 sample trees from ca. 1,200 experiments for the period 1930-2014, where there is overlap between these two datasets. The dataset also contains other forest stand parameters such as tree species composition, average age, tree height, growing stock volume, etc., when available. Such a dataset can be used for the development of models of biomass structure, biomass expansion factors, change detection in biomass structure, investigations into biodiversity and species distribution and the biodiversity-productivity relationship, as well as the assessment of the carbon pool and its dynamics, among many others.

  7. A reanalysis dataset of the South China Sea

    Science.gov (United States)

    Zeng, Xuezhi; Peng, Shiqiu; Li, Zhijin; Qi, Yiquan; Chen, Rongyu

    2014-01-01

    Ocean reanalysis provides a temporally continuous and spatially gridded four-dimensional estimate of the ocean state for a better understanding of the ocean dynamics and its spatial/temporal variability. Here we present a 19-year (1992–2010) high-resolution ocean reanalysis dataset of the upper ocean in the South China Sea (SCS) produced from an ocean data assimilation system. A wide variety of observations, including in-situ temperature/salinity profiles, ship-measured and satellite-derived sea surface temperatures, and sea surface height anomalies from satellite altimetry, are assimilated into the outputs of an ocean general circulation model using a multi-scale incremental three-dimensional variational data assimilation scheme, yielding a daily high-resolution reanalysis dataset of the SCS. Comparisons between the reanalysis and independent observations support the reliability of the dataset. The presented dataset provides the research community of the SCS an important data source for studying the thermodynamic processes of the ocean circulation and meso-scale features in the SCS, including their spatial and temporal variability. PMID:25977803

  8. Comparison of analyses of the QTLMAS XII common dataset

    DEFF Research Database (Denmark)

    Crooks, Lucy; Sahana, Goutam; de Koning, Dirk-Jan

    2009-01-01

    As part of the QTLMAS XII workshop, a simulated dataset was distributed and participants were invited to submit analyses of the data based on genome-wide association, fine mapping and genomic selection. We have evaluated the findings from the groups that reported fine mapping and genome-wide asso...

  9. The LAMBADA dataset: Word prediction requiring a broad discourse context

    NARCIS (Netherlands)

    Paperno, D.; Kruszewski, G.; Lazaridou, A.; Pham, Q.N.; Bernardi, R.; Pezzelle, S.; Baroni, M.; Boleda, G.; Fernández, R.; Erk, K.; Smith, N.A.

    2016-01-01

    We introduce LAMBADA, a dataset to evaluate the capabilities of computational models for text understanding by means of a word prediction task. LAMBADA is a collection of narrative passages sharing the characteristic that human subjects are able to guess their last word if they are exposed to the

  10. NEW WEB-BASED ACCESS TO NUCLEAR STRUCTURE DATASETS.

    Energy Technology Data Exchange (ETDEWEB)

    WINCHELL,D.F.

    2004-09-26

    As part of an effort to migrate the National Nuclear Data Center (NNDC) databases to a relational platform, a new web interface has been developed for the dissemination of the nuclear structure datasets stored in the Evaluated Nuclear Structure Data File and Experimental Unevaluated Nuclear Data List.

  11. Level-1 muon trigger performance with the full 2017 dataset

    CERN Document Server

    CMS Collaboration

    2018-01-01

    This document describes the performance of the CMS Level-1 Muon Trigger with the full dataset of 2017. Efficiency plots are included for each track finder (TF) individually and for the system as a whole. The efficiency is measured to be greater than 90% for all track finders.

  12. A Dataset for Visual Navigation with Neuromorphic Methods

    Directory of Open Access Journals (Sweden)

    Francisco eBarranco

    2016-02-01

    Full Text Available Standardized benchmarks in Computer Vision have greatly contributed to the advance of approaches to many problems in the field. If we want to enhance the visibility of event-driven vision and increase its impact, we will need benchmarks that allow comparison among different neuromorphic methods as well as comparison to Computer Vision conventional approaches. We present datasets to evaluate the accuracy of frame-free and frame-based approaches for tasks of visual navigation. Similar to conventional Computer Vision datasets, we provide synthetic and real scenes, with the synthetic data created with graphics packages, and the real data recorded using a mobile robotic platform carrying a dynamic and active pixel vision sensor (DAVIS) and an RGB+Depth sensor. For both datasets the cameras move with a rigid motion in a static scene, and the data includes the images, events, optic flow, 3D camera motion, and the depth of the scene, along with calibration procedures. Finally, we also provide simulated event data generated synthetically from well-known frame-based optical flow datasets.

  13. Evaluation of Uncertainty in Precipitation Datasets for New Mexico, USA

    Science.gov (United States)

    Besha, A. A.; Steele, C. M.; Fernald, A.

    2014-12-01

    Climate change, population growth and other factors are endangering water availability and sustainability in semiarid/arid areas, particularly in the southwestern United States. Wide spatial and temporal coverage of precipitation measurements is key for regional water budget analysis and hydrological operations, which themselves are valuable tools for water resource planning and management. Rain gauge measurements are usually reliable and accurate at a point. They measure rainfall continuously, but spatial sampling is limited. Ground-based radar and satellite remotely sensed precipitation have wide spatial and temporal coverage. However, these measurements are indirect and subject to errors because of equipment, meteorological variability, the heterogeneity of the land surface itself and lack of regular recording. This study seeks to understand precipitation uncertainty and, in doing so, lessen uncertainty propagation into hydrological applications and operations. We reviewed, compared and evaluated the TRMM (Tropical Rainfall Measuring Mission) precipitation products, NOAA's (National Oceanic and Atmospheric Administration) Global Precipitation Climatology Centre (GPCC) monthly precipitation dataset, PRISM (Parameter-elevation Regressions on Independent Slopes Model) data and data from individual climate stations including Cooperative Observer Program (COOP), Remote Automated Weather Stations (RAWS), Soil Climate Analysis Network (SCAN) and Snowpack Telemetry (SNOTEL) stations. Though not yet finalized, this study finds that the uncertainty within precipitation estimate datasets is influenced by regional topography, season, climate and precipitation rate. Ongoing work aims to further evaluate precipitation datasets based on the relative influence of these phenomena so that we can identify the optimum datasets for input to statewide water budget analysis.

  14. Dataset: Multi Sensor-Orientation Movement Data of Goats

    NARCIS (Netherlands)

    Kamminga, Jacob Wilhelm

    2018-01-01

    This is a labeled dataset. Motion data were collected from six sensor nodes that were fixed with different orientations to a collar around the neck of goats. These six sensor nodes simultaneously, with different orientations, recorded various activities performed by the goat. We recorded the

  15. A dataset of human decision-making in teamwork management

    Science.gov (United States)

    Yu, Han; Shen, Zhiqi; Miao, Chunyan; Leung, Cyril; Chen, Yiqiang; Fauvel, Simon; Lin, Jun; Cui, Lizhen; Pan, Zhengxiang; Yang, Qiang

    2017-01-01

    Today, most endeavours require teamwork by people with diverse skills and characteristics. In managing teamwork, decisions are often made under uncertainty and resource constraints. The strategies and the effectiveness of the strategies different people adopt to manage teamwork under different situations have not yet been fully explored, partially due to a lack of detailed large-scale data. In this paper, we describe a multi-faceted large-scale dataset to bridge this gap. It is derived from a game simulating complex project management processes. It presents the participants with different conditions in terms of team members' capabilities and task characteristics for them to exhibit their decision-making strategies. The dataset contains detailed data reflecting the decision situations, decision strategies, decision outcomes, and the emotional responses of 1,144 participants from diverse backgrounds. To our knowledge, this is the first dataset simultaneously covering these four facets of decision-making. With repeated measurements, the dataset may help establish baseline variability of decision-making in teamwork management, leading to more realistic decision theoretic models and more effective decision support approaches.

  16. UK surveillance: provision of quality assured information from combined datasets.

    Science.gov (United States)

    Paiba, G A; Roberts, S R; Houston, C W; Williams, E C; Smith, L H; Gibbens, J C; Holdship, S; Lysons, R

    2007-09-14

    Surveillance information is most useful when provided within a risk framework, which is achieved by presenting results against an appropriate denominator. Often the datasets are captured separately and for different purposes, and will have inherent errors and biases that can be further confounded by the act of merging. The United Kingdom Rapid Analysis and Detection of Animal-related Risks (RADAR) system contains data from several sources and provides both data extracts for research purposes and reports for wider stakeholders. Considerable efforts are made to optimise the data in RADAR during the Extraction, Transformation and Loading (ETL) process. Despite efforts to ensure data quality, the final dataset inevitably contains some data errors and biases, most of which cannot be rectified during subsequent analysis. So, in order for users to establish the 'fitness for purpose' of data merged from more than one data source, Quality Statements are produced as defined within the overarching surveillance Quality Framework. These documents detail identified data errors and biases following ETL and report construction as well as relevant aspects of the datasets from which the data originated. This paper illustrates these issues using RADAR datasets, and describes how they can be minimised.

  17. Participatory development of a minimum dataset for the Khayelitsha ...

    African Journals Online (AJOL)

    This dataset was integrated with data requirements at ... model for defining health information needs at district level. This participatory process has enabled health workers to appraise their .... of reproductive health, mental health, disability and community ... each chose a facilitator and met in between the forum meetings.

  18. Comparison of analyses of the QTLMAS XII common dataset

    DEFF Research Database (Denmark)

    Lund, Mogens Sandø; Sahana, Goutam; de Koning, Dirk-Jan

    2009-01-01

    A dataset was simulated and distributed to participants of the QTLMAS XII workshop who were invited to develop genomic selection models. Each contributing group was asked to describe the model development and validation as well as to submit genomic predictions for three generations of individuals...

  19. The NASA Subsonic Jet Particle Image Velocimetry (PIV) Dataset

    Science.gov (United States)

    Bridges, James; Wernet, Mark P.

    2011-01-01

    Many tasks in fluids engineering require prediction of the turbulence of jet flows. The present document reports the single-point statistics of velocity, mean and variance, of cold and hot jet flows. The jet velocities ranged from 0.5 to 1.4 times the ambient speed of sound, and temperatures ranged from unheated to a static temperature ratio of 2.7. Further, the report assesses the accuracy of the data, e.g., establishes uncertainties for the data. This paper covers the following five tasks: (1) Document the acquisition and processing procedures used to create the particle image velocimetry (PIV) datasets. (2) Compare the PIV data with hotwire and laser Doppler velocimetry (LDV) data published in the open literature. (3) Compare different datasets acquired at the same flow conditions in multiple tests to establish uncertainties. (4) Create a consensus dataset for a range of hot jet flows, including uncertainty bands. (5) Analyze this consensus dataset for self-consistency and compare jet characteristics to those of the open literature. The final objective was fulfilled by using the potential core length and the spread rate of the half-velocity radius to collapse the mean and turbulent velocity fields over the first 20 jet diameters.

  20. Big Data in Space Science

    OpenAIRE

    Barmby, Pauline

    2018-01-01

    It seems like “big data” is everywhere these days. In planetary science and astronomy, we’ve been dealing with large datasets for a long time. So how “big” is our data? How does it compare to the big data that a bank or an airline might have? What new tools do we need to analyze big datasets, and how can we make better use of existing tools? What kinds of science problems can we address with these? I’ll address these questions with examples including ESA’s Gaia mission, ...

  1. Multi-facetted Metadata - Describing datasets with different metadata schemas at the same time

    Science.gov (United States)

    Ulbricht, Damian; Klump, Jens; Bertelmann, Roland

    2013-04-01

    Inspired by the wish to re-use research data, much work is being done to bring data systems of the earth sciences together. Discovery metadata is disseminated to data portals to allow the building of customized indexes of catalogued dataset items. Data that were once acquired in the context of a scientific project are open for reappraisal and can now be used by scientists who were not part of the original research team. To make data re-use easier, measurement methods and measurement parameters must be documented in an application metadata schema and described in a written publication. Linking datasets to publications - as DataCite [1] does - requires again a specific metadata schema, and every new use context of the measured data may require yet another metadata schema sharing only a subset of information with the meta-information already present. To cope with the problem of metadata schema diversity in our common data repository at GFZ Potsdam, we established a solution to store file-based research data and describe them with an arbitrary number of metadata schemas. The core component of the data repository is an eSciDoc infrastructure that provides versioned container objects, called eSciDoc [2] "items". The eSciDoc content model allows assigning files to "items" and adding any number of metadata records to these "items". The eSciDoc items can be submitted, revised, and finally published, which makes the data and metadata available through the internet worldwide. GFZ Potsdam uses eSciDoc to support its scientific publishing workflow, including mechanisms for data review in peer review processes by providing temporary web links for external reviewers who do not have credentials to access the data. Based on the eSciDoc API, panMetaDocs [3] provides a web portal for data management in research projects. PanMetaDocs, which is based on panMetaWorks [4], is a PHP-based web application that allows data to be described with any XML-based schema. It uses the eSciDoc infrastructures

  2. Data Recommender: An Alternative Way to Discover Open Scientific Datasets

    Science.gov (United States)

    Klump, J. F.; Devaraju, A.; Williams, G.; Hogan, D.; Davy, R.; Page, J.; Singh, D.; Peterson, N.

    2017-12-01

    Over the past few years, institutions and government agencies have adopted policies to openly release their data, which has resulted in huge amounts of open data becoming available on the web. When trying to discover the data, users face two challenges: an overload of choice and the limitations of the existing data search tools. On the one hand, there are too many datasets to choose from, and therefore users need to spend considerable effort to find the datasets most relevant to their research. On the other hand, data portals commonly offer keyword and faceted search, which depend fully on the user queries to search and rank relevant datasets. Consequently, keyword and faceted search may return loosely related or irrelevant results, even though the results contain the query terms. They may also return highly specific results that depend more on how well the metadata was authored. They do not account well for variance in metadata due to variance in author styles and preferences. The top-ranked results may also come from the same data collection, and users are unlikely to discover new and interesting datasets. These search modes mainly suit users who can express their information needs in terms of the structure and terminology of the data portals, but may pose a challenge otherwise. The above challenges reflect that we need a solution that delivers the most relevant (i.e., similar and serendipitous) datasets to users, beyond the existing search functionalities on the portals. A recommender system is an information filtering system that presents users with relevant and interesting contents based on users' context and preferences. Delivering data recommendations to users can make data discovery easier, and as a result may enhance user engagement with the portal. We developed a hybrid data recommendation approach for the CSIRO Data Access Portal. The approach leverages existing recommendation techniques (e.g., content-based filtering and item co-occurrence) to produce
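    A minimal sketch of the item co-occurrence technique named above: datasets accessed together in the same session are scored as related. The session data and dataset identifiers are invented for illustration and are not CSIRO Data Access Portal records.

```python
# Sketch of item co-occurrence recommendation: count how often pairs of
# datasets appear in the same session, then rank co-occurring items.
from collections import Counter
from itertools import combinations

sessions = [
    ["soil-moisture-2015", "rainfall-grid", "ndvi-tiles"],
    ["soil-moisture-2015", "rainfall-grid"],
    ["ocean-temp", "ndvi-tiles"],
]

cooccur = Counter()
for s in sessions:
    for a, b in combinations(sorted(set(s)), 2):
        cooccur[(a, b)] += 1

def recommend(item, k=3):
    scores = Counter()
    for (a, b), n in cooccur.items():
        if a == item:
            scores[b] += n
        elif b == item:
            scores[a] += n
    return scores.most_common(k)

print(recommend("soil-moisture-2015"))
```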

  3. Science and Science Fiction

    Science.gov (United States)

    Oravetz, David

    2005-01-01

    This article is for teachers looking for new ways to motivate students, increase science comprehension, and understanding without using the old standard expository science textbook. This author suggests reading a science fiction novel in the science classroom as a way to engage students in learning. Using science fiction literature and language…

  4. Comparison of global 3-D aviation emissions datasets

    Directory of Open Access Journals (Sweden)

    S. C. Olsen

    2013-01-01

    Full Text Available Aviation emissions are unique among transportation emissions, e.g., from road transportation and shipping, in that they occur at higher altitudes as well as at the surface. Aviation emissions of carbon dioxide, soot, and water vapor have direct radiative impacts on the Earth's climate system, while emissions of nitrogen oxides (NOx), sulfur oxides, carbon monoxide (CO), and hydrocarbons (HC) impact air quality and climate through their effects on ozone, methane, and clouds. The most accurate estimates of the impact of aviation on air quality and climate utilize three-dimensional chemistry-climate models and gridded four-dimensional (space and time) aviation emissions datasets. We compare five available aviation emissions datasets currently and historically used to evaluate the impact of aviation on climate and air quality: NASA-Boeing 1992, NASA-Boeing 1999, QUANTIFY 2000, Aero2k 2002, and AEDT 2006, as well as aviation fuel usage estimates from the International Energy Agency. Roughly 90% of all aviation emissions are in the Northern Hemisphere and nearly 60% of all fuelburn and NOx emissions occur at cruise altitudes in the Northern Hemisphere. While these datasets were created by independent methods and are thus not strictly suitable for analyzing trends, they suggest that commercial aviation fuelburn and NOx emissions increased over the last two decades while HC emissions likely decreased and CO emissions did not change significantly. The bottom-up estimates compared here are consistently lower than International Energy Agency fuelburn statistics, although the gap is significantly smaller in the more recent datasets. Overall the emissions distributions are quite similar for fuelburn and NOx, with regional peaks over the populated land masses of North America, Europe, and East Asia. For CO and HC there are relatively larger differences. There are however some distinct differences in the altitude distribution

  5. Geoseq: a tool for dissecting deep-sequencing datasets

    Directory of Open Access Journals (Sweden)

    Homann Robert

    2010-10-01

    Full Text Available Abstract Background Datasets generated on deep-sequencing platforms have been deposited in various public repositories such as the Gene Expression Omnibus (GEO), the Sequence Read Archive (SRA) hosted by the NCBI, or the DNA Data Bank of Japan (ddbj). Despite being rich data sources, they have not been used much due to the difficulty in locating and analyzing datasets of interest. Results Geoseq http://geoseq.mssm.edu provides a new method of analyzing short reads from deep sequencing experiments. Instead of mapping the reads to reference genomes or sequences, Geoseq maps a reference sequence against the sequencing data. It is web-based, and holds pre-computed data from public libraries. The analysis reduces the input sequence to tiles and measures the coverage of each tile in a sequence library through the use of suffix arrays. The user can upload custom target sequences or use gene/miRNA names for the search and get back results as plots and spreadsheet files. Geoseq organizes the public sequencing data using a controlled vocabulary, allowing identification of relevant libraries by organism, tissue and type of experiment. Conclusions Analysis of small sets of sequences against deep-sequencing datasets, as well as identification of public datasets of interest, is simplified by Geoseq. We applied Geoseq to (a) identify differential isoform expression in mRNA-seq datasets, (b) identify miRNAs (microRNAs) in libraries and identify mature and star sequences in miRNAs, and (c) identify potentially mis-annotated miRNAs. The ease of using Geoseq for these analyses suggests its utility and uniqueness as an analysis tool.
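    The tiling idea can be sketched as follows: split the reference into fixed-length tiles and count each tile's occurrences in the reads. Geoseq does this with suffix arrays for speed; plain substring counting is used here only to keep the sketch short, and the sequences are toy examples.

```python
# Sketch of tile coverage: how often does each reference tile occur in the
# read set? (Geoseq uses suffix arrays; naive counting shown for brevity.)

def tile_coverage(reference, reads, tile_len=20):
    tiles = [reference[i:i + tile_len]
             for i in range(0, len(reference) - tile_len + 1, tile_len)]
    return {tile: sum(read.count(tile) for read in reads) for tile in tiles}

reference = "ACGTACGTTTGACCAGGTTTACGTACGTTTGACCAGGTTT"
reads = ["ACGTACGTTTGACCAGGTTT", "TTGACCAGG", "ACGTACGTTTGACCAGGTTT"]
print(tile_coverage(reference, reads))
```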

  6. Democratizing data science through data science training.

    Science.gov (United States)

    Van Horn, John Darrell; Fierro, Lily; Kamdar, Jeana; Gordon, Jonathan; Stewart, Crystal; Bhattrai, Avnish; Abe, Sumiko; Lei, Xiaoxiao; O'Driscoll, Caroline; Sinha, Aakanchha; Jain, Priyambada; Burns, Gully; Lerman, Kristina; Ambite, José Luis

    2018-01-01

    The biomedical sciences have experienced an explosion of data which promises to overwhelm many current practitioners. Without easy access to data science training resources, biomedical researchers may find themselves unable to wrangle their own datasets. In 2014, to address the challenges posed by such a data onslaught, the National Institutes of Health (NIH) launched the Big Data to Knowledge (BD2K) initiative. To this end, the BD2K Training Coordinating Center (TCC; bigdatau.org) was funded to facilitate both in-person and online learning, and open up the concepts of data science to the widest possible audience. Here, we describe the activities of the BD2K TCC and its focus on the construction of the Educational Resource Discovery Index (ERuDIte), which identifies, collects, describes, and organizes online data science materials from BD2K awardees, open online courses, and videos from scientific lectures and tutorials. ERuDIte now indexes over 9,500 resources. Given the richness of online training materials and the constant evolution of biomedical data science, computational methods applying information retrieval, natural language processing, and machine learning techniques are required - in effect, using data science to inform training in data science. In so doing, the TCC seeks to democratize novel insights and discoveries brought forth via large-scale data science training.

  7. Hydrodynamic modelling and global datasets: Flow connectivity and SRTM data, a Bangkok case study.

    Science.gov (United States)

    Trigg, M. A.; Bates, P. B.; Michaelides, K.

    2012-04-01

    The rise of globally interconnected manufacturing supply chains requires an understanding and consistent quantification of flood risk at a global scale. Flood risk is often better quantified (or at least more precisely defined) in regions where there has been an investment in comprehensive topographical data collection such as LiDAR, coupled with detailed hydrodynamic modelling. Yet in regions where these data and modelling are unavailable, the implications of flooding and the knock-on effects for global industries can be dramatic, as evidenced by the recent floods in Bangkok, Thailand. There is a growing momentum in terms of global modelling initiatives to address this lack of a consistent understanding of flood risk, and they will rely heavily on the application of available global datasets relevant to hydrodynamic modelling, such as Shuttle Radar Topography Mission (SRTM) data and its derivatives. These global datasets bring opportunities to apply consistent methodologies on an automated basis in all regions, while the use of coarser-scale datasets also brings many challenges, such as sub-grid process representation and downscaled hydrology data from global climate models. There are significant opportunities for hydrological science in helping define new, realistic and physically based methodologies that can be applied globally, as well as the possibility of gaining new insights into flood risk through analysis of the many large datasets that will be derived from this work. We use Bangkok as a case study to explore some of the issues related to using these available global datasets for hydrodynamic modelling, with particular focus on using SRTM data to represent topography. Research has shown that flow connectivity on the floodplain is an important component in the dynamics of flood flows on to and off the floodplain, and indeed within different areas of the floodplain. A lack of representation of flow connectivity, often due to data resolution limitations, means
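    To make the connectivity point concrete, here is a sketch of a D8-style flow-direction computation, the kind of logic whose output degrades when SRTM elevation errors or coarse resolution break downslope paths. The tiny DEM is invented for illustration.

```python
# Sketch of D8 flow direction: each cell drains to its steepest downslope
# neighbour, so small elevation errors can redirect or break flow paths.
import numpy as np

dem = np.array([[5.0, 4.0, 3.0],
                [4.0, 3.5, 2.0],
                [4.0, 3.0, 1.0]])

def d8_direction(dem, r, c):
    """Return (dr, dc) of the steepest downslope neighbour, or None for a pit."""
    rows, cols = dem.shape
    best, best_drop = None, 0.0
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            rr, cc = r + dr, c + dc
            if (dr, dc) == (0, 0) or not (0 <= rr < rows and 0 <= cc < cols):
                continue
            drop = (dem[r, c] - dem[rr, cc]) / (dr * dr + dc * dc) ** 0.5
            if drop > best_drop:
                best, best_drop = (dr, dc), drop
    return best

print(d8_direction(dem, 1, 1))  # central cell drains toward the low corner
```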

  8. On sample size and different interpretations of snow stability datasets

    Science.gov (United States)

    Schirmer, M.; Mitterer, C.; Schweizer, J.

    2009-04-01

    Interpretations of snow stability variations need an assessment of the stability itself, independent of the scale investigated in the study. Studies on stability variations at a regional scale have often chosen stability tests such as the Rutschblock test or combinations of various tests in order to detect differences with aspect and elevation. The question arose: how capable are such stability interpretations of supporting reliable conclusions? There are at least three possible error sources: (i) the variance of the stability test itself; (ii) the stability variance at an underlying slope scale; and (iii) the possibility that the stability interpretation is not directly related to the probability of skier triggering. Various stability interpretations have been proposed in the past that provide partly different results. We compared a subjective one based on expert knowledge with a more objective one based on a measure derived from comparing skier-triggered slopes vs. slopes that have been skied but not triggered. In this study, the uncertainties are discussed and their effects on regional-scale stability variations will be quantified in a pragmatic way. An existing dataset with very large sample sizes was revisited. This dataset contained the variance of stability at a regional scale for several situations. The stability in this dataset was determined using the subjective interpretation scheme based on expert knowledge. The question to be answered was how many measurements are needed to obtain similar results (mainly stability differences with aspect or elevation) as with the complete dataset. The optimal sample size was obtained in several ways: (i) assuming a nominal data scale, the sample size was determined with a given test, significance level and power, and by calculating the mean and standard deviation of the complete dataset. With this method it can also be determined whether the complete dataset consists of an appropriate sample size. (ii) Smaller subsets were created with similar
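    A sketch of approach (i): for a two-sample comparison (e.g., two aspect classes), the required per-group sample size follows from the chosen significance level, power, and an effect size estimated from the dataset's mean and standard deviation. The effect size below is a placeholder, and a t-test is assumed purely for illustration, since the stability scores are not strictly interval-scaled.

```python
# Sketch of a power-based sample-size calculation for detecting a stability
# difference between two groups. The effect size is a placeholder.
from statsmodels.stats.power import TTestIndPower

effect_size = 0.5   # hypothetical Cohen's d from the dataset's mean and std
n_per_group = TTestIndPower().solve_power(effect_size=effect_size,
                                          alpha=0.05, power=0.8)
print(round(n_per_group))  # ~64 stability tests per group
```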

  9. A multimodal MRI dataset of professional chess players.

    Science.gov (United States)

    Li, Kaiming; Jiang, Jing; Qiu, Lihua; Yang, Xun; Huang, Xiaoqi; Lui, Su; Gong, Qiyong

    2015-01-01

    Chess is a good model to study high-level human brain functions such as spatial cognition, memory, planning, learning and problem solving. Recent studies have demonstrated that non-invasive MRI techniques are valuable for researchers to investigate the underlying neural mechanism of playing chess. For professional chess players (e.g., chess grand masters and masters or GM/Ms), what are the structural and functional alterations due to long-term professional practice, and how these alterations relate to behavior, are largely veiled. Here, we report a multimodal MRI dataset from 29 professional Chinese chess players (most of whom are GM/Ms), and 29 age matched novices. We hope that this dataset will provide researchers with new materials to further explore high-level human brain functions.

  10. Knowledge discovery with classification rules in a cardiovascular dataset.

    Science.gov (United States)

    Podgorelec, Vili; Kokol, Peter; Stiglic, Milojka Molan; Hericko, Marjan; Rozman, Ivan

    2005-12-01

    In this paper we study an evolutionary machine learning approach to data mining and knowledge discovery based on the induction of classification rules. A method for automatic rule induction called AREX, using evolutionary induction of decision trees and automatic programming, is introduced. The proposed algorithm is applied to a cardiovascular dataset consisting of different groups of attributes which should possibly reveal the presence of some specific cardiovascular problems in young patients. A case study is presented that shows the use of AREX for the classification of patients and for discovering possible new medical knowledge from the dataset. The defined knowledge discovery loop comprises a medical expert's assessment of induced rules to drive the evolution of rule sets towards more appropriate solutions. The final result is the discovery of possible new medical knowledge in the field of pediatric cardiology.
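    AREX itself uses evolutionary induction of decision trees and automatic programming; as a simple, runnable stand-in, the sketch below extracts IF-THEN classification rules from an ordinary decision tree on a toy dataset. It illustrates rule induction generally, not the AREX algorithm or the cardiovascular data.

```python
# Stand-in sketch: read classification rules off a decision tree.
# Each root-to-leaf path is an IF-THEN rule.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(data.data,
                                                               data.target)
print(export_text(tree, feature_names=list(data.feature_names)))
```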

  11. Augmented Reality Prototype for Visualizing Large Sensors’ Datasets

    Directory of Open Access Journals (Sweden)

    Folorunso Olufemi A.

    2011-04-01

    Full Text Available This paper addresses the development of an augmented reality (AR)-based scientific visualization system prototype that supports identification, localisation, and 3D visualisation of oil leakage sensor datasets. Sensors generate significant amounts of multivariate data during normal and leak situations, which makes data exploration and visualisation daunting tasks. Therefore, a model to manage such data and enhance the computational support needed for effective exploration is developed in this paper. A challenge of this approach is to reduce data inefficiency. This paper presents a model for computing the information gain of each data attribute and determining a lead attribute. The computed lead attribute is then used for the development of an AR-based scientific visualization interface which automatically identifies, localises and visualizes all necessary data relevant to a particular selected region of interest (ROI) on the network. The necessary architectural system support and the interface requirements for such visualizations are also presented.
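    A minimal sketch of the lead-attribute computation described above: the information gain of each attribute with respect to a class label is computed, and the attribute with the highest gain is taken as the lead attribute. The toy attribute values and labels are invented for illustration.

```python
# Sketch: pick the lead attribute as the one with the highest information gain.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attr_idx, labels):
    total, n, by_value = entropy(labels), len(rows), {}
    for row, y in zip(rows, labels):
        by_value.setdefault(row[attr_idx], []).append(y)
    remainder = sum(len(ys) / n * entropy(ys) for ys in by_value.values())
    return total - remainder

rows = [("high", "low"), ("high", "high"), ("low", "low"), ("low", "high")]
labels = ["leak", "leak", "normal", "normal"]
gains = [information_gain(rows, i, labels) for i in range(2)]
print(gains.index(max(gains)))  # index of the lead attribute (here 0)
```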

  12. An integrated dataset for in silico drug discovery

    Directory of Open Access Journals (Sweden)

    Cockell Simon J

    2010-12-01

    Full Text Available Drug development is expensive and prone to failure. It is potentially much less risky and expensive to reuse a drug developed for one condition to treat a second disease than it is to develop an entirely new compound. Systematic approaches to drug repositioning are needed to increase throughput and find candidates more reliably. Here we address this need with an integrated systems biology dataset, developed using the Ondex data integration platform, for the in silico discovery of new drug repositioning candidates. We demonstrate that the information in this dataset allows known repositioning examples to be discovered. We also propose a means of automating the search for new treatment indications of existing compounds.

  13. Knowledge Evolution in Distributed Geoscience Datasets and the Role of Semantic Technologies

    Science.gov (United States)

    Ma, X.

    2014-12-01

    Knowledge evolves in geoscience, and the evolution is reflected in datasets. In a context with distributed data sources, the evolution of knowledge may cause considerable challenges to data management and re-use. For example, a short news item published in 2009 (Mascarelli, 2009) revealed the geoscience community's concern that the International Commission on Stratigraphy's change to the definition of the Quaternary may bring heavy reworking of geologic maps. Now we are in the era of the World Wide Web, and geoscience knowledge is increasingly modeled and encoded in the form of ontologies and vocabularies by using semantic technologies. Accordingly, knowledge evolution leads to a consequence called ontology dynamics. Flouris et al. (2008) summarized 10 topics of general ontology changes/dynamics, such as ontology mapping, morphism, evolution, debugging and versioning, etc. Ontology dynamics has impacts at several stages of a data life cycle and causes challenges such as requests for reworking of the extant data in a data center, semantic mismatch among data sources, differentiated understanding of the same dataset between data providers and data users, as well as error propagation in cross-discipline data discovery and re-use (Ma et al., 2014). This presentation will analyze the best practices in the geoscience community so far and summarize a few recommendations to reduce the negative impacts of ontology dynamics in a data life cycle, including communities of practice and collaboration on ontology and vocabulary building, linking data records to standardized terms, and methods for (semi-)automatic reworking of datasets using semantic technologies. References: Flouris, G., Manakanatas, D., Kondylakis, H., Plexousakis, D., Antoniou, G., 2008. Ontology change: classification and survey. The Knowledge Engineering Review 23 (2), 117-152. Ma, X., Fox, P., Rozell, E., West, P., Zednik, S., 2014. Ontology dynamics in a data life cycle: Challenges and recommendations

  14. Development of an Operational TS Dataset Production System for the Data Assimilation System

    Science.gov (United States)

    Kim, Sung Dae; Park, Hyuk Min; Kim, Young Ho; Park, Kwang Soon

    2017-04-01

    An operational TS (temperature and salinity) dataset production system was developed to provide near-real-time data to the data assimilation system periodically. It collects the latest 15 days of TS data for the northwestern Pacific area (20°N - 55°N, 110°E - 150°E), applies QC tests to the archived data and supplies them to the numerical prediction models of KIOST (Korea Institute of Ocean Science and Technology). The latest real-time TS data are collected from the Argo GDAC and the GTSPP data server every week. Argo data are downloaded from the /latest_data directory of the Argo GDAC. Because many duplicated data exist when all profile data are extracted from all Argo netCDF files, a DB system is used to avoid duplication. All metadata (float ID, location, observation date and time, etc.) of all Argo floats are stored in the database, and a Matlab program was developed to manipulate the DB data, check for duplication and exclude duplicated data. GTSPP data are downloaded from the /realtime directory of the GTSPP data service, and the latest data except Argo data are extracted from the original data. Another Matlab program was coded to inspect all collected data using 10 QC tests and to produce the final dataset, which can be used by the assimilation system. Three regional range tests, inspecting annual, seasonal and monthly variations, are included in the QC procedures. A C program was developed to provide regional ranges to data managers; it can calculate the upper and lower limits of temperature and salinity at depths from 0 to 1550 m. The final TS dataset contains the latest 15 days of TS data in netCDF format. It is updated every week and transmitted to the numerical modelers of KIOST for operational use.
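
    The duplicate check on Argo profile metadata described above can be sketched in a few lines: a primary key on (float ID, observation time) rejects profiles that are already archived. The SQLite table and column names below are illustrative stand-ins for the system's actual database and Matlab programs.

        import sqlite3

        conn = sqlite3.connect("argo_meta.db")
        conn.execute("""
            CREATE TABLE IF NOT EXISTS profiles (
                float_id TEXT, obs_time TEXT, lat REAL, lon REAL,
                PRIMARY KEY (float_id, obs_time)  -- one record per float and time
            )
        """)

        def is_new_profile(float_id, obs_time, lat, lon):
            """Insert profile metadata; report False if it was already archived."""
            try:
                conn.execute("INSERT INTO profiles VALUES (?, ?, ?, ?)",
                             (float_id, obs_time, lat, lon))
                conn.commit()
                return True
            except sqlite3.IntegrityError:  # primary-key clash -> duplicate
                return False

        print(is_new_profile("2902746", "2017-03-01T06:00:00Z", 35.2, 130.4))  # True
        print(is_new_profile("2902746", "2017-03-01T06:00:00Z", 35.2, 130.4))  # False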

  15. Application of Density Estimation Methods to Datasets from a Glider

    Science.gov (United States)

    2014-09-30

    humpback and sperm whales as well as different dolphin species. OBJECTIVES: The objective of this research is to extend existing methods for cetacean density estimation from single-sensor datasets. Required steps for a cue counting approach, where a cue has been defined as a clicking event (Küsel et al., 2011), to

  16. A review of continent scale hydrological datasets available for Africa

    OpenAIRE

    Bonsor, H.C.

    2010-01-01

    As rainfall becomes less reliable with predicted climate change, the ability to assess the spatial and seasonal variations in groundwater availability on a large scale (catchment and continent) is becoming increasingly important (Bates et al., 2007; MacDonald et al., 2009). The scarcity of observed hydrological data, or the difficulty of obtaining such data, within Africa means remotely sensed (RS) datasets must often be used to drive large-scale hydrological models. The different ap...

  17. Dataset of mitochondrial genome variants in oncocytic tumors

    Directory of Open Access Journals (Sweden)

    Lihua Lyu

    2018-04-01

    Full Text Available This dataset presents the mitochondrial genome variants associated with oncocytic tumors. These data were obtained by Sanger sequencing of the whole mitochondrial genomes of oncocytic tumors and the adjacent normal tissues from 32 patients. The mtDNA variants were identified after comparison with the revised Cambridge reference sequence, excluding those defining the haplogroups of our patients. Pathogenicity prediction for the novel missense variants found in this study was performed with the MitImpact 2 program.

  18. GLEAM version 3: Global Land Evaporation Datasets and Model

    Science.gov (United States)

    Martens, B.; Miralles, D. G.; Lievens, H.; van der Schalie, R.; de Jeu, R.; Fernandez-Prieto, D.; Verhoest, N.

    2015-12-01

    Terrestrial evaporation links energy, water and carbon cycles over land and is therefore a key variable of the climate system. However, the global-scale magnitude and variability of the flux, and the sensitivity of the underlying physical process to changes in environmental factors, are still poorly understood due to limitations in in situ measurements. As a result, several methods have arisen to estimate global patterns of land evaporation from satellite observations. However, these algorithms generally differ in their approach to modelling evaporation, resulting in large differences in their estimates. One of these methods is GLEAM, the Global Land Evaporation Amsterdam Methodology. GLEAM estimates terrestrial evaporation based on daily satellite observations of meteorological variables, vegetation characteristics and soil moisture. Since the publication of the first version of the algorithm (2011), the model has been widely applied to analyse trends in the water cycle and land-atmospheric feedbacks during extreme hydrometeorological events. A third version of the GLEAM global datasets is foreseen by the end of 2015. Given the relevance of having a continuous and reliable record of global-scale evaporation estimates for climate and hydrological research, the establishment of an online data portal to make these data available to the public is also foreseen. In this new release of the GLEAM datasets, different components of the model have been updated, with the most significant change being the revision of the data assimilation algorithm. In this presentation, we will highlight the most important changes of the methodology and present three new GLEAM datasets and their validation against in situ observations and an alternative dataset of terrestrial evaporation (ERA-Land). Results of the validation exercise indicate that the magnitude and the spatiotemporal variability of the modelled evaporation agree reasonably well with the estimates of ERA-Land and the in situ

  19. Soil chemistry in lithologically diverse datasets: the quartz dilution effect

    Science.gov (United States)

    Bern, Carleton R.

    2009-01-01

    National- and continental-scale soil geochemical datasets are likely to move our understanding of broad soil geochemistry patterns forward significantly. Patterns of chemistry and mineralogy delineated from these datasets are strongly influenced by the composition of the soil parent material, which itself is largely a function of lithology and particle size sorting. Such controls present a challenge by obscuring subtler patterns arising from subsequent pedogenic processes. Here the effect of quartz concentration is examined in moist-climate soils from a pilot dataset of the North American Soil Geochemical Landscapes Project. Due to variable and high quartz contents (6.2–81.7 wt.%), and its residual and inert nature in soil, quartz is demonstrated to influence broad patterns in soil chemistry. A dilution effect is observed whereby concentrations of various elements are significantly and strongly negatively correlated with quartz. Quartz content drives artificial positive correlations between concentrations of some elements and obscures negative correlations between others. Unadjusted soil data show the highly mobile base cations Ca, Mg, and Na to be often strongly positively correlated with intermediately mobile Al or Fe, and generally uncorrelated with the relatively immobile high-field-strength elements (HFS) Ti and Nb. Both patterns are contrary to broad expectations for soils being weathered and leached. After transforming bulk soil chemistry to a quartz-free basis, the base cations are generally uncorrelated with Al and Fe, and negative correlations generally emerge with the HFS elements. Quartz-free element data may be a useful tool for elucidating patterns of weathering or parent-material chemistry in large soil datasets.
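
    The quartz-free transformation described above amounts to renormalising each element concentration to the non-quartz mass fraction of the sample. A minimal sketch, with illustrative column names and values:

        import pandas as pd

        soils = pd.DataFrame({
            "quartz_wt_pct": [6.2, 45.0, 81.7],   # measured quartz content
            "CaO_wt_pct":    [3.1, 1.6, 0.4],
            "Al2O3_wt_pct":  [14.2, 8.5, 3.0],
        })

        non_quartz = (100.0 - soils["quartz_wt_pct"]) / 100.0  # mass fraction left
        for col in ["CaO_wt_pct", "Al2O3_wt_pct"]:
            soils[col + "_qf"] = soils[col] / non_quartz  # quartz-free basis

        print(soils.round(2))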

  20. Dataset on records of Hericium erinaceus in Slovakia

    OpenAIRE

    Vladimír Kunca; Marek Čiliak

    2017-01-01

    The data presented in this article are related to the research article entitled “Habitat preferences of Hericium erinaceus in Slovakia” (Kunca and Čiliak, 2016) [FUNECO607] [2]. The dataset includes all available and unpublished data from Slovakia, excluding records from the same tree or stem. We compiled a database of records of collections by processing data from herbaria, personal records and communication with mycological activists. Data on altitude, tree species, host tree vital status,...

  1. Diffeomorphic Iterative Centroid Methods for Template Estimation on Large Datasets

    OpenAIRE

    Cury , Claire; Glaunès , Joan Alexis; Colliot , Olivier

    2014-01-01

    International audience; A common approach for the analysis of anatomical variability relies on the estimation of a template representative of the population. The Large Deformation Diffeomorphic Metric Mapping (LDDMM) is an attractive framework for that purpose. However, template estimation using LDDMM is computationally expensive, which is a limitation for the study of large datasets. This paper presents an iterative method which quickly provides a centroid of the population in the shape space. This centr...

  2. A Dataset from TIMSS to Examine the Relationship between Computer Use and Mathematics Achievement

    Science.gov (United States)

    Kadijevich, Djordje M.

    2015-01-01

    Because the relationship between computer use and achievement is still puzzling, there is a need to prepare and analyze good quality datasets on computer use and achievement. Such a dataset can be derived from TIMSS data. This paper describes how this dataset can be prepared. It also gives an example of how the dataset may be analyzed. The…

  3. An Analysis on Better Testing than Training Performances on the Iris Dataset

    NARCIS (Netherlands)

    Schutten, Marten; Wiering, Marco

    2016-01-01

    The Iris dataset is a well known dataset containing information on three different types of Iris flowers. A typical and popular method for solving classification problems on datasets such as the Iris set is the support vector machine (SVM). In order to do so, the dataset is separated into a set used

  4. Parton Distributions based on a Maximally Consistent Dataset

    Science.gov (United States)

    Rojo, Juan

    2016-04-01

    The choice of data that enters a global QCD analysis can have a substantial impact on the resulting parton distributions and their predictions for collider observables. One of the main reasons for this is the possible presence of inconsistencies, either internal within an experiment or external between different experiments. In order to assess the robustness of the global fit, different definitions of a conservative PDF set, that is, a PDF set based on a maximally consistent dataset, have been introduced. However, these approaches are typically affected by theory biases in the selection of the dataset. In this contribution, after a brief overview of recent NNPDF developments, we propose a new, fully objective definition of a conservative PDF set, based on the Bayesian reweighting approach. Using the new NNPDF3.0 framework, we produce various conservative sets, which turn out to be mutually in agreement within the respective PDF uncertainties, as well as with the global fit. We explore some of their implications for LHC phenomenology, finding good consistency with the global fit result as well. These results provide a non-trivial validation test of the new NNPDF3.0 fitting methodology, and indicate that possible inconsistencies in the fitted dataset do not substantially affect the global-fit PDFs.
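
    As background on the reweighting step, the weight commonly used in Bayesian reweighting of NNPDF Monte Carlo replica sets has the following form (shown here for orientation; the exact prescription of this contribution may differ). For replica k with chi-square \chi^2_k against the n new data points,

        w_k = \frac{(\chi^2_k)^{(n-1)/2}\, e^{-\chi^2_k/2}}
                   {\frac{1}{N}\sum_{j=1}^{N} (\chi^2_j)^{(n-1)/2}\, e^{-\chi^2_j/2}},

    so that weighted averages \langle\mathcal{O}\rangle = \frac{1}{N}\sum_{k=1}^{N} w_k\,\mathcal{O}[f_k] replace the unweighted replica mean; replicas that describe the new data poorly are exponentially suppressed.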

  5. New public dataset for spotting patterns in medieval document images

    Science.gov (United States)

    En, Sovann; Nicolas, Stéphane; Petitjean, Caroline; Jurie, Frédéric; Heutte, Laurent

    2017-01-01

    With advances in technology, a large part of our cultural heritage is becoming digitally available. In particular, in the field of historical document image analysis, there is now a growing need for indexing and data mining tools, thus allowing us to spot and retrieve the occurrences of an object of interest, called a pattern, in a large database of document images. Patterns may present some variability in terms of color, shape, or context, making the spotting of patterns a challenging task. Pattern spotting is a relatively new field of research, still hampered by the lack of available annotated resources. We present a new publicly available dataset named DocExplore dedicated to spotting patterns in historical document images. The dataset contains 1500 images and 1464 queries, and allows the evaluation of two tasks: image retrieval and pattern localization. A standardized benchmark protocol along with ad hoc metrics is provided for a fair comparison of the submitted approaches. We also provide some first results obtained with our baseline system on this new dataset, which show that there is room for improvement and that should encourage researchers of the document image analysis community to design new systems and submit improved results.

  6. Kernel-based discriminant feature extraction using a representative dataset

    Science.gov (United States)

    Li, Honglin; Sancho Gomez, Jose-Luis; Ahalt, Stanley C.

    2002-07-01

    Discriminant Feature Extraction (DFE) is widely recognized as an important pre-processing step in classification applications. Most DFE algorithms are linear and thus can only explore the linear discriminant information among the different classes. Recently, there have been several promising attempts to develop nonlinear DFE algorithms, among which is Kernel-based Feature Extraction (KFE). The efficacy of KFE has been experimentally verified on both synthetic data and real problems. However, KFE has some known limitations. First, KFE does not work well for strongly overlapped data. Second, KFE employs all of the training set samples during the feature extraction phase, which can result in significant computation when applied to very large datasets. Finally, KFE can result in overfitting. In this paper, we propose a substantial improvement to KFE that overcomes the above limitations by using a representative dataset, which consists of critical points that are generated from data-editing techniques and centroid points that are determined by using the Frequency Sensitive Competitive Learning (FSCL) algorithm. Experiments show that this new KFE algorithm performs well on significantly overlapped datasets, and it also reduces computational complexity. Further, by controlling the number of centroids, the overfitting problem can be effectively alleviated.
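
    A minimal sketch of kernel discriminant feature extraction with a representative subset, in the spirit of the improvement described above. KMeans centroids stand in for the paper's FSCL centroids and data-editing critical points, and an empirical kernel map followed by linear discriminant analysis stands in for the full KFE formulation.

        import numpy as np
        from sklearn.cluster import KMeans
        from sklearn.datasets import make_classification
        from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
        from sklearn.metrics.pairwise import rbf_kernel

        X, y = make_classification(n_samples=2000, n_features=20, n_classes=3,
                                   n_informative=6, random_state=0)

        # Representative set: a few centroids per class instead of all 2000 samples.
        reps = np.vstack([
            KMeans(n_clusters=10, n_init=10, random_state=0)
            .fit(X[y == c]).cluster_centers_
            for c in np.unique(y)
        ])

        # Empirical kernel map: each sample is represented by its kernel values
        # against the representative set only, keeping computation small.
        K = rbf_kernel(X, reps, gamma=0.05)

        # LDA in this kernel-induced space yields nonlinear discriminant
        # features with respect to the original input space.
        features = LinearDiscriminantAnalysis(n_components=2).fit_transform(K, y)
        print(features.shape)  # (2000, 2)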

  7. Decoys Selection in Benchmarking Datasets: Overview and Perspectives

    Science.gov (United States)

    Réau, Manon; Langenfeld, Florent; Zagury, Jean-François; Lagarde, Nathalie; Montes, Matthieu

    2018-01-01

    Virtual Screening (VS) is designed to prospectively help identifying potential hits, i.e., compounds capable of interacting with a given target and potentially modulate its activity, out of large compound collections. Among the variety of methodologies, it is crucial to select the protocol that is the most adapted to the query/target system under study and that yields the most reliable output. To this aim, the performance of VS methods is commonly evaluated and compared by computing their ability to retrieve active compounds in benchmarking datasets. The benchmarking datasets contain a subset of known active compounds together with a subset of decoys, i.e., assumed non-active molecules. The composition of both the active and the decoy compounds subsets is critical to limit the biases in the evaluation of the VS methods. In this review, we focus on the selection of decoy compounds that has considerably changed over the years, from randomly selected compounds to highly customized or experimentally validated negative compounds. We first outline the evolution of decoys selection in benchmarking databases as well as current benchmarking databases that tend to minimize the introduction of biases, and secondly, we propose recommendations for the selection and the design of benchmarking datasets. PMID:29416509

  8. ENHANCED DATA DISCOVERABILITY FOR IN SITU HYPERSPECTRAL DATASETS

    Directory of Open Access Journals (Sweden)

    B. Rasaiah

    2016-06-01

    Full Text Available Field spectroscopic metadata is a central component in the quality assurance, reliability, and discoverability of hyperspectral data and the products derived from it. Cataloguing, mining, and interoperability of these datasets rely upon the robustness of metadata protocols for field spectroscopy, and on the software architecture to support the exchange of these datasets. Currently, no standard for in situ spectroscopy data or metadata protocols exists. This inhibits the effective sharing of growing volumes of in situ spectroscopy datasets and the exploitation of the benefits of integrating with the evolving range of data sharing platforms. A core metadataset for field spectroscopy was introduced by Rasaiah et al. (2011-2015), with extended support for specific applications. This paper presents a prototype model for an OGC- and ISO-compliant, platform-independent metadata discovery service aligned to the specific requirements of field spectroscopy. In this study, a proof-of-concept metadata catalogue has been described and deployed in a cloud-based architecture as a demonstration of an operationalized field spectroscopy metadata standard and web-based discovery service.

  9. Multiresolution persistent homology for excessively large biomolecular datasets

    Energy Technology Data Exchange (ETDEWEB)

    Xia, Kelin; Zhao, Zhixiong [Department of Mathematics, Michigan State University, East Lansing, Michigan 48824 (United States); Wei, Guo-Wei, E-mail: wei@math.msu.edu [Department of Mathematics, Michigan State University, East Lansing, Michigan 48824 (United States); Department of Electrical and Computer Engineering, Michigan State University, East Lansing, Michigan 48824 (United States); Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, Michigan 48824 (United States)

    2015-10-07

    Although persistent homology has emerged as a promising tool for the topological simplification of complex data, it is computationally intractable for large datasets. We introduce multiresolution persistent homology to handle excessively large datasets. We match the resolution with the scale of interest so as to represent large scale datasets with appropriate resolution. We utilize the flexibility-rigidity index to assess the topological connectivity of the data set and define a rigidity density for the filtration analysis. By appropriately tuning the resolution of the rigidity density, we are able to focus the topological lens on the scale of interest. The proposed multiresolution topological analysis is validated by a hexagonal fractal image which has three distinct scales. We further demonstrate the proposed method for extracting topological fingerprints from DNA molecules. In particular, the topological persistence of a virus capsid with 273 780 atoms is successfully analyzed, which would otherwise be inaccessible to the normal point cloud method and unreliable by using coarse-grained multiscale persistent homology. The proposed method has also been successfully applied to protein domain classification, which is, to our knowledge, the first time that persistent homology is used for practical protein domain analysis. The proposed multiresolution topological method has potential applications in arbitrary data sets, such as social networks, biological networks, and graphs.

  10. Tissue-Based MRI Intensity Standardization: Application to Multicentric Datasets

    Directory of Open Access Journals (Sweden)

    Nicolas Robitaille

    2012-01-01

    Full Text Available Intensity standardization in MRI aims at correcting scanner-dependent intensity variations. Existing simple and robust techniques aim at matching the input image histogram onto a standard, while we think that standardization should aim at matching spatially corresponding tissue intensities. In this study, we present a novel automatic technique, called STI for STandardization of Intensities, which not only shares the simplicity and robustness of histogram-matching techniques, but also incorporates tissue spatial intensity information. STI uses joint intensity histograms to determine intensity correspondence in each tissue between the input and standard images. We compared STI to an existing histogram-matching technique on two multicentric datasets, Pilot E-ADNI and ADNI, by measuring the intensity error with respect to the standard image after performing nonlinear registration. The Pilot E-ADNI dataset consisted of 3 subjects, each scanned at 7 different sites. The ADNI dataset consisted of 795 subjects scanned at more than 50 different sites. STI was superior to the histogram-matching technique, showing significantly better intensity matching for the brain white matter with respect to the standard image.
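
    A deliberately simplified sketch of tissue-based standardization in the spirit of STI: a per-tissue linear rescaling that matches each tissue's median intensity in the input image to that of the standard image. The actual method determines the correspondence from joint intensity histograms of spatially corresponding voxels; this median-matching variant is an assumption for illustration.

        import numpy as np

        def standardize(input_img, standard_img, tissue_masks):
            """tissue_masks: dict of name -> boolean array valid for both images."""
            out = input_img.astype(float).copy()
            for name, mask in tissue_masks.items():
                gain = np.median(standard_img[mask]) / np.median(input_img[mask])
                out[mask] = input_img[mask] * gain  # per-tissue intensity gain
            return out

        rng = np.random.default_rng(1)
        standard = rng.normal(100, 10, (64, 64))
        scanned = standard * 1.3 + 5  # scanner-dependent intensity distortion
        masks = {"wm": standard > 100, "gm": standard <= 100}
        fixed = standardize(scanned, standard, masks)
        print(round(np.median(fixed[masks["wm"]]), 1),
              round(np.median(standard[masks["wm"]]), 1))  # now close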

  11. Exploring massive, genome scale datasets with the genometricorr package

    KAUST Repository

    Favorov, Alexander; Mularoni, Loris; Cope, Leslie M.; Medvedeva, Yulia; Mironov, Andrey A.; Makeev, Vsevolod J.; Wheelan, Sarah J.

    2012-01-01

    We have created a statistically grounded tool for determining the correlation of genomewide data with other datasets or known biological features, intended to guide biological exploration of high-dimensional datasets, rather than providing immediate answers. The software enables several biologically motivated approaches to these data and here we describe the rationale and implementation for each approach. Our models and statistics are implemented in an R package that efficiently calculates the spatial correlation between two sets of genomic intervals (data and/or annotated features), for use as a metric of functional interaction. The software handles any type of pointwise or interval data and instead of running analyses with predefined metrics, it computes the significance and direction of several types of spatial association; this is intended to suggest potentially relevant relationships between the datasets. Availability and implementation: The package, GenometriCorr, can be freely downloaded at http://genometricorr.sourceforge.net/. Installation guidelines and examples are available from the sourceforge repository. The package is pending submission to Bioconductor. © 2012 Favorov et al.
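
    GenometriCorr itself is an R package; the following Python sketch only illustrates one of the underlying ideas: testing whether query intervals lie closer to reference features than expected by chance, using a permutation test on a single toy chromosome.

        import numpy as np

        rng = np.random.default_rng(42)
        CHROM_LEN = 1_000_000
        refs = np.sort(rng.integers(0, CHROM_LEN, 200))          # reference features
        queries = refs[:100] + rng.integers(-500, 500, 100)      # sit near features

        def mean_nearest_distance(q, r):
            """Mean distance from each query point to its nearest reference."""
            idx = np.clip(np.searchsorted(r, q), 1, len(r) - 1)
            return np.minimum(np.abs(q - r[idx - 1]), np.abs(q - r[idx])).mean()

        observed = mean_nearest_distance(queries, refs)
        null = np.array([
            mean_nearest_distance(rng.integers(0, CHROM_LEN, len(queries)), refs)
            for _ in range(1000)
        ])
        p_value = (null <= observed).mean()  # small distance -> spatial association
        print(f"observed={observed:.0f}  p={p_value:.3f}")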

  12. Image segmentation evaluation for very-large datasets

    Science.gov (United States)

    Reeves, Anthony P.; Liu, Shuang; Xie, Yiting

    2016-03-01

    With the advent of modern machine learning methods and fully automated image analysis there is a need for very large image datasets having documented segmentations for both computer algorithm training and evaluation. Current approaches of visual inspection and manual markings do not scale well to big data. We present a new approach that depends on fully automated algorithm outcomes for segmentation documentation, requires no manual marking, and provides quantitative evaluation for computer algorithms. The documentation of new image segmentations and new algorithm outcomes are achieved by visual inspection. The burden of visual inspection on large datasets is minimized by (a) customized visualizations for rapid review and (b) reducing the number of cases to be reviewed through analysis of quantitative segmentation evaluation. This method has been applied to a dataset of 7,440 whole-lung CT images for 6 different segmentation algorithms designed to fully automatically facilitate the measurement of a number of very important quantitative image biomarkers. The results indicate that we could achieve 93% to 99% successful segmentation for these algorithms on this relatively large image database. The presented evaluation method may be scaled to much larger image databases.

  14. Principal Component Analysis of Process Datasets with Missing Values

    Directory of Open Access Journals (Sweden)

    Kristen A. Severson

    2017-07-01

    Full Text Available Datasets with missing values arising from causes such as sensor failure, inconsistent sampling rates, and merging data from different systems are common in the process industry. Methods for handling missing data typically operate during data pre-processing, but can also be applied during model building. This article considers missing data within the context of principal component analysis (PCA), a method originally developed for complete data that has widespread industrial application in multivariate statistical process control. Due to the prevalence of missing data and the success of PCA for handling complete data, several PCA algorithms that can act on incomplete data have been proposed. Here, algorithms for applying PCA to datasets with missing values are reviewed. A case study is presented to demonstrate the performance of the algorithms, and suggestions are made with respect to choosing which algorithm is most appropriate for particular settings. An alternating algorithm based on the singular value decomposition achieved the best results in the majority of test cases involving process datasets.
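
    A minimal sketch of the alternating, SVD-based idea mentioned above: fill the missing entries, fit a low-rank PCA model, re-impute the missing entries from the model, and iterate. This generic iterative-SVD scheme is an illustration, not necessarily the exact algorithm benchmarked in the article.

        import numpy as np

        def pca_missing(X, rank=2, n_iter=100, tol=1e-6):
            X = X.copy()
            miss = np.isnan(X)
            col_means = np.nanmean(X, axis=0)
            X[miss] = np.take(col_means, np.where(miss)[1])  # initial fill
            for _ in range(n_iter):
                mu = X.mean(axis=0)
                U, s, Vt = np.linalg.svd(X - mu, full_matrices=False)
                X_hat = mu + (U[:, :rank] * s[:rank]) @ Vt[:rank]  # rank-r model
                delta = np.abs(X[miss] - X_hat[miss]).max()
                X[miss] = X_hat[miss]              # update only missing entries
                if delta < tol:
                    break
            return X, Vt[:rank]                    # completed data, loadings

        rng = np.random.default_rng(0)
        scores = rng.normal(size=(100, 2))
        data = scores @ rng.normal(size=(2, 6)) + 0.01 * rng.normal(size=(100, 6))
        data[rng.random(data.shape) < 0.1] = np.nan    # knock out ~10% of entries
        completed, loadings = pca_missing(data)
        print(np.isnan(completed).any(), loadings.shape)  # False (2, 6)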

  15. A cross-country Exchange Market Pressure (EMP) dataset

    Directory of Open Access Journals (Sweden)

    Mohit Desai

    2017-06-01

    Full Text Available The data presented in this article are related to the research article titled “An exchange market pressure measure for cross country analysis” (Patnaik et al. [1]). In this article, we present the dataset of Exchange Market Pressure (EMP) values for 139 countries along with their conversion factors, ρ (rho). Exchange Market Pressure, expressed as a percentage change in the exchange rate, measures the change in the exchange rate that would have taken place had the central bank not intervened. The conversion factor ρ can be interpreted as the change in the exchange rate associated with $1 billion of intervention. Estimates of the conversion factor ρ allow us to calculate a monthly time series of EMP for 139 countries. Additionally, the dataset contains the 68% confidence intervals (high and low values) for the point estimates of the ρ's. Using the standard errors of the estimates of the ρ's, we obtain one-sigma intervals around the mean estimates of the EMP values. These values are also reported in the dataset.
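
    A small worked example of the EMP arithmetic under the interpretation stated above: EMP is the observed percentage change in the exchange rate plus the conversion factor times the intervention. All numbers are illustrative, not values from the dataset.

        pct_change_exchange_rate = -0.8  # observed monthly change, in percent
        rho = 0.25                       # % change associated with $1 bn of intervention
        intervention_bn_usd = 3.0        # central bank intervention, $ bn

        # Pressure the currency would have shown had the bank not intervened:
        emp = pct_change_exchange_rate + rho * intervention_bn_usd
        print(f"EMP = {emp:.2f}%")       # -0.80% observed + 0.75% absorbed = -0.05%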

  16. The case for developing publicly-accessible datasets for health services research in the Middle East and North Africa (MENA) region

    Directory of Open Access Journals (Sweden)

    El-Jardali Fadi

    2009-10-01

    Full Text Available Abstract Background The existence of publicly-accessible datasets represents a significant opportunity for health services research to evolve into a science that supports health policy making and evaluation, proper inter- and intra-organizational decisions and optimal clinical interventions. This paper investigated the role of publicly-accessible datasets in the enhancement of health care systems in the developed world and highlighted the importance of their wide existence and use in the Middle East and North Africa (MENA) region. Discussion A search was conducted to explore the availability of publicly-accessible datasets in the MENA region. Although datasets were found in most countries in the region, they were limited in terms of their relevance, quality and public accessibility. With rare exceptions, publicly-accessible datasets - as present in the developed world - were absent. Based on this, we proposed a gradual approach and a set of recommendations to promote the development and use of publicly-accessible datasets in the region. These recommendations target potential actions by governments, researchers, policy makers and international organizations. Summary We argue that the limited number of publicly-accessible datasets in the MENA region represents a lost opportunity for the evidence-based advancement of health systems in the region. The availability and use of publicly-accessible datasets would encourage policy makers in this region to base their decisions on solid representative data and not on estimates or small-scale studies; researchers would be able to exercise their expertise in a meaningful manner for both policy makers and the public. The population of the MENA countries would exercise the right to benefit from locally- or regionally-based studies, versus imported and, in best cases, customized ones. Furthermore, on a macro scale, the availability of regionally comparable publicly-accessible datasets would allow for the

  17. Compilation and analysis of multiple groundwater-quality datasets for Idaho

    Science.gov (United States)

    Hundt, Stephen A.; Hopkins, Candice B.

    2018-05-09

    Groundwater is an important source of drinking and irrigation water throughout Idaho, and groundwater quality is monitored by various Federal, State, and local agencies. The historical, multi-agency records of groundwater quality include a valuable dataset that has yet to be compiled or analyzed on a statewide level. The purpose of this study is to combine groundwater-quality data from multiple sources into a single database, to summarize this dataset, and to perform bulk analyses to reveal spatial and temporal patterns of water quality throughout Idaho. Data were retrieved from the Water Quality Portal (https://www.waterqualitydata.us/), the Idaho Department of Environmental Quality, and the Idaho Department of Water Resources. Analyses included counting the number of times a sample location had concentrations above Maximum Contaminant Levels (MCL), performing trends tests, and calculating correlations between water-quality analytes. The water-quality database and the analysis results are available through USGS ScienceBase (https://doi.org/10.5066/F72V2FBG).

  18. Journal of Earth System Science | Indian Academy of Sciences

    Indian Academy of Sciences (India)

    Predictors for Bangladesh summer monsoon (June–September) rainfall were identified ... After carrying out a detailed analysis of various global climate datasets; three ... Department of Physics, Bangladesh University of Engineering & Technology ...

  19. A dataset for examining trends in publication of new Australian insects

    Directory of Open Access Journals (Sweden)

    Robert Mesibov

    2014-07-01

    Full Text Available Australian Faunal Directory data were used to create a new, publicly available dataset, nai50, which lists 18318 species and subspecies names for Australian insects described in the period 1961–2010, together with associated publishing data. The number of taxonomic publications introducing the new names varied little around a long-term average of 70 per year, with ca 420 new names published per year during the 30-year period 1981–2010. Within this stable pattern there were steady increases in multi-authored and 'Smith in Jones and Smith' names, and a decline in publication of names in entomology journals and books. For taxonomic works published in Australia, a publications peak around 1990 reflected increases in museum, scientific society and government agency publishing, but a subsequent decline is largely explained by a steep drop in the number of papers on insect taxonomy published by Australia's national science agency, CSIRO.

  20. Recent Upgrades to NASA SPoRT Initialization Datasets for the Environmental Modeling System

    Science.gov (United States)

    Case, Jonathan L.; Lafontaine, Frank J.; Molthan, Andrew L.; Zavodsky, Bradley T.; Rozumalski, Robert A.

    2012-01-01

    The NASA Short-term Prediction Research and Transition (SPoRT) Center has developed several products for its NOAA/National Weather Service (NWS) partners that can initialize specific fields for local model runs within the NOAA/NWS Science and Training Resource Center Environmental Modeling System (EMS). The suite of SPoRT products for use in the EMS consists of a Sea Surface Temperature (SST) composite that includes a Lake Surface Temperature (LST) analysis over the Great Lakes, a Great Lakes sea-ice extent within the SST composite, a real-time Green Vegetation Fraction (GVF) composite, and NASA Land Information System (LIS) gridded output. This paper and companion poster describe each dataset and provide recent upgrades made to the SST, Great Lakes LST, GVF composites, and the real-time LIS runs.

  1. Science and data science.

    Science.gov (United States)

    Blei, David M; Smyth, Padhraic

    2017-08-07

    Data science has attracted a lot of attention, promising to turn vast amounts of data into useful predictions and insights. In this article, we ask why scientists should care about data science. To answer, we discuss data science from three perspectives: statistical, computational, and human. Although each of the three is a critical component of data science, we argue that the effective combination of all three components is the essence of what data science is about.

  2. Reuse for Research: Curating Astrophysical Datasets for Future Researchers

    DEFF Research Database (Denmark)

    Conrad, Anders Sparre; Rasmus, Handberg; Svendsen, Michael

    2017-01-01

    “Our data are going to be valuable for science for the next 50 years, so please make sure you preserve them and keep them accessible for active research for at least that period.” These were approximately the words used by the principal investigator of the Kepler Asteroseismic Science Consortium ...

  3. Lightweight Advertising and Scalable Discovery of Services, Datasets, and Events Using Feedcasts

    Science.gov (United States)

    Wilson, B. D.; Ramachandran, R.; Movva, S.

    2010-12-01

    Broadcast feeds (Atom or RSS) are a mechanism for advertising the existence of new data objects on the web, with metadata and links to further information. Users then subscribe to the feed to receive updates. This concept has already been used to advertise the new granules of science data as they are produced (datacasting), with browse images and metadata, and to advertise bundles of web services (service casting). Structured metadata is introduced into the XML feed format by embedding new XML tags (in defined namespaces), using typed links, and reusing built-in Atom feed elements. This “infocasting” concept can be extended to include many other science artifacts, including data collections, workflow documents, topical geophysical events (hurricanes, forest fires, etc.), natural hazard warnings, and short articles describing a new science result. The common theme is that each infocast contains machine-readable, structured metadata describing the object and enabling further manipulation. For example, service casts contain typed links pointing to the service interface description (e.g., WSDL for SOAP services), the service endpoint, and human-readable documentation. Our Infocasting project has three main goals: (1) define and evangelize micro-formats (metadata standards) so that providers can easily advertise their web services, datasets, and topical geophysical events by adding structured information to broadcast feeds; (2) develop authoring tools so that anyone can easily author such service advertisements, data casts, and event descriptions; and (3) provide a one-stop, Google-like search box in the browser that allows discovery of service, data and event casts visible on the web, and services & data registered in the GEOSS repository and other NASA repositories (GCMD & ECHO). To demonstrate the event casting idea, a series of micro-articles—with accompanying event casts containing links to relevant datasets, web services, and science analysis workflows—will be

  4. To the Geoportal and Beyond! Preparing the Earth Observing Laboratory's Datasets for Inter-Repository Discovery

    Science.gov (United States)

    Gordon, S.; Dattore, E.; Williams, S.

    2014-12-01

    Even when a data center makes its datasets accessible, they can still be hard to discover if the user is unaware of the laboratory or organization the data center supports. NCAR's Earth Observing Laboratory (EOL) is no exception. In response to this problem, and as an inquiry into the feasibility of inter-connecting all of NCAR's repositories at a discovery layer, ESRI's Geoportal was researched. It was determined that an implementation of Geoportal would be a good choice to build a proof-of-concept model of inter-repository discovery around. This collaborative project between the University of Illinois and NCAR is coordinated through the Data Curation Education in Research Centers program, which is funded by the Institute of Museum and Library Services. Geoportal is open source software. It serves as an aggregation point for metadata catalogs of earth science datasets, with a focus on geospatial information. EOL's metadata is in static THREDDS catalogs, but Geoportal can only create records from a THREDDS Data Server. The first step was therefore to make EOL metadata more accessible by utilizing the ISO 19115-2 standard. It was also decided to create DIF records so EOL datasets could be ingested in NASA's Global Change Master Directory (GCMD). To offer records for harvest, it was decided to develop an OAI-PMH server; to make a compliant server, the OAI_DC standard was also implemented. A server was written in Perl to serve a set of static records. We created a sample set of records in ISO 19115-2, FGDC, DIF, and OAI_DC, and utilized GCMD shared vocabularies to enhance discoverability and precision. The proof of concept was tested and verified by having another NCAR laboratory's Geoportal harvest our sample set. To prepare for production, templates for each standard were developed and mapped to the database. These templates will help the automated creation of records. Once the OAI-PMH server is re-written in a Grails framework, a dynamic representation of EOL's metadata will
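
    A minimal client-side sketch of harvesting such an endpoint. The OAI-PMH verb ListRecords and the oai_dc metadata prefix are standard protocol elements; the endpoint URL is a placeholder, not EOL's actual server.

        import urllib.parse
        import urllib.request
        import xml.etree.ElementTree as ET

        BASE_URL = "https://example.org/oai"  # placeholder endpoint
        params = urllib.parse.urlencode(
            {"verb": "ListRecords", "metadataPrefix": "oai_dc"})

        with urllib.request.urlopen(BASE_URL + "?" + params) as response:
            tree = ET.parse(response)

        ns = {"oai": "http://www.openarchives.org/OAI/2.0/",
              "dc": "http://purl.org/dc/elements/1.1/"}
        for record in tree.iterfind(".//oai:record", ns):
            title = record.find(".//dc:title", ns)
            print(title.text if title is not None else "(no title)")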

  5. Original Dataset - dbQSNP | LSDB Archive [Life Science Database Archive metadata]

    Lifescience Database Archive (English)

    Full Text Available ... included; the subdirectory old/ is not included; pml/ includes data in PML (Polymorphism Markup Language)

  6. Animated analysis of geoscientific datasets: An interactive graphical application

    Science.gov (United States)

    Morse, Peter; Reading, Anya; Lueg, Christopher

    2017-12-01

    Geoscientists are required to analyze and draw conclusions from increasingly large volumes of data. There is a need to recognise and characterise features and changing patterns of Earth observables within such large datasets. It is also necessary to identify significant subsets of the data for more detailed analysis. We present an innovative, interactive software tool and workflow to visualise, characterise, sample and tag large geoscientific datasets from both local and cloud-based repositories. It uses an animated interface and human-computer interaction to utilise the capacity of human expert observers to identify features via enhanced visual analytics. 'Tagger' enables users to analyze datasets that are too large in volume to be drawn legibly on a reasonable number of single static plots. Users interact with the moving graphical display, tagging data ranges of interest for subsequent attention. The tool provides a rapid pre-pass process using fast GPU-based OpenGL graphics and data-handling and is coded in the Quartz Composer visual programming language (VPL) on Mac OS X. It makes use of interoperable data formats, and cloud-based (or local) data storage and compute. In a case study, Tagger was used to characterise a decade (2000-2009) of data recorded by the Cape Sorell Waverider Buoy, located approximately 10 km off the west coast of Tasmania, Australia. These data serve as a proxy for the understanding of Southern Ocean storminess, which has both local and global implications. This example shows use of the tool to identify and characterise four different types of storm and non-storm events during this time. Events characterised in this way are compared with conventional analysis, noting advantages and limitations of data analysis using animation and human interaction. Tagger provides a new ability to make use of humans as feature detectors in computer-based analysis of large-volume geoscience and other data.

  7. Designing the colorectal cancer core dataset in Iran

    Directory of Open Access Journals (Sweden)

    Sara Dorri

    2017-01-01

    Full Text Available Background: There is no need to explain the importance of collecting, recording and analyzing disease information in any health organization. In this regard, the systematic design of standard data sets can help to record uniform and consistent information and can create interoperability between health care systems. The main purpose of this study was to design a core dataset to record colorectal cancer information in Iran. Methods: For the design of the colorectal cancer core data set, a combination of literature review and expert consensus was used. In the first phase, a draft of the data set was designed based on a review of the colorectal cancer literature and comparative studies. In the second phase, this data set was evaluated by experts from different disciplines, such as medical informatics, oncology and surgery, and their comments and opinions were collected. In the third phase, the refined data set was evaluated again by experts and the final data set was proposed. Results: In the first phase, based on the literature review, a draft set of 85 data elements was designed. In the second phase this data set was evaluated by experts and supplementary information was offered by professionals in subgroups, especially for the treatment part; in this phase the number of elements reached 93. In the third phase, evaluation was conducted by experts and the data set was finalized in five main parts: demographic information, diagnostic information, treatment information, clinical status assessment information, and clinical trial information. Conclusion: In this study a comprehensive core data set for colorectal cancer was designed. In the field of collecting colorectal cancer information, this data set can be useful by facilitating the exchange of health information. Designing such data sets for similar diseases can help providers collect standard data from patients and can accelerate retrieval from storage systems.

  8. FTSPlot: fast time series visualization for large datasets.

    Directory of Open Access Journals (Sweden)

    Michael Riss

    Full Text Available The analysis of electrophysiological recordings often involves visual inspection of time series data to locate specific experiment epochs, mask artifacts, and verify the results of signal processing steps, such as filtering or spike detection. Long-term experiments with continuous data acquisition generate large amounts of data. Rapid browsing through these massive datasets poses a challenge to conventional data plotting software because the plotting time increases proportionately to the increase in the volume of data. This paper presents FTSPlot, which is a visualization concept for large-scale time series datasets using techniques from the field of high performance computer graphics, such as hierarchic level of detail and out-of-core data handling. In a preprocessing step, time series data, event, and interval annotations are converted into an optimized data format, which then permits fast, interactive visualization. The preprocessing step has a computational complexity of O(n · log(N)); the visualization itself can be done with a complexity of O(1) and is therefore independent of the amount of data. A demonstration prototype has been implemented and benchmarks show that the technology is capable of displaying large amounts of time series data, events, and interval annotations lag-free with < 20 ms delay. The current 64-bit implementation theoretically supports datasets with up to 2^64 bytes; on the x86_64 architecture currently up to 2^48 bytes are supported, and benchmarks have been conducted with 2^40 bytes (1 TiB), or 1.3 × 10^11 double precision samples. The presented software is freely available and can be included as a Qt GUI component in future software projects, providing a standard visualization method for long-term electrophysiological experiments.
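
    A minimal sketch of the hierarchic level-of-detail idea: precompute min/max summaries of the series at power-of-two block sizes so that any zoom level can be drawn from a bounded number of blocks, independent of the total data volume. FTSPlot's actual on-disk format and implementation differ; this is only the core downsampling scheme.

        import numpy as np

        def build_pyramid(samples, levels=8):
            """pyramid[k] holds per-block (min, max) for blocks of 2**(k+1) samples."""
            pyramid, lo, hi = [], samples, samples
            for _ in range(levels):
                n = len(lo) // 2 * 2                      # drop odd tail sample
                lo = np.minimum(lo[:n:2], lo[1:n:2])      # pairwise minima
                hi = np.maximum(hi[:n:2], hi[1:n:2])      # pairwise maxima
                pyramid.append((lo, hi))
            return pyramid

        def render(pyramid, start, stop, width_px=1000):
            """Pick the coarsest level that still gives >= width_px blocks."""
            span = stop - start
            level = min(len(pyramid) - 1,
                        max(0, int(np.log2(max(span // width_px, 1))) - 1))
            block = 2 ** (level + 1)
            lo, hi = pyramid[level]
            return lo[start // block: stop // block], hi[start // block: stop // block]

        signal = np.cumsum(np.random.default_rng(0).normal(size=2**20))
        pyramid = build_pyramid(signal)
        lo, hi = render(pyramid, 0, 2**20)
        print(len(lo))  # number of (min, max) envelope blocks actually drawn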

  9. A synthetic dataset for evaluating soft and hard fusion algorithms

    Science.gov (United States)

    Graham, Jacob L.; Hall, David L.; Rimland, Jeffrey

    2011-06-01

    There is an emerging demand for the development of data fusion techniques and algorithms that are capable of combining conventional "hard" sensor inputs such as video, radar, and multispectral sensor data with "soft" data including textual situation reports, open-source web information, and "hard/soft" data such as image or video data that includes human-generated annotations. New techniques that assist in sense-making over a wide range of vastly heterogeneous sources are critical to improving tactical situational awareness in counterinsurgency (COIN) and other asymmetric warfare situations. A major challenge in this area is the lack of realistic datasets available for test and evaluation of such algorithms. While "soft" message sets exist, they tend to be of limited use for data fusion applications due to the lack of critical message pedigree and other metadata. They also lack corresponding hard sensor data that presents reasonable "fusion opportunities" to evaluate the ability to make connections and inferences that span the soft and hard data sets. This paper outlines the design methodologies, content, and some potential use cases of a COIN-based synthetic soft and hard dataset created under a United States Multi-disciplinary University Research Initiative (MURI) program funded by the U.S. Army Research Office (ARO). The dataset includes realistic synthetic reports from a variety of sources, corresponding synthetic hard data, and an extensive supporting database that maintains "ground truth" through logical grouping of related data into "vignettes." The supporting database also maintains the pedigree of messages and other critical metadata.

  10. Identifying frauds and anomalies in Medicare-B dataset.

    Science.gov (United States)

    Jiwon Seo; Mendelevitch, Ofer

    2017-07-01

    The healthcare industry is growing at a rapid rate, reaching a market value of $7 trillion worldwide. At the same time, fraud in healthcare is becoming a serious problem, amounting to 5% of total healthcare spending, or $100 billion each year in the US. Manually detecting healthcare fraud requires much effort. Recently, machine learning and data mining techniques have been applied to automatically detect healthcare fraud. This paper proposes a novel PageRank-based algorithm to detect healthcare frauds and anomalies. We apply the algorithm to the Medicare-B dataset, a real-life dataset with 10 million healthcare insurance claims. The algorithm successfully identifies tens of previously unreported anomalies.
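
    A minimal sketch of a PageRank-style anomaly score on claims data, in the spirit of the paper: providers and procedure codes form a bipartite graph weighted by claim counts, and providers whose PageRank departs strongly from their raw claim share are flagged for review. The scoring heuristic and toy data are illustrative assumptions, not the paper's exact algorithm.

        import networkx as nx

        claims = [  # (provider, procedure_code, n_claims) -- toy data
            ("prov_A", "99213", 120), ("prov_A", "99214", 80),
            ("prov_B", "99213", 100), ("prov_B", "99215", 30),
            ("prov_C", "99215", 900),  # unusually heavy use of one code
        ]

        G = nx.Graph()
        for provider, code, n in claims:
            G.add_edge(provider, code, weight=n)

        pagerank = nx.pagerank(G, weight="weight")
        providers = {c[0] for c in claims}
        volume = {p: sum(n for q, _, n in claims if q == p) for p in providers}
        total = sum(volume.values())
        for p in sorted(providers):
            score = pagerank[p] / (volume[p] / total)  # rank vs raw claim share
            print(p, round(score, 2))                  # far from 1.0 -> inspect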

  11. Power analysis dataset for QCA based multiplexer circuits

    Directory of Open Access Journals (Sweden)

    Md. Abdullah-Al-Shafi

    2017-04-01

    Full Text Available Power consumption in irreversible QCA logic circuits is a vital and major issue; however, in practical cases this focus is mostly omitted. The complete power dissipation dataset of different QCA multiplexers is worked out in this paper. At a temperature of −271.15 °C, the dissipation is evaluated under three separate tunneling energy levels. All circuits are designed with QCADesigner, a broadly used simulation engine, and the QCAPro tool has been applied for estimating the power dissipation.

  12. Equalizing imbalanced imprecise datasets for genetic fuzzy classifiers

    Directory of Open Access Journals (Sweden)

    Ana M. Palacios

    2012-04-01

    Full Text Available Determining whether an imprecise dataset is imbalanced is not immediate. The vagueness in the data means that the prior probabilities of the classes are not precisely known, and therefore the degree of imbalance can also be uncertain. In this paper we propose suitable extensions of different resampling algorithms that can be applied to interval-valued, multi-labelled data. By means of these extended preprocessing algorithms, certain classification systems designed to minimize the fraction of misclassifications are able to produce knowledge bases that are also adequate under common metrics for imbalanced classification.

  13. Dataset concerning the analytical approximation of the Ae3 temperature

    Directory of Open Access Journals (Sweden)

    B.L. Ennis

    2017-02-01

    The dataset includes the terms of the function and the values of the polynomial coefficients for the major alloying elements in steel. A short description of the approximation method used to derive and validate the coefficients has also been included. For discussion and application of this model, please refer to the full-length article entitled “The role of aluminium in chemical and phase segregation in a TRIP-assisted dual phase steel”, 10.1016/j.actamat.2016.05.046 (Ennis et al., 2016) [1].

  14. Gene set analysis of the EADGENE chicken data-set

    DEFF Research Database (Denmark)

    Skarman, Axel; Jiang, Li; Hornshøj, Henrik

    2009-01-01

    Abstract Background: Gene set analysis is considered to be a way of improving our biological interpretation of the observed expression patterns. This paper describes different methods applied to analyse expression data from a chicken DNA microarray dataset. Results: Applying different gene set analyses to the chicken expression data led to different rankings of the Gene Ontology terms tested. A method for prediction of possible annotations was applied. Conclusion: Biological interpretation based on gene set analyses depended on the statistical method used. Methods for predicting the possible

  15. A Validation Dataset for CryoSat Sea Ice Investigators

    DEFF Research Database (Denmark)

    Gaudelli, Julia; Baker, Steve; Haas, Christian

    Since its launch in April 2010, CryoSat has been collecting valuable sea ice data over the Arctic region. Over the same period, ESA's CryoVEx and NASA IceBridge validation campaigns have been collecting a unique set of coincident airborne measurements in the Arctic. The CryoVal-SI project has ... community. In this talk we will describe the composition of the validation dataset, summarising how it was processed and how to understand the content and format of the data. We will also explain how to access the data and the supporting documentation.

  16. Dataset of statements on policy integration of selected intergovernmental organizations

    Directory of Open Access Journals (Sweden)

    Jale Tosun

    2018-04-01

    Full Text Available This article describes data for 78 intergovernmental organizations (IGOs) working on topics related to energy governance, environmental protection, and the economy. The set of IGOs covered also includes organizations active in other sectors. The point of departure for data construction was the Correlates of War dataset, from which we selected this sample of IGOs. We updated and expanded the empirical information on the selected IGOs by manual coding. Most importantly, we collected the primary law texts of the individual IGOs in order to code whether they commit themselves to environmental policy integration (EPI), climate policy integration (CPI) and/or energy policy integration (EnPI).

  17. Dataset on the energy performance of atrium type hotel buildings.

    Science.gov (United States)

    Vujosevic, Milica; Krstic-Furundzic, Aleksandra

    2018-04-01

    The data presented in this article are related to the research article entitled "The Influence of Atrium on Energy Performance of Hotel Building" (Vujosevic and Krstic-Furundzic, 2017) [1], which describes the annual energy performance of an atrium-type hotel building in Belgrade climate conditions, with the objective of presenting the impact of the atrium on the hotel building's energy demands for space heating and cooling. This dataset is made publicly available to show the energy performance of the selected hotel design alternatives, in order to enable extended analyses of these data by other researchers.

  18. Dataset on records of Hericium erinaceus in Slovakia.

    Science.gov (United States)

    Kunca, Vladimír; Čiliak, Marek

    2017-06-01

    The data presented in this article are related to the research article entitled "Habitat preferences of Hericium erinaceus in Slovakia" (Kunca and Čiliak, 2016) [FUNECO607] [2]. The dataset includes all available and unpublished data from Slovakia, excluding records from the same tree or stem. We compiled a database of records of collections by processing data from herbaria, personal records and communication with mycological activists. Data on altitude, tree species, host tree vital status, host tree position and intensity of management of forest stands were evaluated in this study. All surveys were based on basidioma occurrence, and some result from targeted searches.

  19. Dataset on records of Hericium erinaceus in Slovakia

    Directory of Open Access Journals (Sweden)

    Vladimír Kunca

    2017-06-01

    Full Text Available The data presented in this article are related to the research article entitled “Habitat preferences of Hericium erinaceus in Slovakia” (Kunca and Čiliak, 2016) [FUNECO607] [2]. The dataset includes all available and unpublished data from Slovakia, excluding records from the same tree or stem. We compiled a database of records of collections by processing data from herbaria, personal records and communication with mycological activists. Data on altitude, tree species, host tree vital status, host tree position and intensity of management of forest stands were evaluated in this study. All surveys were based on basidioma occurrence, and some result from targeted searches.

  20. Science in Science Fiction.

    Science.gov (United States)

    Allday, Jonathan

    2003-01-01

    Offers some suggestions as to how science fiction, especially television science fiction programs such as "Star Trek" and "Star Wars", can be drawn into physics lessons to illuminate some interesting issues. (Author/KHR)

  1. Parallel Framework for Dimensionality Reduction of Large-Scale Datasets

    Directory of Open Access Journals (Sweden)

    Sai Kiranmayee Samudrala

    2015-01-01

    Full Text Available Dimensionality reduction refers to a set of mathematical techniques used to reduce complexity of the original high-dimensional data, while preserving its selected properties. Improvements in simulation strategies and experimental data collection methods are resulting in a deluge of heterogeneous and high-dimensional data, which often makes dimensionality reduction the only viable way to gain qualitative and quantitative understanding of the data. However, existing dimensionality reduction software often does not scale to datasets arising in real-life applications, which may consist of thousands of points with millions of dimensions. In this paper, we propose a parallel framework for dimensionality reduction of large-scale data. We identify key components underlying the spectral dimensionality reduction techniques, and propose their efficient parallel implementation. We show that the resulting framework can be used to process datasets consisting of millions of points when executed on a 16,000-core cluster, which is beyond the reach of currently available methods. To further demonstrate applicability of our framework we perform dimensionality reduction of 75,000 images representing morphology evolution during manufacturing of organic solar cells in order to identify how processing parameters affect morphology evolution.

  2. Benchmarking undedicated cloud computing providers for analysis of genomic datasets.

    Science.gov (United States)

    Yazar, Seyhan; Gooden, George E C; Mackey, David A; Hewitt, Alex W

    2014-01-01

    A major bottleneck in biological discovery is now emerging at the computational level. Cloud computing offers a dynamic means whereby small and medium-sized laboratories can rapidly adjust their computational capacity. We benchmarked two established cloud computing services, Amazon Web Services Elastic MapReduce (EMR) on Amazon EC2 instances and Google Compute Engine (GCE), using publicly available genomic datasets (E. coli CC102 strain and a Han Chinese male genome) and a standard bioinformatic pipeline on a Hadoop-based platform. Wall-clock time for complete assembly differed by 52.9% (95% CI: 27.5-78.2) for E. coli and 53.5% (95% CI: 34.4-72.6) for the human genome, with GCE being more efficient than EMR. The cost of running this experiment on EMR and GCE differed significantly, with the costs on EMR being 257.3% (95% CI: 211.5-303.1) and 173.9% (95% CI: 134.6-213.1) more expensive for the E. coli and human assemblies, respectively. Thus, GCE was found to outperform EMR both in terms of cost and wall-clock time. Our findings confirm that cloud computing is an efficient and potentially cost-effective alternative for analysis of large genomic datasets. In addition to releasing our cost-effectiveness comparison, we present ready-to-use scripts for establishing Hadoop instances with Ganglia monitoring on EC2 or GCE.

  3. Robust computational analysis of rRNA hypervariable tag datasets.

    Directory of Open Access Journals (Sweden)

    Maksim Sipos

    Full Text Available Next-generation DNA sequencing is increasingly being utilized to probe microbial communities, such as gastrointestinal microbiomes, where it is important to be able to quantify measures of abundance and diversity. The fragmented nature of the 16S rRNA datasets obtained, coupled with their unprecedented size, has led to the recognition that the results of such analyses are potentially contaminated by a variety of artifacts, both experimental and computational. Here we quantify how multiple alignment and clustering errors contribute to overestimates of abundance and diversity, reflected by incorrect OTU assignment, corrupted phylogenies, inaccurate species diversity estimators, and rank abundance distribution functions. We show that straightforward procedural optimizations, combining preexisting tools, are effective in handling large (10^5-10^6) 16S rRNA datasets, and we describe metrics to measure the effectiveness and quality of the estimators obtained. We introduce two metrics to ascertain the quality of clustering of pyrosequenced rRNA data, and show that complete linkage clustering greatly outperforms other widely used methods.
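
    A minimal sketch of the complete-linkage clustering step the abstract refers to, assigning sequences to OTUs at a fixed dissimilarity threshold (the distances below are random placeholders for real pairwise rRNA dissimilarities):

    ```python
    # Complete-linkage hierarchical clustering of sequences into OTUs
    # at 3% dissimilarity (the conventional 97%-similarity OTU cutoff).
    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.spatial.distance import squareform

    rng = np.random.default_rng(1)
    n = 50
    d = rng.uniform(0.0, 0.2, size=(n, n))
    d = (d + d.T) / 2.0                   # symmetric distance matrix
    np.fill_diagonal(d, 0.0)

    Z = linkage(squareform(d), method='complete')
    otus = fcluster(Z, t=0.03, criterion='distance')
    print(len(set(otus)), "OTUs")
    ```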

  4. BLAST-EXPLORER helps you building datasets for phylogenetic analysis

    Directory of Open Access Journals (Sweden)

    Claverie Jean-Michel

    2010-01-01

    Full Text Available Abstract Background The right sampling of homologous sequences for phylogenetic or molecular evolution analyses is a crucial step, the quality of which can have a significant impact on the final interpretation of the study. There is no single way of constructing datasets suitable for phylogenetic analysis, because this task intimately depends on the scientific question we want to address. Moreover, database-mining software such as BLAST, which is routinely used for searching homologous sequences, is not specifically optimized for this task. Results To fill this gap, we designed BLAST-Explorer, an original and friendly web-based application that combines a BLAST search with a suite of tools that allows interactive, phylogenetic-oriented exploration of the BLAST results and flexible selection of homologous sequences among the BLAST hits. Once the selection of the BLAST hits is done using BLAST-Explorer, the corresponding sequences can be imported locally for external analysis or passed to the phylogenetic tree reconstruction pipelines available on the Phylogeny.fr platform. Conclusions BLAST-Explorer provides a simple, intuitive and interactive graphical representation of the BLAST results and allows selection and retrieval of the BLAST hit sequences based on a wide range of criteria. Although BLAST-Explorer primarily aims at helping the construction of sequence datasets for further phylogenetic study, it can also be used as a standard BLAST server with enriched output. BLAST-Explorer is available at http://www.phylogeny.fr

  5. Multiresolution comparison of precipitation datasets for large-scale models

    Science.gov (United States)

    Chun, K. P.; Sapriza Azuri, G.; Davison, B.; DeBeer, C. M.; Wheater, H. S.

    2014-12-01

    Gridded precipitation datasets are crucial for driving large-scale models used in weather forecasting and climate research. However, the quality of precipitation products is usually validated individually. Comparisons between gridded precipitation products along with ground observations provide another avenue for investigating how precipitation uncertainty would affect the performance of large-scale models. In this study, using data from a set of precipitation gauges over British Columbia and Alberta, we evaluate several widely used North American gridded products, including the Canadian Gridded Precipitation Anomalies (CANGRD), the National Center for Environmental Prediction (NCEP) reanalysis, the Water and Global Change (WATCH) project, the thin-plate spline smoothing algorithms (ANUSPLIN) and the Canadian Precipitation Analysis (CaPA). Based on verification criteria for various temporal and spatial scales, the results provide an assessment of possible applications for the various precipitation datasets. For long-term climate variation studies (~100 years), CANGRD, NCEP, WATCH and ANUSPLIN have different comparative advantages in terms of their resolution and accuracy. For synoptic and mesoscale precipitation patterns, CaPA provides appealing spatial coherence. In addition to the products comparison, various downscaling methods are also surveyed to explore new verification and bias-reduction methods for improving gridded precipitation outputs for large-scale models.

  6. Benchmarking Deep Learning Models on Large Healthcare Datasets.

    Science.gov (United States)

    Purushotham, Sanjay; Meng, Chuizheng; Che, Zhengping; Liu, Yan

    2018-06-04

    Deep learning models (also known as deep neural networks) have revolutionized many fields, including computer vision, natural language processing and speech recognition, and are increasingly being used in clinical healthcare applications. However, few works exist that have benchmarked the performance of deep learning models against state-of-the-art machine learning models and prognostic scoring systems on publicly available healthcare datasets. In this paper, we present benchmarking results for several clinical prediction tasks, such as mortality prediction, length-of-stay prediction, and ICD-9 code group prediction, using deep learning models, an ensemble of machine learning models (the Super Learner algorithm), and the SAPS II and SOFA scores. We used the publicly available Medical Information Mart for Intensive Care III (MIMIC-III) (v1.4) dataset, which includes all patients admitted to an ICU at the Beth Israel Deaconess Medical Center from 2001 to 2012, for the benchmarking tasks. Our results show that deep learning models consistently outperform all the other approaches, especially when the 'raw' clinical time series data are used as input features to the models.

  7. Testing the Neutral Theory of Biodiversity with Human Microbiome Datasets.

    Science.gov (United States)

    Li, Lianwei; Ma, Zhanshan Sam

    2016-08-16

    The human microbiome project (HMP) has made it possible to test important ecological theories for arguably the most important ecosystem to human health: the human microbiome. The limited number of existing studies has reported conflicting evidence in the case of the neutral theory; the present study aims to comprehensively test the neutral theory with extensive HMP datasets covering all five major body sites inhabited by the human microbiome. Utilizing 7437 datasets of bacterial community samples, we discovered that only 49 communities (less than 1%) satisfied the neutral theory, and concluded that human microbial communities are not neutral in general. The 49 positive cases, although only a tiny minority, do demonstrate the existence of neutral processes. We realize that the traditional doctrine of microbial biogeography, "Everything is everywhere, but the environment selects", first proposed by Baas-Becking, resolves the apparent contradiction. The first part of the Baas-Becking doctrine states that microbes are not dispersal-limited and therefore are neutral-prone, and the second part reiterates that the freely dispersed microbes must endure selection by the environment. Therefore, in most cases, it is the host environment that ultimately shapes the community assembly and tips the human microbiome toward the niche regime.

  8. Overview of the CERES Edition-4 Multilayer Cloud Property Datasets

    Science.gov (United States)

    Chang, F. L.; Minnis, P.; Sun-Mack, S.; Chen, Y.; Smith, R. A.; Brown, R. R.

    2014-12-01

    Knowledge of the cloud vertical distribution is important for understanding the role of clouds in Earth's radiation budget and climate change. Since high-level cirrus clouds with low emission temperatures and small optical depths can provide a positive feedback to the climate system, and low-level stratus clouds with high emission temperatures and large optical depths can provide a negative feedback effect, the retrieval of multilayer cloud properties using satellite observations, such as Terra and Aqua MODIS, is critically important for a variety of cloud and climate applications. For the objective of the Clouds and the Earth's Radiant Energy System (CERES), new algorithms have been developed using Terra and Aqua MODIS data to allow separate retrievals of cirrus and stratus cloud properties when the two dominant cloud types are simultaneously present in a multilayer system. In this paper, we present an overview of the new CERES Edition-4 multilayer cloud property datasets derived from Terra as well as Aqua. Assessment of the new CERES multilayer cloud datasets will include high-level cirrus and low-level stratus cloud heights, pressures, and temperatures, as well as their optical depths, emissivities, and microphysical properties.

  9. Predicting weather regime transitions in Northern Hemisphere datasets

    Energy Technology Data Exchange (ETDEWEB)

    Kondrashov, D. [University of California, Department of Atmospheric and Oceanic Sciences and Institute of Geophysics and Planetary Physics, Los Angeles, CA (United States); Shen, J. [UCLA, Department of Statistics, Los Angeles, CA (United States); Berk, R. [UCLA, Department of Statistics, Los Angeles, CA (United States); University of Pennsylvania, Department of Criminology, Philadelphia, PA (United States); D'Andrea, F.; Ghil, M. [Ecole Normale Superieure, Departement Terre-Atmosphere-Ocean and Laboratoire de Meteorologie Dynamique (CNRS and IPSL), Paris Cedex 05 (France)

    2007-10-15

    A statistical learning method called random forests is applied to the prediction of transitions between weather regimes of wintertime Northern Hemisphere (NH) atmospheric low-frequency variability. A dataset composed of 55 winters of NH 700-mb geopotential height anomalies is used in the present study. A mixture model finds that the three Gaussian components that were statistically significant in earlier work are robust; they are the Pacific-North American (PNA) regime, its approximate reverse (the reverse PNA, or RNA), and the blocked phase of the North Atlantic Oscillation (BNAO). The most significant and robust transitions in the Markov chain generated by these regimes are PNA → BNAO, PNA → RNA and BNAO → PNA. The break of a regime and the subsequent onset of another are forecast for these three transitions. Taking the relative costs of false positives and false negatives into account, the random-forests method shows useful forecasting skill. The calculations are carried out in the phase space spanned by a few leading empirical orthogonal functions of dataset variability. Plots of estimated response functions to a given predictor confirm the crucial influence of the exit angle on a preferred transition path. This result points to the dynamic origin of the transitions.
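
    A schematic of this classification setup (the features below are synthetic stand-ins for the leading principal components of height anomalies, not the authors' data or tuning):

    ```python
    # Random forest predicting a rare regime transition from a few leading
    # PCs; asymmetric class weights mimic unequal costs of false positives
    # and false negatives.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(2)
    X = rng.normal(size=(500, 4))                  # 4 leading PCs
    y = (X[:, 0] + 0.5 * X[:, 1]
         + rng.normal(scale=0.5, size=500)) > 1.0  # "transition" events

    clf = RandomForestClassifier(n_estimators=200,
                                 class_weight={False: 1, True: 5},
                                 random_state=0)
    clf.fit(X[:400], y[:400])
    print(clf.score(X[400:], y[400:]))             # held-out accuracy
    ```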

  10. Digital Astronaut Photography: A Discovery Dataset for Archaeology

    Science.gov (United States)

    Stefanov, William L.

    2010-01-01

    Astronaut photography acquired from the International Space Station (ISS) using commercial off-the-shelf cameras offers a freely accessible source of high to very high resolution (4-20 m/pixel) visible-wavelength digital data of Earth. Since ISS Expedition 1 in 2000, over 373,000 images of the Earth-Moon system (including land surface, ocean, atmospheric, and lunar images) have been added to the Gateway to Astronaut Photography of Earth online database (http://eol.jsc.nasa.gov). Handheld astronaut photographs vary in look angle, time of acquisition, solar illumination, and spatial resolution. These attributes of digital astronaut photography result from a unique combination of ISS orbital dynamics, mission operations, camera systems, and the individual skills of the astronaut. The variable nature of astronaut photography makes the dataset uniquely useful for archaeological applications in comparison with more traditional nadir-viewing multispectral datasets acquired from unmanned orbital platforms. For example, surface features such as trenches, walls, ruins, urban patterns, and vegetation clearing and regrowth patterns may be accentuated by low sun angles and oblique viewing conditions (Fig. 1). High spatial resolution digital astronaut photographs can also be used with sophisticated land cover classification and spatial analysis approaches like Object Based Image Analysis, increasing the potential for use in archaeological characterization of landscapes and specific sites.

  11. ISC-EHB: Reconstruction of a robust earthquake dataset

    Science.gov (United States)

    Weston, J.; Engdahl, E. R.; Harris, J.; Di Giacomo, D.; Storchak, D. A.

    2018-04-01

    The EHB Bulletin of hypocentres and associated travel-time residuals was originally developed with procedures described by Engdahl, Van der Hilst and Buland (1998) and currently ends in 2008. It is a widely used seismological dataset, which is now expanded and reconstructed, partly by exploiting updated procedures at the International Seismological Centre (ISC), to produce the ISC-EHB. The reconstruction begins in the modern period (2000-2013), to which new and more rigorous procedures for event selection, data preparation, processing, and relocation are applied. The selection criteria minimise the location bias produced by unmodelled 3D Earth structure, resulting in events that are relatively well located in any given region. Depths of the selected events are significantly improved by a more comprehensive review of near-station and secondary-phase travel-time residuals based on ISC data, especially for the depth phases pP, pwP and sP, as well as by a rigorous review of the event depths in subduction zone cross sections. The resulting cross sections and associated maps provide a view of seismicity in subduction zones in much greater detail than previously achievable. The new ISC-EHB dataset will be especially useful for global seismicity studies and high-frequency regional and global tomographic inversions.

  12. Benchmarking undedicated cloud computing providers for analysis of genomic datasets.

    Directory of Open Access Journals (Sweden)

    Seyhan Yazar

    Full Text Available A major bottleneck in biological discovery is now emerging at the computational level. Cloud computing offers a dynamic means whereby small and medium-sized laboratories can rapidly adjust their computational capacity. We benchmarked two established cloud computing services, Amazon Web Services Elastic MapReduce (EMR) on Amazon EC2 instances and Google Compute Engine (GCE), using publicly available genomic datasets (E. coli CC102 strain and a Han Chinese male genome) and a standard bioinformatic pipeline on a Hadoop-based platform. Wall-clock time for complete assembly differed by 52.9% (95% CI: 27.5-78.2) for E. coli and 53.5% (95% CI: 34.4-72.6) for the human genome, with GCE being more efficient than EMR. The cost of running this experiment on EMR and GCE differed significantly, with the costs on EMR being 257.3% (95% CI: 211.5-303.1) and 173.9% (95% CI: 134.6-213.1) more expensive for the E. coli and human assemblies, respectively. Thus, GCE was found to outperform EMR both in terms of cost and wall-clock time. Our findings confirm that cloud computing is an efficient and potentially cost-effective alternative for analysis of large genomic datasets. In addition to releasing our cost-effectiveness comparison, we present ready-to-use scripts for establishing Hadoop instances with Ganglia monitoring on EC2 or GCE.

  13. Condensing Massive Satellite Datasets For Rapid Interactive Analysis

    Science.gov (United States)

    Grant, G.; Gallaher, D. W.; Lv, Q.; Campbell, G. G.; Fowler, C.; LIU, Q.; Chen, C.; Klucik, R.; McAllister, R. A.

    2015-12-01

    Our goal is to enable users to interactively analyze massive satellite datasets, identifying anomalous data or values that fall outside of thresholds. To achieve this, the project seeks to create a derived database containing only the most relevant information, accelerating the analysis process. The database is designed to be an ancillary tool for the researcher, not an archival database to replace the original data. This approach is aimed at improving performance by reducing the overall size by way of condensing the data. The primary challenges of the project include: the nature of the research question(s) may not be known ahead of time; the thresholds for determining anomalies may be uncertain; there are problems associated with processing cloudy, missing, or noisy satellite imagery; and the contents and method of creation of the condensed dataset must be easily explainable to users. The architecture of the database reorganizes spatially-oriented satellite imagery into temporally-oriented columns of data (a.k.a. "data rods") to facilitate time-series analysis, as sketched below. The database itself is an open-source parallel database, designed to make full use of clustered server technologies. A demonstration of the system capabilities will be shown. Applications for this technology include quick-look views of the data, as well as the potential for on-board satellite processing of essential information, with the goal of reducing data latency.
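
    The "data rods" reorganization itself is simple to state; a toy sketch (array sizes and the threshold are arbitrary):

    ```python
    # Reorganize a spatially-oriented image stack (time, y, x) into
    # temporally-oriented per-pixel columns ("data rods") and screen
    # each rod against a threshold.
    import numpy as np

    rng = np.random.default_rng(3)
    stack = rng.normal(size=(365, 100, 100))   # one year of daily scenes

    rods = stack.reshape(stack.shape[0], -1).T # shape (pixels, time)

    threshold = 4.0
    flagged = np.where(np.abs(rods).max(axis=1) > threshold)[0]
    print(flagged.size, "pixels ever exceed the threshold")
    ```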

  14. NMFS Bottom Longline Analytical Dataset Provided to NRDA

    Data.gov (United States)

    National Oceanic and Atmospheric Administration, Department of Commerce — The Southeast Fisheries Science Center Mississippi Laboratories has conducted standardized bottom longline surveys in the Gulf of Mexico and South Atlantic since...

  15. Malaysian sign language dataset for automatic sign language ...

    African Journals Online (AJOL)

    Journal of Fundamental and Applied Sciences. … SL recognition system based on the Malaysian Sign Language (MSL). Implementation results are described. Keywords: sign language; pattern classification; database.

  16. NERIES: Seismic Data Gateways and User Composed Datasets Metadata Management

    Science.gov (United States)

    Spinuso, Alessandro; Trani, Luca; Kamb, Linus; Frobert, Laurent

    2010-05-01

    One of the main objectives of the NERIES EC project is to establish and improve the networking of seismic waveform data exchange and access among four main data centers in Europe: INGV, GFZ, ORFEUS and IPGP. Besides the implementation of the data backbone, several investigations and developments have been conducted in order to offer users the data available from this network, either programmatically or interactively. One of the challenges is to understand how to enable users' activities such as discovering, aggregating, describing and sharing datasets, so as to decrease the replication of similar data queries towards the network and exempt the data centers from having to guess at and create useful pre-packaged products. We've started to transfer this task more and more towards the users community, where the users' composed data products could be extensively re-used. The main link to the data is represented by a centralized webservice (SeismoLink) acting like a single access point to the whole data network. Users can download either waveform data or seismic station inventories directly from their own software routines by connecting to this webservice, which routes the request to the data centers. The provenance of the data is maintained and transferred to the users in the form of URIs, which identify the dataset and implicitly refer to the data provider. SeismoLink, combined with other webservices (e.g. the EMSC-QuakeML earthquakes catalog service), is used from a community gateway such as the NERIES web portal (http://www.seismicportal.eu). Here the user interacts with a map-based portlet which allows the dynamic composition of a data product, binding seismic event's parameters with a set of seismic stations. The requested data is collected by the back-end processes of the portal, preserved and offered to the user in a personal data cart, where metadata can be generated interactively on demand. The metadata, expressed in RDF, can also be remotely ingested. They offer rating …

  17. Gridded 5km GHCN-Daily Temperature and Precipitation Dataset, Version 1

    Data.gov (United States)

    National Oceanic and Atmospheric Administration, Department of Commerce — The Gridded 5km GHCN-Daily Temperature and Precipitation Dataset (nClimGrid) consists of four climate variables derived from the GHCN-D dataset: maximum temperature,...

  18. Active Semisupervised Clustering Algorithm with Label Propagation for Imbalanced and Multidensity Datasets

    Directory of Open Access Journals (Sweden)

    Mingwei Leng

    2013-01-01

    Full Text Available The accuracy of most existing semisupervised clustering algorithms based on a small labeled dataset is low when dealing with multidensity and imbalanced datasets, and labeling data is quite expensive and time consuming in many real-world applications. This paper focuses on active data selection and semisupervised clustering in multidensity and imbalanced datasets and proposes an active semisupervised clustering algorithm. The proposed algorithm uses an active mechanism for data selection to minimize the amount of labeled data, and it utilizes a multithreshold scheme to expand labeled datasets on multidensity and imbalanced datasets. Three standard datasets and one synthetic dataset are used to demonstrate the proposed algorithm, and the experimental results show that the proposed semisupervised clustering algorithm has higher accuracy and more stable performance in comparison to other clustering and semisupervised clustering algorithms, especially when the datasets are multidensity and imbalanced.
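
    A minimal illustration of the label-propagation ingredient (scikit-learn's LabelSpreading on a toy dataset; the paper's active selection mechanism and multithreshold expansion are not reproduced here):

    ```python
    # Semi-supervised clustering from a tiny labeled seed set:
    # labels spread from 5 labeled points to 295 unlabeled ones.
    import numpy as np
    from sklearn.datasets import make_moons
    from sklearn.semi_supervised import LabelSpreading

    X, y = make_moons(n_samples=300, noise=0.08, random_state=4)
    labels = np.full(300, -1)      # -1 marks unlabeled points
    labels[:5] = y[:5]             # small labeled seed set

    model = LabelSpreading(kernel='knn', n_neighbors=7).fit(X, labels)
    print((model.transduction_ == y).mean())   # agreement with truth
    ```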

  19. Dataset for Probabilistic estimation of residential air exchange rates for population-based exposure modeling

    Data.gov (United States)

    U.S. Environmental Protection Agency — This dataset provides the city-specific air exchange rate measurements, modeled, literature-based as well as housing characteristics. This dataset is associated with...

  20. Ecohydrological Index, Native Fish, and Climate Trends and Relationships in the Kansas River Basin_dataset

    Data.gov (United States)

    U.S. Environmental Protection Agency — The dataset is an excel file that contain data for the figures in the manuscript. This dataset is associated with the following publication: Sinnathamby, S., K....

  1. Global Human Built-up And Settlement Extent (HBASE) Dataset From Landsat

    Data.gov (United States)

    National Aeronautics and Space Administration — The Global Human Built-up And Settlement Extent (HBASE) Dataset from Landsat is a global map of HBASE derived from the Global Land Survey (GLS) Landsat dataset for...

  2. An Automatic Matcher and Linker for Transportation Datasets

    Directory of Open Access Journals (Sweden)

    Ali Masri

    2017-01-01

    Full Text Available Multimodality requires the integration of heterogeneous transportation data to construct a broad view of the transportation network. Many new transportation services are emerging while being isolated from previously-existing networks. This leads them to publish their data sources to the web, according to linked data principles, in order to gain visibility. Our interest is to use these data to construct an extended transportation network that links these new services to existing ones. The main problems we tackle in this article fall in the categories of automatic schema matching and data interlinking. We propose an approach that uses web services as mediators to help in automatically detecting geospatial properties and mapping them between two different schemas. On the other hand, we propose a new interlinking approach that enables the user to define rich semantic links between datasets in a flexible and customizable way.

  3. [Parallel virtual reality visualization of extreme large medical datasets].

    Science.gov (United States)

    Tang, Min

    2010-04-01

    On the basis of a brief description of grid computing, the essence and critical techniques of parallel visualization of extremely large medical datasets are discussed in connection with the intranet and common-configuration computers of hospitals. Several kernel techniques are introduced in this paper, including the hardware structure, software framework, load balance and virtual reality visualization. The Maximum Intensity Projection algorithm is realized in parallel using a common PC cluster. In the virtual reality world, three-dimensional models can be rotated, zoomed, translated and cut interactively and conveniently through the control panel built on the Virtual Reality Modeling Language (VRML). Experimental results demonstrate that this method provides promising, real-time results, playing the role of a good assistant in making clinical diagnoses.
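
    The serial core of the Maximum Intensity Projection algorithm is a one-line reduction; a sketch on a synthetic volume (not the paper's cluster-parallel implementation, which would split the volume into slabs and combine per-slab maxima):

    ```python
    # Maximum Intensity Projection: project a 3-D volume onto a plane
    # by taking the maximum voxel value along the viewing axis.
    import numpy as np

    rng = np.random.default_rng(5)
    volume = rng.random(size=(128, 256, 256))  # synthetic volume (z, y, x)

    mip_axial = volume.max(axis=0)             # axial MIP, shape (256, 256)
    print(mip_axial.shape)
    ```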

  4. The wildland-urban interface raster dataset of Catalonia.

    Science.gov (United States)

    Alcasena, Fermín J; Evers, Cody R; Vega-Garcia, Cristina

    2018-04-01

    We provide the wildland urban interface (WUI) map of the autonomous community of Catalonia (Northeastern Spain). The map encompasses an area of some 3.21 million ha and is presented as a 150-m resolution raster dataset. Individual housing location, structure density and vegetation cover data were used to spatially assess in detail the interface, intermix and dispersed rural WUI communities with a geographical information system. Most WUI areas concentrate in the coastal belt where suburban sprawl has occurred nearby or within unmanaged forests. This geospatial information data provides an approximation of residential housing potential for loss given a wildfire, and represents a valuable contribution to assist landscape and urban planning in the region.

  5. xarray: N-D labeled Arrays and Datasets in Python

    Directory of Open Access Journals (Sweden)

    Stephan Hoyer

    2017-04-01

    Full Text Available xarray is an open source project and Python package that provides a toolkit and data structures for N-dimensional labeled arrays. Our approach combines an application programming interface (API) inspired by pandas with the Common Data Model for self-described scientific data. Key features of the xarray package include label-based indexing and arithmetic, interoperability with the core scientific Python packages (e.g., pandas, NumPy, Matplotlib), out-of-core computation on datasets that don’t fit into memory, a wide range of serialization and input/output (I/O) options, and advanced multi-dimensional data manipulation tools such as group-by and resampling. xarray, as a data model and analytics toolkit, has been widely adopted in the geoscience community but is also used more broadly for multi-dimensional data analysis in physics, machine learning and finance.
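
    The features named above in a few lines (synthetic data; xarray and pandas are assumed to be installed):

    ```python
    # Label-based indexing, resampling, and group-by on an N-D labeled array.
    import numpy as np
    import pandas as pd
    import xarray as xr

    time = pd.date_range("2000-01-01", periods=365)
    da = xr.DataArray(
        np.random.rand(365, 3, 4),
        coords={"time": time, "lat": [10.0, 20.0, 30.0],
                "lon": [0.0, 5.0, 10.0, 15.0]},
        dims=("time", "lat", "lon"),
        name="precip",
    )

    summer = da.sel(lat=20.0, time=slice("2000-06-01", "2000-08-31"))
    monthly = da.resample(time="MS").mean()     # monthly means
    seasons = da.groupby("time.season").mean()  # seasonal climatology
    print(summer.sizes, monthly.sizes)
    ```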

  6. The wildland-urban interface raster dataset of Catalonia

    Directory of Open Access Journals (Sweden)

    Fermín J. Alcasena

    2018-04-01

    Full Text Available We provide the wildland urban interface (WUI) map of the autonomous community of Catalonia (Northeastern Spain). The map encompasses an area of some 3.21 million ha and is presented as a 150-m resolution raster dataset. Individual housing location, structure density and vegetation cover data were used to spatially assess in detail the interface, intermix and dispersed rural WUI communities with a geographical information system. Most WUI areas concentrate in the coastal belt where suburban sprawl has occurred nearby or within unmanaged forests. This geospatial information data provides an approximation of residential housing potential for loss given a wildfire, and represents a valuable contribution to assist landscape and urban planning in the region. Keywords: Wildland-urban interface, Wildfire risk, Urban planning, Human communities, Catalonia

  7. Reconstructing flaw image using dataset of full matrix capture technique

    Energy Technology Data Exchange (ETDEWEB)

    Lee, Tae Hun; Kim, Yong Sik; Lee, Jeong Seok [KHNP Central Research Institute, Daejeon (Korea, Republic of)

    2017-02-15

    A conventional phased array ultrasonic system offers the ability to steer an ultrasonic beam by applying independent time delays to the individual elements in the array and produce an ultrasonic image. In contrast, full matrix capture (FMC) is a data acquisition process that collects a complete matrix of A-scans from every possible independent transmit-receive combination in a phased array transducer. With post-processing, it makes it possible to reconstruct various images that cannot be produced by a conventional phased array, as well as images equivalent to a conventional phased array image. In this paper, a basic algorithm based on the LLL-mode total focusing method (TFM) that can image crack-type flaws is described. This technique was then applied to reconstruct flaw images from the FMC datasets obtained from experiments and ultrasonic simulation.
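
    A schematic delay-and-sum TFM over a pixel grid (the geometry, wave speed, and FMC array below are placeholders; a real implementation would use measured A-scans and envelope detection):

    ```python
    # Total focusing method: for each image pixel, sum the FMC A-scans
    # fmc[i, j, t] at the transmit+receive travel time to that pixel.
    import numpy as np

    c, fs = 5900.0, 50e6                 # wave speed (m/s), sampling (Hz)
    n_el, n_t, pitch = 16, 2000, 0.6e-3
    elems = (np.arange(n_el) - (n_el - 1) / 2) * pitch  # element x, z=0

    fmc = np.random.rand(n_el, n_el, n_t)               # placeholder data

    xs = np.linspace(-5e-3, 5e-3, 64)
    zs = np.linspace(1e-3, 20e-3, 64)
    image = np.zeros((zs.size, xs.size))
    tx = np.arange(n_el)[:, None]
    rx = np.arange(n_el)[None, :]

    for iz, z in enumerate(zs):
        for ix, x in enumerate(xs):
            t_el = np.sqrt((x - elems) ** 2 + z ** 2) / c    # element->pixel
            idx = np.round((t_el[:, None] + t_el[None, :]) * fs).astype(int)
            idx = np.clip(idx, 0, n_t - 1)
            image[iz, ix] = abs(fmc[tx, rx, idx].sum())
    print(image.shape)
    ```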

  8. Survey dataset on occupational hazards on construction sites

    Directory of Open Access Journals (Sweden)

    Patience F. Tunji-Olayeni

    2018-06-01

    Full Text Available The construction site provides unfriendly working conditions, exposing workers to one of the harshest workplace environments. For this dataset, a structured questionnaire was designed and administered to thirty-five (35) craftsmen, selected through a purposive sampling technique, on various construction sites in one of the most populous cities in sub-Saharan Africa. The set of descriptive statistics is presented with tables, stacked bar charts and pie charts. Common occupational health conditions affecting the cardiovascular, respiratory and musculoskeletal systems of craftsmen on construction sites were identified. The effects of occupational health hazards on craftsmen and on construction project performance can be determined when the data are analyzed. Moreover, contractors’ commitment to occupational health and safety (OHS) can be obtained from the analysis of the survey data. Keywords: Accidents, Construction industry, Craftsmen, Health, Occupational hazards

  9. Feedback control in deep drawing based on experimental datasets

    Science.gov (United States)

    Fischer, P.; Heingärtner, J.; Aichholzer, W.; Hortig, D.; Hora, P.

    2017-09-01

    In large-scale production of deep drawing parts, as in the automotive industry, the effects of scattering material properties as well as the warming of the tools have a significant impact on the drawing result. Within the scope of this work, an approach is presented to minimize the influence of these effects on part quality by optically measuring the draw-in of each part and adjusting the settings of the press to keep the strain distribution, which is represented by the draw-in, inside a certain limit. For the design of the control algorithm, a design of experiments for in-line tests is used to quantify the influence of the blank holder force as well as the force distribution on the draw-in. The results of this experimental dataset are used to model the process behavior. Based on this model, a feedback control loop is designed. Finally, the performance of the control algorithm is validated in the production line.
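
    A toy version of such a loop (the linear process gain, drift rate, and controller constant below are invented purely for illustration, not taken from the paper):

    ```python
    # Integral feedback on the blank-holder force so that the measured
    # draw-in tracks a setpoint despite thermal drift of the process.
    gain = 0.02     # mm draw-in change per kN force change (assumed)
    drift = -0.005  # mm draw-in drift per part as tools warm up (assumed)
    ki = 5.0        # integral controller constant (assumed)

    setpoint, draw_in, force = 10.0, 10.0, 300.0
    for part in range(200):
        draw_in += drift                 # uncontrolled process drift
        error = setpoint - draw_in       # optical draw-in measurement
        force += ki * error              # press setting adjustment
        draw_in += gain * ki * error     # process response to adjustment
    print(round(draw_in, 3), round(force, 1))
    ```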

  10. Orthology detection combining clustering and synteny for very large datasets.

    Science.gov (United States)

    Lechner, Marcus; Hernandez-Rosales, Maribel; Doerr, Daniel; Wieseke, Nicolas; Thévenin, Annelyse; Stoye, Jens; Hartmann, Roland K; Prohaska, Sonja J; Stadler, Peter F

    2014-01-01

    The elucidation of orthology relationships is an important step both in gene function prediction as well as towards understanding patterns of sequence evolution. Orthology assignments are usually derived directly from sequence similarities for large data because more exact approaches exhibit too high computational costs. Here we present PoFF, an extension for the standalone tool Proteinortho, which enhances orthology detection by combining clustering, sequence similarity, and synteny. In the course of this work, FFAdj-MCS, a heuristic that assesses pairwise gene order using adjacencies (a similarity measure related to the breakpoint distance) was adapted to support multiple linear chromosomes and extended to detect duplicated regions. PoFF largely reduces the number of false positives and enables more fine-grained predictions than purely similarity-based approaches. The extension maintains the low memory requirements and the efficient concurrency options of its basis Proteinortho, making the software applicable to very large datasets.

  11. Orthology detection combining clustering and synteny for very large datasets.

    Directory of Open Access Journals (Sweden)

    Marcus Lechner

    Full Text Available The elucidation of orthology relationships is an important step both in gene function prediction as well as towards understanding patterns of sequence evolution. Orthology assignments are usually derived directly from sequence similarities for large data because more exact approaches exhibit too high computational costs. Here we present PoFF, an extension for the standalone tool Proteinortho, which enhances orthology detection by combining clustering, sequence similarity, and synteny. In the course of this work, FFAdj-MCS, a heuristic that assesses pairwise gene order using adjacencies (a similarity measure related to the breakpoint distance), was adapted to support multiple linear chromosomes and extended to detect duplicated regions. PoFF largely reduces the number of false positives and enables more fine-grained predictions than purely similarity-based approaches. The extension maintains the low memory requirements and the efficient concurrency options of its basis Proteinortho, making the software applicable to very large datasets.

  12. Comprehensive comparison of large-scale tissue expression datasets

    DEFF Research Database (Denmark)

    Santos Delgado, Alberto; Tsafou, Kalliopi; Stolte, Christian

    2015-01-01

    For tissues to carry out their functions, they rely on the right proteins to be present. Several high-throughput technologies have been used to map out which proteins are expressed in which tissues; however, the data have not previously been systematically compared and integrated. We present a comprehensive evaluation of tissue expression data from a variety of experimental techniques and show that these agree surprisingly well with each other and with results from literature curation and text mining. We further found that most datasets support the assumed but not demonstrated distinction between … All the scored and integrated data are made available through a single user-friendly web interface (http://tissues.jensenlab.org).

  13. Knowledge discovery in large model datasets in the marine environment: the THREDDS Data Server example

    Directory of Open Access Journals (Sweden)

    A. Bergamasco

    2012-06-01

    Full Text Available In order to monitor, describe and understand the marine environment, many research institutions are involved in the acquisition and distribution of ocean data, both from observations and models. Scientists from these institutions are spending too much time looking for, accessing, and reformatting data: they need better tools and procedures to make the science they do more efficient. The U.S. Integrated Ocean Observing System (US-IOOS) is working on making large amounts of distributed data usable in an easy and efficient way. It is essentially a network of scientists, technicians and technologies designed to acquire, collect and disseminate observational and modelled data resulting from investigations of coastal and oceanic marine regions to researchers, stakeholders and policy makers. In order to be successful, this effort requires standard data protocols, web services and standards-based tools. Starting from the US-IOOS approach, which is being adopted throughout much of the oceanographic and meteorological sectors, we describe here the CNR-ISMAR Venice experience in the direction of setting up a national Italian IOOS framework using the THREDDS (THematic Real-time Environmental Distributed Data Services) Data Server (TDS), a middleware designed to fill the gap between data providers and data users. The TDS provides services that allow data users to find the data sets pertaining to their scientific needs, to access them, to visualize them and to use them in an easy way, without downloading files to the local workspace. In order to achieve this, it is necessary that the data providers make their data available in a standard form that the TDS understands, and with sufficient metadata to allow the data to be read and searched in a standard way. The core idea is then to utilize a Common Data Model (CDM), a unified conceptual model that describes different datatypes within each dataset. More specifically, Unidata (www.unidata.ucar.edu) has developed the CDM …
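
    In practice, "access without downloading files" means opening a TDS OPeNDAP endpoint directly; a hedged sketch (the URL and variable name below are placeholders, not a real NERIES or CNR-ISMAR endpoint, and a netCDF-capable backend is assumed):

    ```python
    # Lazy remote access to a TDS-served dataset via OPeNDAP.
    import xarray as xr

    url = "https://example.org/thredds/dodsC/adriatic/model_output.nc"
    ds = xr.open_dataset(url)               # reads metadata only
    sst = ds["sst"].sel(time="2012-06-01")  # subsetting triggers the read
    ```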

  14. The SAIL databank: linking multiple health and social care datasets

    Directory of Open Access Journals (Sweden)

    Ford David V

    2009-01-01

    Full Text Available Abstract Background Vast amounts of data are collected about patients and service users in the course of health and social care service delivery. Electronic data systems for patient records have the potential to revolutionise service delivery and research. But in order to achieve this, it is essential that the ability to link the data at the individual record level be retained whilst adhering to the principles of information governance. The SAIL (Secure Anonymised Information Linkage) databank has been established using disparate datasets, and over 500 million records from multiple health and social care service providers have been loaded to date, with further growth in progress. Methods Having established the infrastructure of the databank, the aim of this work was to develop and implement an accurate matching process to enable the assignment of a unique Anonymous Linking Field (ALF) to person-based records to make the databank ready for record-linkage research studies. An SQL-based matching algorithm (MACRAL, Matching Algorithm for Consistent Results in Anonymised Linkage) was developed for this purpose. Firstly the suitability of using a valid NHS number as the basis of a unique identifier was assessed using MACRAL. Secondly, MACRAL was applied in turn to match primary care, secondary care and social services datasets to the NHS Administrative Register (NHSAR), to assess the efficacy of this process, and the optimum matching technique. Results The validation of using the NHS number yielded specificity values > 99.8% and sensitivity values > 94.6% using probabilistic record linkage (PRL) at the 50% threshold, and error rates were … Conclusion With the infrastructure that has been put in place, the reliable matching process that has been developed enables an ALF to be consistently allocated to records in the databank. The SAIL databank represents a research-ready platform for record-linkage studies.

  15. Analysis of Public Datasets for Wearable Fall Detection Systems.

    Science.gov (United States)

    Casilari, Eduardo; Santoyo-Ramón, José-Antonio; Cano-García, José-Manuel

    2017-06-27

    Due to the boom of wireless handheld devices such as smartwatches and smartphones, wearable Fall Detection Systems (FDSs) have become a major focus of attention among the research community in recent years. The effectiveness of a wearable FDS must be contrasted against a wide variety of measurements obtained from inertial sensors during the occurrence of falls and Activities of Daily Living (ADLs). In this regard, access to public databases constitutes the basis for an open and systematic assessment of fall detection techniques. This paper reviews and appraises twelve existing available data repositories containing measurements of ADLs and emulated falls envisaged for the evaluation of fall detection algorithms in wearable FDSs. The analysis of the found datasets is performed in a comprehensive way, taking into account the multiple factors involved in the definition of the testbeds deployed for the generation of the mobility samples. The study of the traces brings to light the lack of a common experimental benchmarking procedure and, consequently, the large heterogeneity of the datasets from a number of perspectives (length and number of samples, typology of the emulated falls and ADLs, characteristics of the test subjects, features and positions of the sensors, etc.). Concerning this, the statistical analysis of the samples reveals the impact of the sensor range on the reliability of the traces. In addition, the study evidences the importance of the selection of the ADLs and the need to categorize the ADLs depending on the intensity of the movements in order to evaluate the capability of a certain detection algorithm to discriminate falls from ADLs.

  16. Clusterflock: a flocking algorithm for isolating congruent phylogenomic datasets.

    Science.gov (United States)

    Narechania, Apurva; Baker, Richard; DeSalle, Rob; Mathema, Barun; Kolokotronis, Sergios-Orestis; Kreiswirth, Barry; Planet, Paul J

    2016-10-24

    Collective animal behavior, such as the flocking of birds or the shoaling of fish, has inspired a class of algorithms designed to optimize distance-based clusters in various applications, including document analysis and DNA microarrays. In a flocking model, individual agents respond only to their immediate environment and move according to a few simple rules. After several iterations the agents self-organize, and clusters emerge without the need for partitional seeds. In addition to its unsupervised nature, flocking offers several computational advantages, including the potential to reduce the number of required comparisons. In the tool presented here, Clusterflock, we have implemented a flocking algorithm designed to locate groups (flocks) of orthologous gene families (OGFs) that share an evolutionary history. Pairwise distances that measure phylogenetic incongruence between OGFs guide flock formation. We tested this approach on several simulated datasets by varying the number of underlying topologies, the proportion of missing data, and evolutionary rates, and show that in datasets containing high levels of missing data and rate heterogeneity, Clusterflock outperforms other well-established clustering techniques. We also verified its utility on a known, large-scale recombination event in Staphylococcus aureus. By isolating sets of OGFs with divergent phylogenetic signals, we were able to pinpoint the recombined region without forcing a pre-determined number of groupings or defining a pre-determined incongruence threshold. Clusterflock is an open-source tool that can be used to discover horizontally transferred genes, recombined areas of chromosomes, and the phylogenetic 'core' of a genome. Although we used it here in an evolutionary context, it is generalizable to any clustering problem. Users can write extensions to calculate any distance metric on the unit interval, and can use these distances to 'flock' any type of data.

  17. The SAIL databank: linking multiple health and social care datasets.

    Science.gov (United States)

    Lyons, Ronan A; Jones, Kerina H; John, Gareth; Brooks, Caroline J; Verplancke, Jean-Philippe; Ford, David V; Brown, Ginevra; Leake, Ken

    2009-01-16

    Vast amounts of data are collected about patients and service users in the course of health and social care service delivery. Electronic data systems for patient records have the potential to revolutionise service delivery and research. But in order to achieve this, it is essential that the ability to link the data at the individual record level be retained whilst adhering to the principles of information governance. The SAIL (Secure Anonymised Information Linkage) databank has been established using disparate datasets, and over 500 million records from multiple health and social care service providers have been loaded to date, with further growth in progress. Having established the infrastructure of the databank, the aim of this work was to develop and implement an accurate matching process to enable the assignment of a unique Anonymous Linking Field (ALF) to person-based records to make the databank ready for record-linkage research studies. An SQL-based matching algorithm (MACRAL, Matching Algorithm for Consistent Results in Anonymised Linkage) was developed for this purpose. Firstly the suitability of using a valid NHS number as the basis of a unique identifier was assessed using MACRAL. Secondly, MACRAL was applied in turn to match primary care, secondary care and social services datasets to the NHS Administrative Register (NHSAR), to assess the efficacy of this process, and the optimum matching technique. The validation of using the NHS number yielded specificity values > 99.8% and sensitivity values > 94.6% using probabilistic record linkage (PRL) at the 50% threshold, and error rates were … The SAIL databank represents a research-ready platform for record-linkage studies.

  18. Analysis of Public Datasets for Wearable Fall Detection Systems

    Directory of Open Access Journals (Sweden)

    Eduardo Casilari

    2017-06-01

    Full Text Available Due to the boom of wireless handheld devices such as smartwatches and smartphones, wearable Fall Detection Systems (FDSs) have become a major focus of attention among the research community in recent years. The effectiveness of a wearable FDS must be contrasted against a wide variety of measurements obtained from inertial sensors during the occurrence of falls and Activities of Daily Living (ADLs). In this regard, access to public databases constitutes the basis for an open and systematic assessment of fall detection techniques. This paper reviews and appraises twelve existing available data repositories containing measurements of ADLs and emulated falls envisaged for the evaluation of fall detection algorithms in wearable FDSs. The analysis of the found datasets is performed in a comprehensive way, taking into account the multiple factors involved in the definition of the testbeds deployed for the generation of the mobility samples. The study of the traces brings to light the lack of a common experimental benchmarking procedure and, consequently, the large heterogeneity of the datasets from a number of perspectives (length and number of samples, typology of the emulated falls and ADLs, characteristics of the test subjects, features and positions of the sensors, etc.). Concerning this, the statistical analysis of the samples reveals the impact of the sensor range on the reliability of the traces. In addition, the study evidences the importance of the selection of the ADLs and the need to categorize the ADLs depending on the intensity of the movements in order to evaluate the capability of a certain detection algorithm to discriminate falls from ADLs.

  19. A high quality finger vascular pattern dataset collected using a custom designed capturing device

    NARCIS (Netherlands)

    Ton, B.T.; Veldhuis, Raymond N.J.

    2013-01-01

    The number of finger vascular pattern datasets available to the research community is scarce, therefore a new finger vascular pattern dataset containing 1440 images is presented. This dataset is unique in its kind, as the images are of high resolution and have a known pixel density. Furthermore this …

  20. Something From Nothing (There): Collecting Global IPv6 Datasets from DNS

    NARCIS (Netherlands)

    Fiebig, T.; Borgolte, Kevin; Hao, Shuang; Kruegel, Christopher; Vigna, Giovanny; Spring, Neil; Riley, George F.

    2017-01-01

    Current large-scale IPv6 studies mostly rely on non-public datasets, as most public datasets are domain specific. For instance, traceroute-based datasets are biased toward network equipment. In this paper, we present a new methodology to collect IPv6 address datasets that does not require access to …

  1. High End Visualization of Geophysical Datasets Using Immersive Technology: The SIO Visualization Center.

    Science.gov (United States)

    Newman, R. L.

    2002-12-01

    How many images can you display at one time with PowerPoint without getting "postage stamps"? Do you have fantastic datasets that you cannot view because your computer is too slow or too small? Do you assume a few 2-D images of a 3-D picture are sufficient? High-end visualization centers can minimize and often eliminate these problems. The new visualization center [http://siovizcenter.ucsd.edu] at Scripps Institution of Oceanography [SIO] immerses users in a virtual world by projecting 3-D images onto a Panoram GVR-120E wall-sized floor-to-ceiling curved screen [7' x 23'] that has 3.2 megapixels of resolution. The Infinite Reality graphics subsystem is driven by a single-pipe SGI Onyx 3400 with a system bandwidth of 44 Gbps. The Onyx is powered by 16 MIPS R12K processors and 16 GB of addressable memory. The system is also equipped with transmitters and LCD shutter glasses which permit stereographic 3-D viewing of high-resolution images. This center is ideal for groups of up to 60 people who can simultaneously view these large-format images. A wide range of hardware and software is available, giving users a totally immersive working environment in which to display, analyze, and discuss large datasets. The system enables simultaneous display of video and audio streams from sources such as SGI megadesktop and stereo megadesktop, S-VHS video, DVD video, and video from a Macintosh or PC. For instance, one-third of the screen might display S-VHS video from a remotely operated vehicle [ROV], while the remaining portion of the screen might be used for an interactive 3-D flight over the same parcel of seafloor. The video and audio combinations possible with this system are numerous, allowing users to combine and explore data and images in innovative ways, greatly enhancing scientists' ability to visualize, understand and collaborate on complex datasets. In the not-distant future, with the rapid growth in networking speeds in the US, it will be possible for Earth Sciences …

  2. Information Science: Science or Social Science?

    OpenAIRE

    Sreeramana Aithal; Paul P.K.,; Bhuimali A.

    2017-01-01

    Collection, selection, processing, management, and dissemination of information are the main and ultimate roles of Information Science and of similar fields such as Information Studies, Information Management, Library Science, and Communication Science. However, Information Science has characteristics that differ from those of these subjects. It is a highly interdisciplinary science, combining many knowledge clusters and domains. Information Science is a broad disci...

  3. seNorge2 daily precipitation, an observational gridded dataset over Norway from 1957 to the present day

    Science.gov (United States)

    Lussana, Cristian; Saloranta, Tuomo; Skaugen, Thomas; Magnusson, Jan; Tveito, Ole Einar; Andersen, Jess

    2018-02-01

    The conventional climate gridded datasets based on observations only are widely used in the atmospheric sciences; our focus in this paper is on climate and hydrology. On the Norwegian mainland, seNorge2 provides high-resolution fields of daily total precipitation for applications requiring long-term datasets at regional or national level, where the challenge is to simulate small-scale processes often taking place in complex terrain. The dataset constitutes a valuable meteorological input for snow and hydrological simulations; it is updated daily and presented on a high-resolution grid (1 km grid spacing). The climate archive goes back to 1957. The spatial interpolation scheme builds upon classical methods, such as optimal interpolation and successive-correction schemes. An original approach based on (spatial) scale-separation concepts has been implemented which uses geographical coordinates and elevation as complementary information in the interpolation. seNorge2 daily precipitation fields represent local precipitation features at spatial scales of a few kilometers, depending on the station network density. In the surroundings of a station or in dense station areas, the predictions are quite accurate even for intense precipitation. For most of the grid points, the performances are comparable to or better than a state-of-the-art pan-European dataset (E-OBS), because of the higher effective resolution of seNorge2. However, in very data-sparse areas, such as in the mountainous region of southern Norway, seNorge2 underestimates precipitation because it does not make use of enough geographical information to compensate for the lack of observations. The evaluation of seNorge2 as the meteorological forcing for the seNorge snow model and the DDD (Distance Distribution Dynamics) rainfall-runoff model shows that both models have been able to make profitable use of seNorge2, partly because of the automatic calibration procedure they incorporate for precipitation. The seNorge2 …
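
    The optimal interpolation at the core of such schemes can be written in a few lines; a one-dimensional sketch with synthetic numbers (not the seNorge2 configuration):

    ```python
    # 1-D optimal interpolation: xa = xb + K (yo - H xb), with a Gaussian
    # background-error correlation model.
    import numpy as np

    x = np.linspace(0.0, 100.0, 101)     # analysis grid (km)
    xb = np.zeros_like(x)                # background (first-guess) field
    obs_pos = np.array([30.0, 60.0])     # observation locations (km)
    yo = np.array([3.0, 7.0])            # observed values

    L, sig_b2, sig_o2 = 20.0, 1.0, 0.1   # corr. length, error variances
    B_go = sig_b2 * np.exp(-0.5 * ((x[:, None] - obs_pos[None, :]) / L) ** 2)
    B_oo = sig_b2 * np.exp(-0.5 * ((obs_pos[:, None] - obs_pos[None, :]) / L) ** 2)

    K = B_go @ np.linalg.inv(B_oo + sig_o2 * np.eye(len(yo)))  # gain
    xa = xb + K @ (yo - np.interp(obs_pos, x, xb))             # analysis
    print(xa[30], xa[60])                # close to the observed values
    ```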

  4. A Dataset for Three-Dimensional Distribution of 39 Elements Including Plant Nutrients and Other Metals and Metalloids in the Soils of a Forested Headwater Catchment.

    Science.gov (United States)

    Wu, B; Wiekenkamp, I; Sun, Y; Fisher, A S; Clough, R; Gottselig, N; Bogena, H; Pütz, T; Brüggemann, N; Vereecken, H; Bol, R

    2017-11-01

    Quantification and evaluation of elemental distribution in forested ecosystems are key requirements to understand element fluxes and their relationship with hydrological and biogeochemical processes in the system. However, datasets supporting such a study on the catchment scale are still limited. Here we provide a dataset comprising spatially highly resolved distributions of 39 elements in soil profiles of a small forested headwater catchment in western Germany, to gain a holistic picture of the state and fluxes of elements in the catchment. The elements include both plant nutrients and other metals and metalloids that were predominately derived from lithospheric or anthropogenic inputs, thereby allowing us to not only capture the nutrient status of the catchment but also to estimate the functional development of the ecosystem. Soil samples were collected at high lateral resolution (≤60 m), and element concentrations were determined vertically for four soil horizons (L/Of, Oh, A, B). From this, a three-dimensional view of the distribution of these elements could be established with high spatial resolution on the catchment scale in a temperate natural forested ecosystem. The dataset can be combined with other datasets and studies of the TERENO (Terrestrial Environmental Observatories) Data Discovery Portal to reveal elemental fluxes, establish relations between elements and other soil properties, and/or serve as input for modeling elemental cycling in temperate forested ecosystems.

  5. Privacy preserving data anonymization of spontaneous ADE reporting system dataset.

    Science.gov (United States)

    Lin, Wen-Yang; Yang, Duen-Chuan; Wang, Jie-Teng

    2016-07-18

    To facilitate long-term safety surveillance of marketed drugs, many spontaneous reporting systems (SRSs) for ADR events have been established worldwide. Since the data collected by SRSs contain sensitive personal health information that should be protected to prevent the identification of individuals, this raises the issue of privacy-preserving data publishing (PPDP), that is, how to sanitize (anonymize) raw data before publishing. Although much work has been done on PPDP, very few studies have focused on protecting the privacy of SRS data, and none of the existing anonymization methods is suitable for SRS datasets, which exhibit characteristics such as rare events, multiple individual records, and multi-valued sensitive attributes. We propose a new privacy model called MS(k, θ*)-bounding for protecting published spontaneous ADE reporting data from privacy attacks. Our model has the flexibility of varying privacy thresholds, i.e., θ*, for different sensitive values and takes the characteristics of SRS data into consideration. We also propose an anonymization algorithm for sanitizing the raw data to meet the requirements specified through the proposed model. Our algorithm adopts a greedy clustering strategy to group the records into clusters, conforming to an innovative anonymization metric that aims to minimize the privacy risk as well as maintain the data utility for ADR detection. An empirical study was conducted using the FAERS dataset from 2004Q1 to 2011Q4. We compared our model with four prevailing methods, including k-anonymity, (X, Y)-anonymity, multi-sensitive l-diversity, and (α, k)-anonymity, evaluated via two measures, Danger Ratio (DR) and Information Loss (IL), and considered three different scenarios of threshold setting for θ*: uniform setting, level-wise setting, and frequency-based setting. We also conducted experiments to inspect the impact of anonymized data on the strengths of discovered ADR signals. With all three
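    To picture the greedy clustering step, here is a minimal, hypothetical sketch: records are grouped around seeds into clusters of at least k, and quasi-identifiers are generalized to cluster-wide ranges. This is generic scaffolding, not the paper's MS(k, θ*)-bounding algorithm or its anonymization metric, which would replace the plain distance function below:

    ```python
    def greedy_k_clusters(records, k, distance):
        """Greedily group records into clusters of size >= k (hypothetical sketch)."""
        unassigned = list(records)
        clusters = []
        while len(unassigned) >= k:
            seed = unassigned.pop(0)
            # pick the k-1 records closest to the seed
            unassigned.sort(key=lambda r: distance(seed, r))
            clusters.append([seed] + unassigned[:k - 1])
            unassigned = unassigned[k - 1:]
        if unassigned and clusters:        # fold leftovers into the last cluster
            clusters[-1].extend(unassigned)
        return clusters

    def generalize(cluster, quasi_ids):
        """Replace each quasi-identifier with the cluster-wide value range."""
        out = []
        for rec in cluster:
            g = dict(rec)
            for q in quasi_ids:
                vals = [r[q] for r in cluster]
                g[q] = (min(vals), max(vals))  # numeric range generalization
            out.append(g)
        return out
    ```

    In the paper's method, the distance function would be driven by its anonymization metric, trading privacy risk against utility for ADR signal detection.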

  6. SciSpark's SRDD : A Scientific Resilient Distributed Dataset for Multidimensional Data

    Science.gov (United States)

    Palamuttam, R. S.; Wilson, B. D.; Mogrovejo, R. M.; Whitehall, K. D.; Mattmann, C. A.; McGibbney, L. J.; Ramirez, P.

    2015-12-01

    Remote sensing data and climate model output are multi-dimensional arrays of massive sizes locked away in heterogeneous file formats (HDF5/4, NetCDF 3/4) and metadata models (HDF-EOS, CF), making it difficult to perform multi-stage, iterative science processing since each stage requires writing and reading data to and from disk. We have developed SciSpark, a robust Big Data framework that extends Apache™ Spark for scaling scientific computations. Apache Spark improves on the map-reduce implementation in Apache™ Hadoop for parallel computing on a cluster by emphasizing in-memory computation, "spilling" to disk only as needed, and relying on lazy evaluation. Central to Spark is the Resilient Distributed Dataset (RDD), an in-memory distributed data structure that extends the functional paradigm provided by the Scala programming language. However, RDDs are ideal for tabular or unstructured data, not for highly dimensional data. The SciSpark project introduces the Scientific Resilient Distributed Dataset (sRDD), a distributed-computing array structure which supports iterative scientific algorithms for multidimensional data. SciSpark processes data stored in NetCDF and HDF files by partitioning them across time or space and distributing the partitions among a cluster of compute nodes. We show the usability and extensibility of SciSpark by implementing distributed algorithms for geospatial operations on large collections of multi-dimensional grids. In particular we address the problem of scaling an automated method for finding Mesoscale Convective Complexes. SciSpark provides a tensor interface to support the pluggability of different matrix libraries. We evaluate the performance of the various matrix libraries in distributed pipelines, such as Nd4j™ and Breeze™. We detail the architecture and design of SciSpark, our efforts to integrate climate science algorithms, and parallel ingest and partitioning (sharding) of A-Train satellite observations from model grids. These
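    To make the time-partitioning idea concrete, here is a minimal PySpark sketch that loads one NetCDF file per time step on the executors and runs a per-partition reduction. The file paths and variable name are hypothetical, and this is plain Spark in Python rather than SciSpark's Scala sRDD:

    ```python
    # Assumes a cluster-visible directory of NetCDF files (one per time step)
    # and that netCDF4/numpy are installed on the executors.
    from pyspark import SparkContext
    import netCDF4
    import numpy as np

    sc = SparkContext(appName="grid-partitioning-sketch")

    paths = ["/data/merra/tas_2000%02d.nc" % m for m in range(1, 13)]  # hypothetical

    def load_grid(path):
        # Executed on the workers: each partition reads its own time step.
        with netCDF4.Dataset(path) as ds:
            return path, np.array(ds.variables["tas"][:])  # hypothetical variable

    grids = sc.parallelize(paths, numSlices=len(paths)).map(load_grid)

    # Example distributed operation: per-time-step spatial mean.
    means = grids.mapValues(lambda a: float(a.mean())).collect()
    ```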

  7. Standardization of GIS datasets for emergency preparedness of NPPs

    International Nuclear Information System (INIS)

    Saindane, Shashank S.; Suri, M.M.K.; Otari, Anil; Pradeepkumar, K.S.

    2012-01-01

    The probability of a major nuclear accident leading to a large-scale release of radioactivity into the environment is kept extremely small by the incorporation of safety systems and a defence-in-depth philosophy. Nevertheless, emergency preparedness for the implementation of countermeasures to reduce the consequences is required for all major nuclear facilities. Iodine prophylaxis, sheltering, and evacuation are protective measures to be implemented for members of the public in the unlikely event of any significant release from nuclear facilities. Bhabha Atomic Research Centre has developed a GIS-supported Nuclear Emergency Preparedness Program. Preparedness for response to nuclear emergencies needs geographical details of the affected locations, especially nuclear power plant sites and the nearby public domain. The geographical information system datasets which the planners are looking for must have appropriate details in order to take decisions, mobilize resources in time, and follow the Standard Operating Procedures. Maps are 2-dimensional representations of our real world, and GIS makes it possible to manipulate large amounts of geo-spatially referenced data and convert it into information. This has become an integral part of nuclear emergency preparedness and response planning. These GIS datasets, consisting of layers such as village settlements, roads, hospitals, police stations, and shelters, are standardized and used effectively during an emergency. The paper focuses on the need for standardization of GIS datasets, which in turn can be used as a tool to display and evaluate the impact of standoff distances and selected zones in community planning. It also highlights the database specifications which will help in fast processing of data and analysis to derive useful and helpful information. GIS has the capability to store, manipulate, analyze and display the large amount of required spatial and tabular data. This study intends to carry out a proper response and preparedness

  8. Forest restoration: a global dataset for biodiversity and vegetation structure.

    Science.gov (United States)

    Crouzeilles, Renato; Ferreira, Mariana S; Curran, Michael

    2016-08-01

    Restoration initiatives are being applied increasingly around the world. Billions of dollars have been spent on ecological restoration research and initiatives, but restoration outcomes differ widely among these initiatives, in part due to variable socioeconomic and ecological contexts. Here, we present the most comprehensive dataset gathered to date on forest restoration. It encompasses 269 primary studies across 221 study landscapes in 53 countries and contains 4,645 quantitative comparisons between reference ecosystems (e.g., old-growth forest) and degraded or restored ecosystems for five taxonomic groups (mammals, birds, invertebrates, herpetofauna, and plants) and five measures of vegetation structure reflecting different ecological processes (cover, density, height, biomass, and litter). We selected studies that (1) were conducted in forest ecosystems; (2) had multiple replicate sampling sites to measure indicators of biodiversity and/or vegetation structure in reference and restored and/or degraded ecosystems; and (3) used less-disturbed forests as a reference to the ecosystem under study. We recorded (1) latitude and longitude; (2) study year; (3) country; (4) biogeographic realm; (5) past disturbance type; (6) current disturbance type; (7) forest conversion class; (8) restoration activity; (9) time that a system has been disturbed; (10) time elapsed since restoration started; (11) ecological metric used to assess biodiversity; and (12) quantitative value of the ecological metric of biodiversity and/or vegetation structure for reference and restored and/or degraded ecosystems. These were the most common data available in the selected studies. We also estimated forest cover and configuration in each study landscape using a recently developed 1-km consensus land cover dataset. We measured forest configuration as the (1) mean size of all forest patches; (2) size of the largest forest patch; and (3) edge:area ratio of forest patches. Global analyses of the

  9. BanglaLekha-Isolated: A multi-purpose comprehensive dataset of Handwritten Bangla Isolated characters

    Directory of Open Access Journals (Sweden)

    Mithun Biswas

    2017-06-01

    BanglaLekha-Isolated, a Bangla handwritten isolated-character dataset, is presented in this article. This dataset contains 84 different characters, comprising 50 Bangla basic characters, 10 Bangla numerals and 24 selected compound characters. 2000 handwriting samples for each of the 84 characters were collected, digitized and pre-processed. After discarding mistakes and scribbles, 166,105 handwritten character images were included in the final dataset. The dataset also includes labels indicating the age and the gender of the subjects from whom the samples were collected. This dataset could be used not only for optical handwriting recognition research but also to explore the influence of gender and age on handwriting. The dataset is publicly available at https://data.mendeley.com/datasets/hf6sf8zrkc/2.

  10. A Research Graph dataset for connecting research data repositories using RD-Switchboard.

    Science.gov (United States)

    Aryani, Amir; Poblet, Marta; Unsworth, Kathryn; Wang, Jingbo; Evans, Ben; Devaraju, Anusuriya; Hausstein, Brigitte; Klas, Claus-Peter; Zapilko, Benjamin; Kaplun, Samuele

    2018-05-29

    This paper describes the open access graph dataset that shows the connections between Dryad, CERN, ANDS and other international data repositories to publications and grants across multiple research data infrastructures. The graph dataset was created using the Research Graph data model and the Research Data Switchboard (RD-Switchboard), a collaborative project by the Research Data Alliance DDRI Working Group (DDRI WG) with the aim to discover and connect the related research datasets based on publication co-authorship or jointly funded grants. The graph dataset allows researchers to trace and follow the paths to understanding a body of work. By mapping the links between research datasets and related resources, the graph dataset improves both their discovery and visibility, while avoiding duplicate efforts in data creation. Ultimately, the linked datasets may spur novel ideas, facilitate reproducibility and re-use in new applications, stimulate combinatorial creativity, and foster collaborations across institutions.

  11. Science of science.

    Science.gov (United States)

    Fortunato, Santo; Bergstrom, Carl T; Börner, Katy; Evans, James A; Helbing, Dirk; Milojević, Staša; Petersen, Alexander M; Radicchi, Filippo; Sinatra, Roberta; Uzzi, Brian; Vespignani, Alessandro; Waltman, Ludo; Wang, Dashun; Barabási, Albert-László

    2018-03-02

    Identifying fundamental drivers of science and developing predictive models to capture its evolution are instrumental for the design of policies that can improve the scientific enterprise-for example, through enhanced career paths for scientists, better performance evaluation for organizations hosting research, discovery of novel effective funding vehicles, and even identification of promising regions along the scientific frontier. The science of science uses large-scale data on the production of science to search for universal and domain-specific patterns. Here, we review recent developments in this transdisciplinary field. Copyright © 2018 The Authors, some rights reserved; exclusive licensee American Association for the Advancement of Science. No claim to original U.S. Government Works.

  12. A dataset from bottom trawl survey around Taiwan

    Directory of Open Access Journals (Sweden)

    Kwang-tsao Shao

    2012-05-01

    Bottom trawl fishery is one of the most important coastal fisheries in Taiwan in both production and economic value. However, its annual production has been declining since the 1980s due to overfishing, and its bycatch problem also seriously damages the fishery resource. Thus, the government banned bottom trawling within 3 nautical miles of the shoreline in 1989. To evaluate the effectiveness of this policy, a four-year survey was conducted from 2000 to 2003 in the waters around Taiwan and Penghu (the Pescadores Islands), covering one region each year. All fish specimens collected from trawling were brought back to the lab for identification, counting of individuals, and body-weight measurement. These raw data have been integrated and established in the Taiwan Fish Database (http://fishdb.sinica.edu.tw). They have also been published through TaiBIF (http://taibif.tw), FishBase and GBIF. This dataset contains 631 fish species and 3,529 records, making it the most complete dataset on demersal fish fauna and their temporal and spatial distribution over soft marine habitats in Taiwan.

  13. Integrated interpretation of overlapping AEM datasets achieved through standardisation

    Science.gov (United States)

    Sørensen, Camilla C.; Munday, Tim; Heinson, Graham

    2015-12-01

    Numerous airborne electromagnetic surveys have been acquired in Australia using a variety of systems. It is not uncommon to find two or more surveys covering the same ground, but acquired using different systems and at different times. Being able to combine overlapping datasets and get a spatially coherent resistivity-depth image of the ground can assist geological interpretation, particularly when more subtle geophysical responses are important. Combining resistivity-depth models obtained from the inversion of airborne electromagnetic (AEM) data can be challenging, given differences in system configuration, geometry, flying height and preservation or monitoring of system acquisition parameters such as waveform. In this study, we define and apply an approach to overlapping AEM surveys, acquired by fixed wing and helicopter time domain electromagnetic (EM) systems flown in the vicinity of the Goulds Dam uranium deposit in the Frome Embayment, South Australia, with the aim of mapping the basement geometry and the extent of the Billeroo palaeovalley. Ground EM soundings were used to standardise the AEM data, although results indicated that only data from the REPTEM system needed to be corrected to bring the two surveys into agreement and to achieve coherent spatial resistivity-depth intervals.

  14. The Centennial Trends Greater Horn of Africa precipitation dataset

    Science.gov (United States)

    Funk, Chris; Nicholson, Sharon E.; Landsfeld, Martin F.; Klotter, Douglas; Peterson, Pete J.; Harrison, Laura

    2015-01-01

    East Africa is a drought prone, food and water insecure region with a highly variable climate. This complexity makes rainfall estimation challenging, and this challenge is compounded by low rain gauge densities and inhomogeneous monitoring networks. The dearth of observations is particularly problematic over the past decade, since the number of records in globally accessible archives has fallen precipitously. This lack of data coincides with an increasing scientific and humanitarian need to place recent seasonal and multi-annual East African precipitation extremes in a deep historic context. To serve this need, scientists from the UC Santa Barbara Climate Hazards Group and Florida State University have pooled their station archives and expertise to produce a high quality gridded ‘Centennial Trends’ precipitation dataset. Additional observations have been acquired from the national meteorological agencies and augmented with data provided by other universities. Extensive quality control of the data was carried out and seasonal anomalies interpolated using kriging. This paper documents the CenTrends methodology and data.

  15. Dataset on daytime outdoor thermal comfort for Belo Horizonte, Brazil.

    Science.gov (United States)

    Hirashima, Simone Queiroz da Silveira; Assis, Eleonora Sad de; Nikolopoulou, Marialena

    2016-12-01

    This dataset describes microclimatic parameters of two urban open public spaces in the city of Belo Horizonte, Brazil; physiological equivalent temperature (PET) index values; and the related subjective responses of interviewees regarding thermal sensation perception and preference and thermal comfort evaluation. Individual and behavioral characteristics of respondents are also presented. Data were collected during the daytime, in summer and winter 2013. Statistical treatment of these data was first presented in a PhD thesis ("Percepção sonora e térmica e avaliação de conforto em espaços urbanos abertos do município de Belo Horizonte - MG, Brasil" (Hirashima, 2014) [1]), providing relevant information on thermal conditions in these locations and on thermal comfort assessment. Up to now, these data have also been explored in the article "Daytime Thermal Comfort in Urban Spaces: A Field Study in Brazil" (Hirashima et al., in press) [2]. These references are recommended for further interpretation and discussion.

  16. Statistically and Computationally Efficient Estimating Equations for Large Spatial Datasets

    KAUST Repository

    Sun, Ying

    2014-11-07

    For Gaussian process models, likelihood-based methods are often difficult to use with large irregularly spaced spatial datasets, because exact calculations of the likelihood for n observations require O(n³) operations and O(n²) memory. Various approximation methods have been developed to address the computational difficulties. In this paper, we propose new unbiased estimating equations based on score equation approximations that are both computationally and statistically efficient. We replace the inverse covariance matrix that appears in the score equations by a sparse matrix to approximate the quadratic forms, then set the resulting quadratic forms equal to their expected values to obtain unbiased estimating equations. The sparse matrix is constructed by a sparse inverse Cholesky approach to approximate the inverse covariance matrix. The statistical efficiency of the resulting unbiased estimating equations is evaluated both in theory and by numerical studies. Our methods are applied to nearly 90,000 satellite-based measurements of water vapor levels over a region in the Southeast Pacific Ocean.

  17. Statistically and Computationally Efficient Estimating Equations for Large Spatial Datasets

    KAUST Repository

    Sun, Ying; Stein, Michael L.

    2014-01-01

    For Gaussian process models, likelihood-based methods are often difficult to use with large irregularly spaced spatial datasets, because exact calculations of the likelihood for n observations require O(n³) operations and O(n²) memory. Various approximation methods have been developed to address the computational difficulties. In this paper, we propose new unbiased estimating equations based on score equation approximations that are both computationally and statistically efficient. We replace the inverse covariance matrix that appears in the score equations by a sparse matrix to approximate the quadratic forms, then set the resulting quadratic forms equal to their expected values to obtain unbiased estimating equations. The sparse matrix is constructed by a sparse inverse Cholesky approach to approximate the inverse covariance matrix. The statistical efficiency of the resulting unbiased estimating equations is evaluated both in theory and by numerical studies. Our methods are applied to nearly 90,000 satellite-based measurements of water vapor levels over a region in the Southeast Pacific Ocean.
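    The construction described in this record (and its duplicate above) can be sketched in two displays. This is a schematic reconstruction from the abstract with assumed notation, not the authors' exact equations:

    ```latex
    % Exact Gaussian score equations for z ~ N(0, K(theta)):
    \[
      \frac{\partial \ell}{\partial \theta_i}
      = \tfrac{1}{2}\, z^\top K^{-1} \frac{\partial K}{\partial \theta_i} K^{-1} z
      - \tfrac{1}{2}\, \operatorname{tr}\!\Big(K^{-1} \frac{\partial K}{\partial \theta_i}\Big) = 0 .
    \]
    % Replace K^{-1} by a sparse approximation A (sparse inverse Cholesky),
    % then re-center the quadratic form at its expectation so the equation
    % stays unbiased, since E[z^T M z] = tr(M K(theta)):
    \[
      g_i(\theta)
      = z^\top A \frac{\partial K}{\partial \theta_i} A\, z
      - \operatorname{tr}\!\Big(A \frac{\partial K}{\partial \theta_i} A\, K(\theta)\Big) = 0 .
    \]
    ```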

  18. Automated Fault Interpretation and Extraction using Improved Supplementary Seismic Datasets

    Science.gov (United States)

    Bollmann, T. A.; Shank, R.

    2017-12-01

    During the interpretation of seismic volumes, it is necessary to interpret faults along with horizons of interest. With the improvement of technology, the interpretation of faults can be expedited with the aid of different algorithms that create supplementary seismic attributes, such as semblance and coherency. These products highlight discontinuities, but still need a large amount of human interaction to interpret faults and are plagued by noise and stratigraphic discontinuities. Hale (2013) presents a method to improve on these datasets by creating what is referred to as a Fault Likelihood volume. In general, these volumes contain less noise and do not emphasize stratigraphic features. Instead, planar features within a specified strike and dip range are highlighted. Once a satisfactory Fault Likelihood Volume is created, extraction of fault surfaces is much easier. The extracted fault surfaces are then exported to interpretation software for QC. Numerous software packages have implemented this methodology with varying results. After investigating these platforms, we developed a preferred Automated Fault Interpretation workflow.

  19. Privacy-preserving record linkage on large real world datasets.

    Science.gov (United States)

    Randall, Sean M; Ferrante, Anna M; Boyd, James H; Bauer, Jacqueline K; Semmens, James B

    2014-08-01

    Record linkage typically involves the use of dedicated linkage units who are supplied with personally identifying information to determine individuals from within and across datasets. The personally identifying information supplied to linkage units is separated from clinical information prior to release by data custodians. While this substantially reduces the risk of disclosure of sensitive information, some residual risks still exist and remain a concern for some custodians. In this paper we trial a method of record linkage which reduces privacy risk still further on large real world administrative data. The method uses encrypted personal identifying information (bloom filters) in a probability-based linkage framework. The privacy preserving linkage method was tested on ten years of New South Wales (NSW) and Western Australian (WA) hospital admissions data, comprising in total over 26 million records. No difference in linkage quality was found when the results were compared to traditional probabilistic methods using full unencrypted personal identifiers. This presents as a possible means of reducing privacy risks related to record linkage in population level research studies. It is hoped that through adaptations of this method or similar privacy preserving methods, risks related to information disclosure can be reduced so that the benefits of linked research taking place can be fully realised. Copyright © 2013 Elsevier Inc. All rights reserved.
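    The Bloom-filter encoding at the heart of this approach is easy to sketch. Below is a minimal, hypothetical illustration of hashing name bigrams into bit sets and comparing them with a Dice coefficient; the real study embeds such encodings in a probability-based linkage framework, and the parameters (m, k) here are arbitrary:

    ```python
    import hashlib

    def bigrams(name):
        s = f"_{name.lower()}_"                  # pad so edge characters contribute
        return {s[i:i + 2] for i in range(len(s) - 1)}

    def bloom_encode(name, m=1000, k=20):
        """Map a name's bigrams into an m-bit Bloom filter using k hashes."""
        bits = set()
        for g in bigrams(name):
            for i in range(k):
                h = hashlib.sha1(f"{i}:{g}".encode()).hexdigest()
                bits.add(int(h, 16) % m)
        return bits

    def dice(a, b):
        """Dice coefficient between two encodings; 1.0 means identical."""
        return 2 * len(a & b) / (len(a) + len(b)) if a or b else 0.0

    # Similar names keep high similarity after encoding, enabling probabilistic
    # linkage without exchanging plaintext identifiers.
    print(dice(bloom_encode("Katherine"), bloom_encode("Catherine")))  # high
    print(dice(bloom_encode("Katherine"), bloom_encode("Robert")))     # low
    ```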

  20. Genomics dataset on unclassified published organism (patent US 7547531)

    Directory of Open Access Journals (Sweden)

    Mohammad Mahfuz Ali Khan Shawan

    2016-12-01

    Nucleotide (DNA) sequence analysis provides important clues regarding the characteristics and taxonomic position of an organism, and is therefore crucial for learning about the hierarchical classification of that particular organism. This dataset (patent US 7547531) was chosen to simplify the complex raw data buried in undisclosed DNA sequences, which helps to open doors for new collaborations. In this data, a total of 48 unidentified DNA sequences from patent US 7547531 were selected and their complete sequences retrieved from the NCBI BioSample database. A quick response (QR) code for each DNA sequence was constructed with the DNA BarID tool; QR codes are useful for the identification and comparison of isolates with other organisms. The AT/GC content of the DNA sequences was determined using the ENDMEMO GC Content Calculator, which indicates their stability at different temperatures. The highest GC content was observed in GP445188 (62.5%), followed by GP445198 (61.8%) and GP445189 (59.44%), while the lowest was in GP445178 (24.39%). In addition, the New England BioLabs (NEB) database was used to identify cleavage codes indicating 5′, 3′ and blunt ends, and enzyme codes indicating the methylation sites of the DNA sequences. These data will be helpful for the construction of the organisms' hierarchical classification, determination of their phylogenetic and taxonomic position and revelation of their molecular characteristics.
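    The GC-content figures quoted above are simple to reproduce. A minimal sketch (the input fragment is hypothetical; the actual sequences are the NCBI records named in the abstract):

    ```python
    def gc_content(seq):
        """Percentage of G and C bases in a DNA sequence."""
        seq = seq.upper()
        return 100.0 * sum(seq.count(b) for b in "GC") / len(seq)

    # Hypothetical fragment for illustration only.
    print(round(gc_content("ATGCGCGTATTACGCG"), 2))  # 56.25
    ```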

  1. Structural dataset for the PPARγ V290M mutant

    Directory of Open Access Journals (Sweden)

    Ana C. Puhl

    2016-06-01

    Loss-of-function mutation V290M in the ligand-binding domain of the peroxisome proliferator-activated receptor γ (PPARγ) is associated with a ligand resistance syndrome (PLRS), characterized by partial lipodystrophy and severe insulin resistance. In this data article we discuss an X-ray diffraction dataset that yielded the structure of the PPARγ LBD V290M mutant refined at 2.3 Å resolution, which allowed building a 3D model of the receptor mutant with high confidence and revealed continuous, well-defined electron density for the partial agonist diclofenac bound to the hydrophobic pocket of PPARγ. These structural data provide significant insights into the molecular basis of PLRS caused by the V290M mutation and are consistent with the receptor's impaired rosiglitazone binding and increased affinity for corepressors. Furthermore, our structural evidence helps to explain clinical observations pointing to a failure to restore receptor function by treatment with rosiglitazone, a full PPARγ agonist.

  2. Earth System Grid II, Turning Climate Datasets into Community Resources

    Energy Technology Data Exchange (ETDEWEB)

    Middleton, Don

    2006-08-01

    The Earth System Grid (ESG) II project, funded by the Department of Energy’s Scientific Discovery through Advanced Computing program, has transformed climate data into community resources. ESG II has accomplished this goal by creating a virtual collaborative environment that links climate centers and users around the world to models and data via a computing Grid, which is based on the Department of Energy’s supercomputing resources and the Internet. Our project’s success stems from partnerships between climate researchers and computer scientists to advance basic and applied research in the terrestrial, atmospheric, and oceanic sciences. By interfacing with other climate science projects, we have learned that commonly used methods to manage and remotely distribute data among related groups lack infrastructure and under-utilize existing technologies. Knowledge and expertise gained from ESG II have helped the climate community plan strategies to manage a rapidly growing data environment more effectively. Moreover, approaches and technologies developed under the ESG project have impacted data-simulation integration in other disciplines, such as astrophysics, molecular biology and materials science.

  3. Crop improvement using life cycle datasets acquired under field conditions

    Directory of Open Access Journals (Sweden)

    Keiichi Mochida

    2015-09-01

    Crops are exposed to various environmental stresses in the field throughout their life cycle. Modern plant science has provided remarkable insights into the molecular networks of plant stress responses in laboratory conditions, but the responses of different crops to environmental stresses in the field need to be elucidated. Recent advances in omics analytical techniques and information technology have enabled us to integrate data from a spectrum of physiological metrics of field crops. The interdisciplinary efforts of plant science and data science enable us to explore factors that affect crop productivity and identify stress tolerance-related genes and alleles. Here, we describe recent advances in technologies that are key components for data driven crop design, such as population genomics, chronological omics analyses, and computer-aided molecular network prediction. Integration of the outcomes from these technologies will accelerate our understanding of crop phenology under practical field situations and identify key characteristics to represent crop stress status. These elements would help us to genetically engineer designed crops to prevent yield shortfalls because of environmental fluctuations due to future climate change.

  4. Crop improvement using life cycle datasets acquired under field conditions.

    Science.gov (United States)

    Mochida, Keiichi; Saisho, Daisuke; Hirayama, Takashi

    2015-01-01

    Crops are exposed to various environmental stresses in the field throughout their life cycle. Modern plant science has provided remarkable insights into the molecular networks of plant stress responses in laboratory conditions, but the responses of different crops to environmental stresses in the field need to be elucidated. Recent advances in omics analytical techniques and information technology have enabled us to integrate data from a spectrum of physiological metrics of field crops. The interdisciplinary efforts of plant science and data science enable us to explore factors that affect crop productivity and identify stress tolerance-related genes and alleles. Here, we describe recent advances in technologies that are key components for data driven crop design, such as population genomics, chronological omics analyses, and computer-aided molecular network prediction. Integration of the outcomes from these technologies will accelerate our understanding of crop phenology under practical field situations and identify key characteristics to represent crop stress status. These elements would help us to genetically engineer "designed crops" to prevent yield shortfalls because of environmental fluctuations due to future climate change.

  5. 'Best' Practices for Aggregating Subset Results from Archived Datasets

    Science.gov (United States)

    Baskin, W. E.; Perez, J.

    2013-12-01

    In response to the exponential growth in science data analysis and visualization capabilities, data centers have been developing new delivery mechanisms to package and deliver large volumes of aggregated subsets of archived data. New standards are evolving to help data providers and application programmers deal with the growing needs of the science community. These standards evolve from the best practices gleaned from new products and capabilities. The NASA Atmospheric Sciences Data Center (ASDC) has developed and deployed production provider-specific search and subset web applications for the CALIPSO, CERES, TES, and MOPITT missions. This presentation explores several use cases that leverage aggregated subset results and examines the standards and formats ASDC developers applied to the delivered files, as well as the implementation strategies for subsetting and processing the aggregated products. The following topics will be addressed: applications of NetCDF CF conventions to aggregated level 2 satellite subsets; data-provider-specific format requirements vs. generalized standards; organization of the file structure of aggregated NetCDF subset output; global attributes of individual subsetted files vs. aggregated results; and specific applications and frameworks used for subsetting and delivering derivative data files.

  6. Thermodynamic Data Rescue and Informatics for Deep Carbon Science

    Science.gov (United States)

    Zhong, H.; Ma, X.; Prabhu, A.; Eleish, A.; Pan, F.; Parsons, M. A.; Ghiorso, M. S.; West, P.; Zednik, S.; Erickson, J. S.; Chen, Y.; Wang, H.; Fox, P. A.

    2017-12-01

    A large number of legacy datasets are contained in geoscience literature published between 1930 and 1980 and not expressed external to the publication text in digitized formats. Extracting, organizing, and reusing these "dark" datasets is highly valuable for many within the Earth and planetary science community. As a part of the Deep Carbon Observatory (DCO) data legacy missions, the DCO Data Science Team and Extreme Physics and Chemistry community identified thermodynamic datasets related to carbon, or more specifically datasets about the enthalpy and entropy of chemicals, as a proof-of-principle analysis. The data science team endeavored to develop a semi-automatic workflow, which includes identifying relevant publications, extracting contained datasets using OCR methods, collaborative reviewing, and registering the datasets via the DCO Data Portal, where the 'Linked Data' feature of the data portal provides a mechanism for connecting rescued datasets beyond their individual data sources, to research domains, DCO Communities, and more, making data discovery and retrieval more effective. To date, the team has successfully rescued, deposited and registered additional datasets from publications with thermodynamic sources. These datasets contain 3 main types of data: (1) heat content or enthalpy data determined for a given compound as a function of temperature using high-temperature calorimetry, (2) heat content or enthalpy data determined for a given compound as a function of temperature using adiabatic calorimetry, and (3) direct determination of heat capacity of a compound as a function of temperature using differential scanning calorimetry. The data science team integrated these datasets and delivered a spectrum of data analytics including visualizations, which will lead to a comprehensive characterization of the thermodynamics of carbon and carbon-related materials.

  7. Science Smiles

    Indian Academy of Sciences (India)

    Science Smiles, the Chief Editor's column in Resonance – Journal of Science Education, by R K Laxman. Volume 1, Issue 4, April 1996, p. 4; Volume 1, Issue 5, May 1996, p. 3.

  8. Science or Science Fiction?

    DEFF Research Database (Denmark)

    Lefsrud, Lianne M.; Meyer, Renate

    2012-01-01

    This paper examines the framings and identity work associated with professionals’ discursive construction of climate change science, their legitimation of themselves as experts on ‘the truth’, and their attitudes towards regulatory measures. Drawing from survey responses of 1077 professional…, legitimation strategies, and use of emotionality and metaphor. By linking notions of the science or science fiction of climate change to the assessment of the adequacy of global and local policies and of potential organizational responses, we contribute to the understanding of ‘defensive institutional work…

  9. The LRO Diviner Foundation Dataset: A Comprehensive Temperature Record of the Moon

    Science.gov (United States)

    Sefton-Nash, E.; Aye, K. M.; Williams, J. P.; Greenhagen, B. T.; Sullivan, M.; Paige, D. A.

    2014-12-01

    The Diviner Lunar Radiometer Experiment aboard NASA's Lunar Reconnaissance Orbiter (LRO) has been systematically mapping the thermal state of the Moon at a mean rate of >1400 observations/second since July 2009. Diviner measures solar reflectance and infrared radiance in 9 spectral channels with bandpasses from 0.3 - 400 μm. With more than 5 years of continuous data, complete spatial coverage of the lunar surface is achieved multiple times and coverage of local solar time enables the diurnal curve to be well-resolved for a given subsolar point. The Diviner Foundation Dataset (FDS) represents a coordinated effort to recalibrate raw data to improve quality, and produce a definitive and comprehensive set of products for use by the lunar science community. We present the contents and organization of the FDS, background on the enhanced processing pipeline, show how it is retrieved from NASA's Planetary Data System, and demonstrate its use with common mapping & analysis tools. The FDS comprises level 1 Reduced Data Records (RDRs) and level 2/3 Gridded Data Records (GDRs). We produce new RDRs using improved calibration algorithms that remove instrument artifacts and improve accuracy of measured radiance, particularly for polar data in permanently shadowed regions. GDRs are built using a per-orbit gridding scheme, and data are sourced from a database constructed by modeling the effective field-of-view for each observation. Notable gridded products available for lunar science include: 1) Globally mapped brightness temperatures for all channels in tiled cylindrical and polar stereographic map projections, 2) Global hourly temperature snapshots - maps of bolometric temperature binned into 1 hour intervals of local time, 3) Topographic products (elevation, slope and azimuth) for each map tile, that represent the terrain model used to process the data, and 4) Accompanying gridded maps of auxiliary quantities such as emission angle, local solar time, error etc…, for filtering

  10. Oil palm mapping for Malaysia using PALSAR-2 dataset

    Science.gov (United States)

    Gong, P.; Qi, C. Y.; Yu, L.; Cracknell, A.

    2016-12-01

    Oil palm is one of the most productive vegetable oil crops in the world. The main oil palm producing areas are distributed across humid tropical regions such as Malaysia, Indonesia, Thailand, western and central Africa, northern South America, and Central America. Increasing market demand, high yields and low production costs of palm oil are the primary factors driving large-scale commercial cultivation of oil palm, especially in Malaysia and Indonesia. Global demand for palm oil has grown exponentially during the last 50 years, and the expansion of oil palm plantations is linked directly to the deforestation of natural forests. Satellite remote sensing plays an important role in monitoring the expansion of oil palm. However, optical remote sensing images are difficult to acquire in the Tropics because of the frequent occurrence of thick cloud cover. This problem has led to the use of data obtained by synthetic aperture radar (SAR), a sensor capable of all-day/all-weather observation, for studies in the Tropics. In this study, the ALOS-2 (Advanced Land Observing Satellite) PALSAR-2 (Phased Array type L-band SAR) datasets for the year 2015 were used as input to a support vector machine (SVM) based machine learning algorithm. Oil palm/non-oil palm samples were collected using a hexagonal equal-area sampling design. High-resolution images in Google Earth and PALSAR-2 imagery were used in human photo-interpretation to separate oil palm from other cover types (i.e., cropland, forest, grassland, shrubland, water, hard surface and bareland). Using this sample set, the characteristics of oil palm, including PALSAR-2 backscattering coefficients (HH, HV), terrain and climate, were further explored to post-process the SVM output. The average accuracy of the oil palm class is better than 80% in the final oil palm map for Malaysia.
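    As a rough illustration of the classification step, here is a minimal scikit-learn sketch of an SVM trained on per-sample PALSAR-2 backscatter plus auxiliary covariates. The feature set, RBF kernel, and parameter values are assumptions for illustration, not the study's configuration:

    ```python
    import numpy as np
    from sklearn.svm import SVC
    from sklearn.preprocessing import StandardScaler
    from sklearn.pipeline import make_pipeline
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    # Toy feature columns: HH (dB), HV (dB), elevation (m), annual rainfall (mm)
    X = rng.normal(size=(500, 4))
    y = rng.integers(0, 2, size=500)       # 1 = oil palm, 0 = other cover

    # Scaling matters for RBF-kernel SVMs, hence the pipeline.
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10, gamma="scale"))
    print(cross_val_score(clf, X, y, cv=5).mean())
    ```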

  11. Automatic aortic root segmentation in CTA whole-body dataset

    Science.gov (United States)

    Gao, Xinpei; Kitslaar, Pieter H.; Scholte, Arthur J. H. A.; Lelieveldt, Boudewijn P. F.; Dijkstra, Jouke; Reiber, Johan H. C.

    2016-03-01

    Transcatheter aortic valve replacement (TAVR) is an evolving technique for patients with severe aortic stenosis. Typically, in this application a CTA dataset is obtained of the patient's arterial system from the subclavian artery to the femoral arteries, to evaluate the quality of the vascular access route and analyze the aortic root to determine if and which prosthesis should be used. In this paper, we concentrate on the automated segmentation of the aortic root. The purpose of this study was to automatically segment the aortic root in computed tomography angiography (CTA) datasets to support TAVR procedures. The method in this study includes four major steps. First, the patient's cardiac CTA image was resampled to reduce the computation time. Next, the cardiac CTA image was segmented using an atlas-based approach: the most similar atlas was selected from a total of 8 atlases based on its image similarity to the input CTA image. Third, the aortic root segmentation from the previous step was transferred to the patient's whole-body CTA image by affine registration and refined in the fourth step using a deformable subdivision-surface model-fitting procedure based on image intensity. The pipeline was applied to 20 patients. The ground truth was created by an analyst who semi-automatically corrected the contours of the automatic method, where necessary. The average Dice similarity index between the segmentations of the automatic method and the ground truth was found to be 0.965±0.024. In conclusion, the current results are very promising.
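    The reported Dice similarity index is straightforward to compute from two binary masks. A minimal sketch, independent of the paper's segmentation pipeline (the masks below are toy values):

    ```python
    import numpy as np

    def dice_index(seg_a, seg_b):
        """Dice similarity between two binary segmentation masks."""
        a, b = seg_a.astype(bool), seg_b.astype(bool)
        inter = np.logical_and(a, b).sum()
        return 2.0 * inter / (a.sum() + b.sum())

    # Two toy overlapping masks; real use compares automatic vs. ground truth.
    a = np.zeros((10, 10)); a[2:8, 2:8] = 1
    b = np.zeros((10, 10)); b[3:9, 3:9] = 1
    print(dice_index(a, b))  # 2*25/(36+36) ≈ 0.694
    ```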

  12. Comparing Road-Kill Datasets from Hunters and Citizen Scientists in a Landscape Context

    Directory of Open Access Journals (Sweden)

    Florian Heigl

    2016-10-01

    Road traffic has severe effects on animals, especially when road-kills are involved. In many countries, official road-kill data are provided by hunters or police; there are also road-kill observations reported by citizen scientists. The aim of the current study was to test whether road-kill reports by hunters stem from landscapes similar to those reported by citizen scientists. We analysed the surrounding landscapes of 712 road-kill reports of European hares in the province of Lower Austria. Our data showed that road-killed hares reported both by hunters and citizens are predominantly surrounded by arable land. No difference in hedges and solitary trees was found between the two datasets. However, significant differences in land-cover classes and surrounding road networks indicate that hunters’ and citizen scientists’ data are different. Hunters reported hares from landscapes with significantly higher percentages of arable land and greater lengths of secondary roads. In contrast, citizens reported hares from landscapes with significantly higher percentages of urban or industrial areas and greater lengths of motorways, primary roads, and residential roads. From this we argue that hunters tend to report data mainly from their hunting areas, whereas citizens report data during their daily routine on the way to/from work. We conclude that a citizen science approach is an important source of road-kill data when used in addition to official data, with the aim of obtaining an overview of road-kill events on a landscape scale.

  13. Artificial frame filling using adaptive neural fuzzy inference system for particle image velocimetry dataset

    Science.gov (United States)

    Akdemir, Bayram; Doǧan, Sercan; Aksoy, Muharrem H.; Canli, Eyüp; Özgören, Muammer

    2015-03-01

    Liquid behaviors are very important in many areas, especially in mechanical engineering. High-speed cameras are one way to observe and study liquid behavior: the camera traces dust or colored markers travelling in the liquid and takes as many pictures per second as possible. Every image carries a large data structure due to its resolution. For fast liquid velocities, it is not easy to evaluate the captured images or produce a fluent sequence from them. Artificial intelligence is widely used in science to solve nonlinear problems, and the adaptive neural fuzzy inference system (ANFIS) is a common artificial-intelligence method in the literature. Any particle in a liquid has a two-dimensional velocity and its derivatives. In this study, ANFIS was used offline to create an artificial frame between each pair of consecutive real frames: it uses velocities and vorticities to create a crossing-point vector between the previous and subsequent points, thereby filling virtual frames among the real frames to improve image continuity. This makes the images much more understandable at chaotic or high-vorticity points. After applying ANFIS, the image dataset doubles in size, with frames alternating between virtual and real. The obtained success was evaluated using R² testing and mean squared error; R², which indicates statistical similarity, reached 0.82, 0.81, 0.85 and 0.8 for the velocities and their derivatives, respectively.

  14. Consistency of two global MODIS aerosol products over ocean on Terra and Aqua CERES SSF datasets

    Science.gov (United States)

    Ignatov, Alexander; Minnis, Patrick; Wielicki, Bruce; Loeb, Norman G.; Remer, Lorraine A.; Kaufman, Yoram J.; Miller, Walter F.; Sun-Mack, Sunny; Laszlo, Istvan; Geier, Erika B.

    2004-12-01

    MODIS aerosol retrievals over ocean from the Terra and Aqua platforms are available from the Clouds and the Earth's Radiant Energy System (CERES) Single Scanner Footprint (SSF) datasets generated at NASA Langley Research Center (LaRC). Two aerosol products are reported side by side. The primary M product is generated by subsetting and remapping the multi-spectral (0.44 - 2.1 μm) MOD04 aerosols onto CERES footprints. MOD04 processing uses cloud screening and aerosol algorithms developed by the MODIS science team. The secondary (AVHRR-like) A product is generated in only two MODIS bands: 1 and 6 on Terra, and 1 and 7 on Aqua. The A processing uses NASA/LaRC cloud screening and the NOAA/NESDIS single-channel aerosol algorithm. The M and A products have been documented elsewhere and preliminarily compared using two weeks of global Terra CERES SSF (Edition 1A) data in December 2000 and June 2001. In this study, the M and A aerosol optical depths (AOD) in MODIS band 1 (0.64 μm), τ1M and τ1A, are further checked for cross-platform consistency using 9 days of global Terra CERES SSF (Edition 2A) and Aqua CERES SSF (Edition 1A) data from 13 - 21 October 2002.

  15. Visualization and quantification of evolving datasets. Final report: 8-1-93 - 4-30-97

    International Nuclear Information System (INIS)

    Zabusky, N.; Silver, D.

    1999-01-01

    The material below is the final technical/progress report of the Laboratory for Visiometrics and Modeling (Vizlab) for the grant entitled Visualization and Quantification of Evolving Phenomena. This includes coordination with DOE-supported scientists at Los Alamos National Laboratory (LANL) and Princeton Plasma Physics Laboratory (PPPL), and with theoretical and computational physicists at the National Institute of Fusion Science (NIFS) in Nagoya, Japan and the Institute of Laser Engineering (ILE) in Osaka, Japan. The authors' research areas included: enhancement and distribution of the DAVID environment, a 2D visualization environment incorporating many advanced quantifications and diagnostics useful for prediction, understanding, and reduced-model formation; feature extraction, tracking and quantification of 3D time-dependent datasets of nonlinear and turbulent simulations, both compressible and incompressible, work applicable to all 3D time-varying simulations; and visiometrics in shock-interface interactions and mixing for the Richtmyer-Meshkov (RM) environment, which highlights reduced models for nonlinear evolutions and the role of density-stratified interfaces (contact discontinuities) and has application to supernova physics, laser fusion and supersonic combustion. The collaborative projects included (1) feature extraction, tracking and quantification in 3D turbulence, compressible and incompressible; (2) the Numerical Tokamak Project (NTP); and (3) data projection and reduced modeling for shock-interface interactions and mixing (the Richtmyer-Meshkov (RM) environment relevant to laser fusion and combustion).

  16. Introduction of a simple-model-based land surface dataset for Europe

    Science.gov (United States)

    Orth, Rene; Seneviratne, Sonia I.

    2015-04-01

    Land surface hydrology can play a crucial role during extreme events such as droughts, floods and even heat waves. We introduce in this study a new hydrological dataset for Europe that consists of soil moisture, runoff and evapotranspiration (ET). It is derived with a simple water balance model (SWBM) forced with precipitation, temperature and net radiation. The SWBM dataset extends over the period 1984-2013 with a daily time step and 0.5° × 0.5° resolution. We employ a novel calibration approach, in which we consider 300 random parameter sets chosen from an observation-based range. Using several independent validation datasets representing soil moisture (or terrestrial water content), ET and streamflow, we identify the best-performing parameter set and hence the new dataset. To illustrate its usefulness, the SWBM dataset is compared against several state-of-the-art datasets (ERA-Interim/Land, MERRA-Land, GLDAS-2-Noah, and simulations of the Community Land Model Version 4), using all validation datasets as reference. For soil moisture dynamics it outperforms the benchmarks. Therefore the SWBM soil moisture dataset constitutes a reasonable alternative to sparse measurements, little-validated model results, or proxy data such as precipitation indices. The SWBM dataset also performs well in terms of runoff, whereas the evaluation of the SWBM ET dataset is overall satisfactory, though the dynamics are less well captured for this variable. This highlights the limitations of the dataset, as it is based on a simple model that uses uniform parameter values. Hence some processes impacting ET dynamics may not be captured, and quality issues may occur in regions with complex terrain. Even though the SWBM is well calibrated, it cannot replace more sophisticated models; but as their calibration is a complex task, the present dataset may serve as a benchmark in the future. In addition we investigate the sources of skill of the SWBM dataset and find that the parameter set has a similar
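    To make the model class concrete, below is a minimal daily bucket-type water-balance step in the spirit of a simple SWBM: runoff and ET both scale with relative soil saturation. The functional forms and parameter names (cap, alpha, beta) are illustrative assumptions, not the calibrated SWBM formulation; the calibration described above amounts to scoring 300 random draws of such parameters against validation data and keeping the best-performing set.

    ```python
    def swbm_step(w, precip, net_rad, cap=400.0, alpha=2.0, beta=0.6):
        """Advance soil water w (mm) by one day; returns (w, runoff, et)."""
        sat = w / cap                          # relative saturation, 0..1
        runoff = precip * sat ** alpha         # wetter soil -> more runoff
        et = beta * net_rad * sat              # ET limited by available water
        w = min(max(w + precip - runoff - et, 0.0), cap)
        return w, runoff, et

    w = 200.0
    for p, rn in [(10, 3), (0, 4), (25, 2)]:   # toy daily forcing (mm, mm-equivalent)
        w, q, et = swbm_step(w, p, rn)
        print(round(w, 1), round(q, 2), round(et, 2))
    ```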

  17. Enabling distributed petascale science

    International Nuclear Information System (INIS)

    Baranovski, Andrew; Bharathi, Shishir; Bresnahan, John

    2007-01-01

    Petascale science is an end-to-end endeavour, involving not only the creation of massive datasets at supercomputers or experimental facilities, but the subsequent analysis of that data by a user community that may be distributed across many laboratories and universities. The new SciDAC Center for Enabling Distributed Petascale Science (CEDPS) is developing tools to support this end-to-end process. These tools include data placement services for the reliable, high-performance, secure, and policy-driven placement of data within a distributed science environment; tools and techniques for the construction, operation, and provisioning of scalable science services; and tools for the detection and diagnosis of failures in end-to-end data placement and distributed application hosting configurations. In each area, we build on a strong base of existing technology and have made useful progress in the first year of the project. For example, we have recently achieved order-of-magnitude improvements in transfer times (for lots of small files) and implemented asynchronous data staging capabilities; demonstrated dynamic deployment of complex application stacks for the STAR experiment; and designed and deployed end-to-end troubleshooting services. We look forward to working with SciDAC application and technology projects to realize the promise of petascale science

  18. VideoWeb Dataset for Multi-camera Activities and Non-verbal Communication

    Science.gov (United States)

    Denina, Giovanni; Bhanu, Bir; Nguyen, Hoang Thanh; Ding, Chong; Kamal, Ahmed; Ravishankar, Chinya; Roy-Chowdhury, Amit; Ivers, Allen; Varda, Brenda

    Human-activity recognition is one of the most challenging problems in computer vision. Researchers from around the world have tried to solve this problem and have come a long way in recognizing simple motions and atomic activities. As the computer vision community heads toward fully recognizing human activities, a challenging and labeled dataset is needed. To respond to that need, we collected a dataset of realistic scenarios in a multi-camera network environment (VideoWeb) involving multiple persons performing dozens of different repetitive and non-repetitive activities. This chapter describes the details of the dataset. We believe that this VideoWeb Activities dataset is unique and it is one of the most challenging datasets available today. The dataset is publicly available online at http://vwdata.ee.ucr.edu/ along with the data annotation.

  19. USGS Watershed Boundary Dataset (WBD) Overlay Map Service from The National Map - National Geospatial Data Asset (NGDA) Watershed Boundary Dataset (WBD)

    Data.gov (United States)

    U.S. Geological Survey, Department of the Interior — The Watershed Boundary Dataset (WBD) from The National Map (TNM) defines the perimeter of drainage areas formed by the terrain and other landscape characteristics....

  20. The StreamCat Dataset: Accumulated Attributes for NHDPlusV2 Catchments (Version 2.1) for the Conterminous United States: National Coal Resource Dataset System

    Data.gov (United States)

    U.S. Environmental Protection Agency — This dataset represents the coal mine density and storage volumes within individual, local NHDPlusV2 catchments and upstream, contributing watersheds based on the...

  1. The StreamCat Dataset: Accumulated Attributes for NHDPlusV2 (Version 2.1) Catchments for the Conterminous United States: National Anthropogenic Barrier Dataset

    Data.gov (United States)

    U.S. Environmental Protection Agency — This dataset represents the dam density and storage volumes within individual, local NHDPlusV2 catchments and upstream, contributing watersheds based on the National...

  2. The StreamCat Dataset: Accumulated Attributes for NHDPlusV2 (Version 2.1) Catchments for the Conterminous United States: National Elevation Dataset

    Data.gov (United States)

    U.S. Environmental Protection Agency — This dataset represents the elevation values within individual local NHDPlusV2 catchments and upstream, contributing watersheds based on the National Elevation...

  3. Topic modeling for cluster analysis of large biological and medical datasets.

    Science.gov (United States)

    Zhao, Weizhong; Zou, Wen; Chen, James J

    2014-01-01

    The big data moniker is nowhere better deserved than to describe the ever-increasing prodigiousness and complexity of biological and medical datasets. New methods are needed to generate and test hypotheses, foster biological interpretation, and build validated predictors. Although multivariate techniques such as cluster analysis may allow researchers to identify groups, or clusters, of related variables, the accuracy and effectiveness of traditional clustering methods diminish for large, high-dimensional datasets. Topic modeling is an active research field in machine learning and has mainly been used as an analytical tool to structure large textual corpora for data mining. Its ability to reduce high dimensionality to a small number of latent variables makes it suitable as a means for clustering or overcoming clustering difficulties in large biological and medical datasets. In this study, three topic-model-derived clustering methods (highest probable topic assignment, feature selection, and feature extraction) are proposed and tested on the cluster analysis of three large datasets: a Salmonella pulsed-field gel electrophoresis (PFGE) dataset, a lung cancer dataset, and a breast cancer dataset, which represent various types of large biological or medical datasets. All three methods are shown to improve the effectiveness of clustering results on the three datasets in comparison to traditional methods. A preferable cluster analysis method emerged for each of the three datasets on the basis of replicating known biological truths. Topic modeling could be advantageously applied to the large datasets of biological or medical research. The three proposed topic-model-derived clustering methods, highest probable topic assignment, feature selection and feature extraction, yield clustering improvements for the three different data types. Clusters more efficaciously represent truthful groupings and subgroupings in the data than traditional methods, suggesting
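    As an illustration of the first strategy, "highest probable topic assignment", the sketch below fits an LDA model to a nonnegative count matrix and assigns each sample to its most probable topic. The toy data and component count are assumptions, not the study's datasets or settings:

    ```python
    import numpy as np
    from sklearn.decomposition import LatentDirichletAllocation

    rng = np.random.default_rng(1)
    X = rng.poisson(2.0, size=(300, 50))      # toy count matrix: samples x features

    lda = LatentDirichletAllocation(n_components=5, random_state=0)
    doc_topic = lda.fit_transform(X)          # per-sample topic distributions

    clusters = doc_topic.argmax(axis=1)       # assign each sample its top topic
    print(np.bincount(clusters))              # resulting cluster sizes
    ```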

  4. A new dataset and algorithm evaluation for mood estimation in music

    OpenAIRE

    Godec, Primož

    2014-01-01

    This thesis presents a new dataset of perceived and induced emotions for 200 audio clips. The gathered dataset provides users' perceived and induced emotions for each clip, the associated color, along with demographic and personal data such as the user's emotional state and emotion ratings, genre preference, and music experience, among others. With an online survey we collected more than 7000 responses for a dataset of 200 audio excerpts, thus providing about 37 user responses per clip. The foc...

  5. Selecting minimum dataset soil variables using PLSR as a regressive multivariate method

    Science.gov (United States)

    Stellacci, Anna Maria; Armenise, Elena; Castellini, Mirko; Rossi, Roberta; Vitti, Carolina; Leogrande, Rita; De Benedetto, Daniela; Ferrara, Rossana M.; Vivaldi, Gaetano A.

    2017-04-01

    Long-term field experiments and science-based tools that characterize soil status (namely, soil quality indices, SQIs) play a strategic role in assessing the effects of agronomic techniques and thus in improving soil management, especially in marginal environments. Selecting key soil variables that best represent soil status is a critical step in the calculation of SQIs. Current studies show the effectiveness of statistical variable selection methods for extracting relevant information from multivariate datasets. Principal component analysis (PCA) has mainly been used; however, supervised multivariate methods and regressive techniques are progressively being evaluated (Armenise et al., 2013; de Paul Obade et al., 2016; Pulido Moncada et al., 2014). The present study explores the effectiveness of partial least squares regression (PLSR) in selecting critical soil variables, using a dataset comparing conventional tillage and sod-seeding on durum wheat. The results were compared with those obtained using PCA and stepwise discriminant analysis (SDA). The soil data derive from a long-term field experiment in Southern Italy. On samples collected in April 2015, the following set of variables was quantified: (i) chemical: total organic carbon and nitrogen (TOC and TN), alkali-extractable C (TEC and humic substances, HA-FA), water-extractable N and organic C (WEN and WEOC), Olsen extractable P, exchangeable cations, pH and EC; (ii) physical: texture, dry bulk density (BD), macroporosity (Pmac), air capacity (AC), and relative field capacity (RFC); (iii) biological: microbial biomass carbon quantified with the fumigation-extraction method. PCA and SDA were previously applied to the multivariate dataset (Stellacci et al., 2016). PLSR was carried out on mean-centered and variance-scaled data of the predictor (soil) and response (wheat yield) variables using the PLS procedure of SAS/STAT. In addition, variable importance for projection (VIP) …
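
    The PLSR-plus-VIP step described above can be sketched with scikit-learn's PLSRegression; the VIP formula below is the standard one, while the data are synthetic stand-ins for the soil variables and wheat yield, so component counts and scales are assumptions rather than the study's settings.

        import numpy as np
        from sklearn.cross_decomposition import PLSRegression

        rng = np.random.default_rng(1)
        X = rng.normal(size=(40, 12))                       # 12 candidate soil variables
        y = 0.8 * X[:, 0] + 0.5 * X[:, 3] + rng.normal(scale=0.3, size=40)  # yield

        pls = PLSRegression(n_components=3).fit(X, y)

        # VIP_j = sqrt( p * sum_a[ SS_a * (w_ja / ||w_a||)^2 ] / sum_a SS_a )
        T, W, Q = pls.x_scores_, pls.x_weights_, pls.y_loadings_
        ss = (Q.ravel() ** 2) * (T ** 2).sum(axis=0)        # variance explained per component
        wnorm = (W / np.linalg.norm(W, axis=0)) ** 2
        vip = np.sqrt(X.shape[1] * (wnorm * ss).sum(axis=1) / ss.sum())
        print(np.argsort(vip)[::-1][:4])                    # variables with the highest VIP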

  6. Hydrological simulation of the Brahmaputra basin using global datasets

    Science.gov (United States)

    Bhattacharya, Biswa; Conway, Crystal; Craven, Joanne; Masih, Ilyas; Mazzolini, Maurizio; Shrestha, Shreedeepy; Ugay, Reyne; van Andel, Schalk Jan

    2017-04-01

    The Brahmaputra River flows through China, India and Bangladesh to the Bay of Bengal and is one of the largest rivers of the world, with a catchment size of 580,000 km2. The catchment is largely hilly and/or forested, sparsely populated, and has limited urbanisation and economic activity. It experiences heavy monsoon rainfall leading to very high flood discharges. Large inter-annual variations in discharge leading to flooding, erosion and morphological change are among the major challenges. The catchment is largely ungauged; moreover, the limited availability of hydro-meteorological data restricts the possibility of carrying out evidence-based research that could provide trustworthy information for managing and, when needed, controlling the basin processes by the riparian countries for overall basin development. The paper presents initial results of a current research project on the Brahmaputra basin. A set of hydrological and hydraulic models (SWAT, HMS, RAS) was developed using publicly available DEM, land use and soil datasets, and simulated with satellite-based rainfall products and evapotranspiration and temperature estimates. Remotely sensed data are compared with sporadically available ground data. The set of models is able to produce catchment-wide hydrological information that could be used in the future to manage the basin's water resources. The model predictions should be used with caution, owing to high uncertainty: the semi-calibrated models are built on uncertain physical representations (e.g. cross-sections) and driven by global meteorological forcing (e.g. TRMM) with limited validation. Major scientific challenges lie in producing robust information that can be reliably used in managing the basin. The information generated by the models is uncertain; as a result, instead of being used per se, it is used to improve the understanding of the catchment, and by running several scenarios with varying …

  7. Investigating automated depth modelling of archaeo-magnetic datasets

    Science.gov (United States)

    Cheyney, Samuel; Hill, Ian; Linford, Neil; Leech, Christopher

    2010-05-01

    Magnetic surveying is a commonly used tool for first-pass, non-invasive archaeological surveying, and is often used to target areas for more detailed geophysical investigation or excavation. Quick, routine processing of magnetic datasets means survey results are typically viewed as 2D greyscale maps, and the shapes of anomalies are interpreted in terms of likely archaeological structures. This technique is simple but ignores some of the information content of the data. Data collected using dense spatial sampling with modern, precise instrumentation are capable of yielding numerical estimates of the depths to buried structures and of their physical properties. The magnetic field measured at the surface is a superposition of the responses to all anomalous magnetic susceptibilities in the subsurface, and is therefore capable of revealing a 3D model of the magnetic properties. The application of mathematical modelling techniques to very-near-surface surveys such as those for archaeology is quite rare; however, similar methods are routinely used in regional-scale mineral exploration surveys. Inverse modelling techniques have inherent ambiguity due to the nature of the mathematical "inverse problem". Often, although a good fit to the recorded values can be obtained, the final model will be non-unique and may be heavily biased by the starting model provided. The run time and computer resources required can also be restrictive. Our approach is to derive as much information as possible from the data directly, and to use this to define a starting model for inversion. This addresses the ambiguity of the inverse problem and reduces the task for the inversion computation. A number of alternative methods exist for obtaining source-body parameters from potential field data. Here, methods involving the derivatives of the total magnetic field are used in association with advanced image processing techniques to outline the edges of anomalous bodies more accurately …

  8. Reliability of Source Mechanisms for a Hydraulic Fracturing Dataset

    Science.gov (United States)

    Eyre, T.; Van der Baan, M.

    2016-12-01

    Non-double-couple components have been inferred for induced seismicity due to fluid injection, yet these components are often poorly constrained by the acquisition geometry. Likewise, non-double-couple components in microseismic recordings are not uncommon. Microseismic source mechanisms provide insight into the fracturing behaviour of a hydraulically stimulated reservoir. However, source inversion in a hydraulic fracturing environment is complicated by the likelihood of volumetric contributions to the source due to the presence of high-pressure fluids, which greatly increases the possible solution space and therefore the non-uniqueness of the solutions. Microseismic data are usually recorded on either 2D surface or borehole arrays of sensors. In many cases, surface arrays appear to constrain source mechanisms with high shear components, whereas borehole arrays tend to constrain more variable mechanisms, including those with high tensile components. The ability of each geometry to constrain the true source mechanisms is therefore called into question. The ability to distinguish between shear and tensile source mechanisms with different acquisition geometries is investigated using synthetic data. For both inversions, P- and S-wave amplitudes recorded on three-component sensors need to be included to obtain reliable solutions. Surface arrays appear to give more reliable solutions due to a greater sampling of the focal sphere, but in reality tend to record signals with a low signal-to-noise ratio. Borehole arrays can produce acceptable results; however, reliability is much more affected by relative source-receiver locations and source orientation, with biases produced in many of the solutions, so more care must be taken when interpreting the results. These findings are taken into account when interpreting a microseismic dataset of 470 events recorded by two vertical borehole arrays monitoring a horizontal treatment well. Source locations and …

  9. Estimated Perennial Streams of Idaho and Related Geospatial Datasets

    Science.gov (United States)

    Rea, Alan; Skinner, Kenneth D.

    2009-01-01

    … record, generally would be considered to represent flow conditions at a given site better than flow estimates based on regionalized regression models. The geospatial datasets of modeled perennial streams are considered a first-cut estimate and should not be construed to override site-specific flow data.

  10. Scalable and portable visualization of large atomistic datasets

    Science.gov (United States)

    Sharma, Ashish; Kalia, Rajiv K.; Nakano, Aiichiro; Vashishta, Priya

    2004-10-01

    A scalable and portable code named Atomsviewer has been developed to interactively visualize a large atomistic dataset consisting of up to a billion atoms. The code uses a hierarchical view frustum-culling algorithm based on the octree data structure to efficiently remove atoms outside of the user's field of view. Probabilistic and depth-based occlusion-culling algorithms then select atoms that have a high probability of being visible. Finally, a multiresolution algorithm is used to render the selected subset of visible atoms at varying levels of detail. Atomsviewer is written in C++ and OpenGL, and it has been tested on a number of architectures including Windows, Macintosh, and SGI. Atomsviewer has been used to visualize tens of millions of atoms on a standard desktop computer and, in its parallel version, up to a billion atoms. Program summary: Title of program: Atomsviewer; Catalogue identifier: ADUM; Program summary URL: http://cpc.cs.qub.ac.uk/summaries/ADUM; Program obtainable from: CPC Program Library, Queen's University of Belfast, N. Ireland; Computers for which the program is designed and others on which it has been tested: 2.4 GHz Pentium 4/Xeon processor with professional graphics card; Apple G4 (867 MHz)/G5 with professional graphics card; Operating systems under which the program has been tested: Windows 2000/XP, Mac OS 10.2/10.3, SGI IRIX 6.5; Programming languages used: C++, C and OpenGL; Memory required to execute with typical data: 1 gigabyte of RAM; High-speed storage required: 60 gigabytes; No. of lines in the distributed program including test data, etc.: 550 241; No. of bytes in the distributed program including test data, etc.: 6 258 245; Number of bits in a word: arbitrary; Number of processors used: 1; Has the code been vectorized or parallelized: no; Distribution format: tar gzip file; Nature of physical problem: scientific visualization of atomic systems; Method of solution: rendering of atoms using computer graphics techniques, culling algorithms for data …
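
    The core of the view-frustum culling stage is a point-in-frustum test. A vectorised Python sketch of that test is shown below, with inward-pointing half-space planes and a brute-force check standing in for the octree traversal the paper uses; the plane set and atom cloud are illustrative.

        import numpy as np

        rng = np.random.default_rng(7)
        atoms = rng.uniform(-1, 1, size=(1_000_000, 3))   # synthetic atom positions

        # Frustum as inward-pointing planes: a point x is inside if n.x + d >= 0 for all planes.
        planes = np.array([[ 1.0, 0.0, 0.0, 0.5],   # x >= -0.5
                           [-1.0, 0.0, 0.0, 0.5],   # x <=  0.5
                           [ 0.0, 1.0, 0.0, 0.5],   # y >= -0.5
                           [ 0.0,-1.0, 0.0, 0.5]])  # y <=  0.5

        inside = np.all(atoms @ planes[:, :3].T + planes[:, 3] >= 0, axis=1)
        print(inside.sum(), "of", len(atoms), "atoms survive culling")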

  11. GUDM: Automatic Generation of Unified Datasets for Learning and Reasoning in Healthcare.

    Science.gov (United States)

    Ali, Rahman; Siddiqi, Muhammad Hameed; Idris, Muhammad; Ali, Taqdir; Hussain, Shujaat; Huh, Eui-Nam; Kang, Byeong Ho; Lee, Sungyoung

    2015-07-02

    A wide array of biomedical data are generated and made available to healthcare experts. However, because of the diverse nature of these data, it is difficult to predict outcomes from them, and it is therefore necessary to combine the diverse sources into a single unified dataset. This paper proposes a global unified data model (GUDM) to provide a global unified data structure for all data sources, with a unified dataset generated by a "data modeler" tool. The proposed tool implements a user-centric, priority-based approach that can readily resolve the problems of unified data modeling and of overlapping attributes across multiple datasets. The tool is illustrated using sample diabetes mellitus data. The diverse data sources used to generate the unified diabetes mellitus dataset include clinical trial information, a social media interaction dataset, and physical activity data collected using different sensors. To demonstrate the value of the unified dataset, we adopted a well-known rough-set-theory-based rule creation process to create rules from it. Evaluation of the tool on six different sets of locally created diverse datasets shows that, on average, it reduces the time effort of the experts and knowledge engineers by 94.1% when creating unified datasets.
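
    The core unification idea, resolving overlapping attributes by a priority order and joining the remaining sources on a shared key, can be illustrated with a toy pandas merge; the tables, column names, and priority rule below are hypothetical, not GUDM's actual data model.

        import pandas as pd

        clinical = pd.DataFrame({"patient_id": [1, 2], "hba1c": [7.1, 6.4]})
        social   = pd.DataFrame({"patient_id": [1, 2], "posts_per_week": [3, 11]})
        sensors  = pd.DataFrame({"patient_id": [1, 2], "steps_per_day": [4200, 9800]})

        # Take the highest-priority source (clinical) first, then outer-join the rest
        # on the shared key so every patient and attribute survives unification.
        unified = (clinical
                   .merge(social, on="patient_id", how="outer")
                   .merge(sensors, on="patient_id", how="outer"))
        print(unified)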

  12. Wehmas et al. 94-04 Toxicol Sci: Datasets for manuscript

    Data.gov (United States)

    U.S. Environmental Protection Agency — Dataset includes overview text document (accepted version of manuscript) and tables, figures, and supplementary materials. Supplementary tables provide summary data...

  13. A New Dataset Size Reduction Approach for PCA-Based Classification in OCR Application

    Directory of Open Access Journals (Sweden)

    Mohammad Amin Shayegan

    2014-01-01

    A major problem of pattern recognition systems is the large volume of training datasets, including duplicate and similar training samples. To overcome this problem, dataset size reduction and dimensionality reduction techniques have been introduced. The algorithms presently used for dataset size reduction usually remove samples near the centers of classes, or support-vector samples between different classes. However, the samples near a class center carry valuable information about the class characteristics, and the support vectors are important for evaluating system efficiency. This paper reports on the use of the Modified Frequency Diagram technique for dataset size reduction. In this newly proposed technique, a training dataset is rearranged and then sieved. The sieved training dataset, together with automatic feature extraction/selection using principal component analysis, is used in an OCR application. The experimental results obtained when using the proposed system on one of the biggest standard handwritten Farsi/Arabic numeral OCR datasets, Hoda, show about 97% recognition accuracy. The recognition speed increased by a factor of 2.28, while the accuracy decreased by only 0.7%, when a sieved version of the dataset, only half the size of the initial training dataset, was used.
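
    The feature extraction/selection stage can be sketched with scikit-learn: fit PCA on a sieved training set and classify in the reduced space. Here scikit-learn's small digits data stands in for the Hoda numerals, and a random half-sample is a placeholder for the Modified Frequency Diagram sieving; all settings are illustrative.

        import numpy as np
        from sklearn.datasets import load_digits
        from sklearn.decomposition import PCA
        from sklearn.neighbors import KNeighborsClassifier
        from sklearn.model_selection import train_test_split

        X, y = load_digits(return_X_y=True)
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

        # "Sieve" the training set to half its size (placeholder for the paper's method).
        keep = np.random.default_rng(0).choice(len(X_tr), len(X_tr) // 2, replace=False)

        pca = PCA(n_components=0.95).fit(X_tr[keep])        # keep 95% of the variance
        clf = KNeighborsClassifier().fit(pca.transform(X_tr[keep]), y_tr[keep])
        print(clf.score(pca.transform(X_te), y_te))         # accuracy on held-out digits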

  14. Topographical effects of climate dataset and their impacts on the estimation of regional net primary productivity

    Science.gov (United States)

    Sun, L. Qing; Feng, Feng X.

    2014-11-01

    In this study, we first built and compared two different climate datasets for the Wuling mountainous area in 2010: one that accounted for topographical effects during ANUSPLIN interpolation, referred to as the terrain-based climate dataset, and one that did not, called the ordinary climate dataset. We then quantified the topographical effects of climatic inputs on NPP estimation by feeding the two climate datasets into the same ecosystem model, the Boreal Ecosystem Productivity Simulator (BEPS), to evaluate the importance of considering relief when estimating NPP. Finally, we identified the primary variables contributing to the topographical effects through a series of experiments, given an overall accuracy of the model output for NPP. The results showed that: (1) the terrain-based climate dataset presented more reliable topographic information and agreed more closely with the station dataset than the ordinary climate dataset over the successive 365-day time series of daily mean values; (2) on average, the ordinary climate dataset underestimated NPP by 12.5% compared with the terrain-based climate dataset over the whole study area; (3) the primary climate variables contributing to the topographical effects for the Wuling mountainous area were temperatures, which suggests that correcting for temperature differences is necessary to estimate NPP accurately in such complex terrain.

  15. clearScience: Infrastructure for Communicating Data-Intensive Science.

    Science.gov (United States)

    Bot, Brian M; Burdick, David; Kellen, Michael; Huang, Erich S

    2013-01-01

    Progress in biomedical research requires effective scientific communication to one's peers and to the public. Current research routinely encompasses large datasets and complex analytic processes, and the constraints of traditional journal formats limit useful transmission of these elements. We are constructing a framework through which authors can provide not only the narrative of what was done, but also the primary and derivative data, the source code, the compute environment, and web-accessible virtual machines. This infrastructure allows authors to "hand their machine", prepopulated with libraries, data, and code, to those interested in reviewing or building on their work. This project, "clearScience," seeks to provide an integrated system that accommodates the ad hoc nature of discovery in the data-intensive sciences and seamless transitions from working to reporting. We demonstrate that rather than merely describing the science being reported, one can deliver the science itself.

  16. Registering Active and Passive IMAGE RPI Datasets with the Virtual Wave Observatory

    Science.gov (United States)

    Galkin, I. A.; Fung, S.; King, T. A.; Reinisch, B. W.

    2008-12-01

    Development of the Virtual Wave Observatory (VWO) for acquired active/passive plasma wave and radiation datasets will be a significant step forward for the Heliophysics community in its efforts to make wave-specific science data searchable, understandable, and usable. The first phase of the VWO project commenced in September 2008 with the goal of converting the existing custom database storing wave data acquired by the Radio Plasma Imager (RPI) on the NASA IMAGE satellite into the VxO realm and, specifically, the SPASE Data Model. The RPI dataset comprises 1.2 million active and 0.8 million passive stepped-frequency measurements whose exploration incurs substantial data-search and expert-interpretation costs. Our attention is drawn to the ability of the VWO not only to organize numeric and display data records in a SPASE-compatible manner but, most importantly, to provide the essential means of capturing the wave research community's knowledge in accompanying metadata, so as to let users understand the VWO data collections and search them by phenomena and context conditions. To that end, we aim to extend the SPASE model to include wave-relevant terms and to develop a VWO annotation service that provides searchable data interpretations to scientists who may not be wave experts. The SPASE Data Model provides several means of describing datasets in a unified manner, grouping them into three large categories: (1) numeric data, (2) display data, and (3) catalogs. Whereas numeric data resources simply point to the instrument data, the other two categories refer to presentations of derived and interpreted information. We consider images of the RPI data to be derived products that required investment of time and effort to create, especially if their author provided interpretations of visible signatures and optimized the visualization settings to highlight those signatures. When such interpretations are available, they can be used to further group RPI data into categories …

  17. Automated Atmospheric Composition Dataset Level Metadata Discovery. Difficulties and Surprises

    Science.gov (United States)

    Strub, R. F.; Falke, S. R.; Kempler, S.; Fialkowski, E.; Goussev, O.; Lynnes, C.

    2015-12-01

    The Atmospheric Composition Portal (ACP) is an aggregator and curator of information related to remotely sensed atmospheric composition data and analysis. It uses existing tools and technologies and, where needed, enhances those capabilities to provide interoperable access, tools, and contextual guidance for scientists and value-adding organizations using remotely sensed atmospheric composition data. The initial focus is on Essential Climate Variables identified by the Global Climate Observing System: CH4, CO, CO2, NO2, O3, SO2 and aerosols. This poster addresses our efforts in building the ACP Data Table, an interface to help discover and understand remotely sensed data related to atmospheric composition science and applications. We harvested the GCMD, CWIC and GEOSS metadata catalogs using machine-to-machine technologies (OpenSearch, Web Services), and we manually investigated the plethora of CEOS data provider portals and other catalogs where the data might be aggregated. This poster presents our experience of the excellence, variety, and challenges we encountered. Conclusions: (1) The significant benefit the major catalogs provide is their machine-to-machine tooling, such as OpenSearch and Web Services, rather than any GUI usability improvements, given the large amount of data in their catalogs. (2) There is a trend at the large catalogs towards simulating small data provider portals through advanced services. (3) Populating metadata catalogs using ISO 19115 is too complex for users to do consistently; the format is difficult to parse visually or with XML libraries, and too complex for Java XML binders like Castor. (4) The ability to search for IDs first and then for data (GCMD and ECHO) works better for machine-to-machine operations than returning the entire metadata entry at once, which can cause timeouts. (5) Metadata harvest and export activities between the major catalogs have led to a significant amount of duplication (this is currently being addressed). (6) Most (if not …
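
    The machine-to-machine harvesting mentioned above boils down to issuing an OpenSearch-style GET request and parsing the feed that comes back. A minimal sketch follows; the endpoint URL and parameter names are hypothetical placeholders, not the actual GCMD, CWIC or GEOSS interfaces.

        import requests
        import xml.etree.ElementTree as ET

        resp = requests.get(
            "https://catalog.example.org/opensearch",   # hypothetical OpenSearch endpoint
            params={"q": "aerosol optical depth", "format": "atom", "count": 10},
            timeout=30,
        )
        resp.raise_for_status()

        # Pull the entry titles out of the Atom response.
        ns = {"atom": "http://www.w3.org/2005/Atom"}
        root = ET.fromstring(resp.content)
        for entry in root.findall("atom:entry", ns):
            print(entry.findtext("atom:title", namespaces=ns))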

  18. Building a better search engine for earth science data

    Science.gov (United States)

    Armstrong, E. M.; Yang, C. P.; Moroni, D. F.; McGibbney, L. J.; Jiang, Y.; Huang, T.; Greguska, F. R., III; Li, Y.; Finch, C. J.

    2017-12-01

    Free-text searching of earth science datasets has been implemented with varying degrees of success and completeness across the spectrum of the 12 NASA earth science data centers. At the JPL Physical Oceanography Distributed Active Archive Center (PO.DAAC), the search engine has been developed around the Solr/Lucene platform; others have chosen popular enterprise search platforms such as Elasticsearch. Regardless, the default implementations of these search engines, which leverage factors such as dataset popularity, term frequency and inverse document frequency, do not fully meet the needs of precise relevancy and ranking of earth science search results. For the PO.DAAC, this shortcoming has been identified for several years by its external User Working Group, which has issued several recommendations to improve the relevancy and discoverability of datasets related to remotely sensed sea surface temperature, ocean wind, waves, salinity, height and gravity, comprising over 500 publicly available datasets in total. Recently, the PO.DAAC has teamed with an effort led by George Mason University to improve the search and relevancy ranking of oceanographic data via a simple search interface and powerful backend services called MUDROD (Mining and Utilizing Dataset Relevancy from Oceanographic Datasets to Improve Data Discovery), funded by the NASA AIST program. MUDROD mines the combination of PO.DAAC earth science dataset metadata, usage metrics, and user feedback and search history to objectively extract relevance for improved data discovery and access. In addition to improved dataset relevance and ranking, the MUDROD search engine also returns recommendations of related datasets and related user queries. This presentation will report on the use cases that drove the architecture and development, and on the success metrics and improvements in search precision and recall that MUDROD has demonstrated over the existing PO.DAAC search …

  19. Redesigning the DOE Data Explorer to Embed Dataset Relationships at the Point of Search and to Reflect Landing Page Organization

    Directory of Open Access Journals (Sweden)

    Sara Studwell

    2017-04-01

    Scientific research is producing ever-increasing amounts of data. Organizing and reflecting relationships across data collections, datasets, publications, and other research objects are essential functionalities of the modern science environment, yet challenging to implement. Landing pages are often used to provide a 'big picture' contextual framework for datasets and data collections, and many large-volume data holders are utilizing them in thoughtful, creative ways. The benefits of their organizational efforts, however, are not realized unless the user eventually sees the landing page at the end point of the search. What if that organization and 'big picture' context could benefit the user at the beginning of the search? That is a challenging approach, but the Department of Energy's (DOE) Office of Scientific and Technical Information (OSTI) is redesigning the database functionality of the DOE Data Explorer (DDE) with that goal in mind. Phase I is focused on redesigning the DDE database to leverage relationships between two existing distinct populations in DDE, data Projects and individual Datasets, and then adding a third, intermediate population, data Collections. Mapped, structured linkages, designed to show users these relationships, will allow users to make informed search choices. The linkages will be sustainable and scalable, created automatically with the use of new metadata fields and existing authorities. Phase II will study selected DOE Data ID Service clients, analyzing how their landing pages are organized and how that organization might be used to improve DDE search capabilities. At the heart of both phases is the realization that adding more metadata for cross-referencing may require additional effort from data scientists. OSTI's approach seeks to leverage existing metadata and landing-page intelligence without imposing an additional burden on the data creators.

  20. Primary Science Interview: Science Sparks

    Science.gov (United States)

    Bianchi, Lynne

    2016-01-01

    In this "Primary Science" interview, Lynne Bianchi talks with Emma Vanstone about "Science Sparks," which is a website full of creative, fun, and exciting science activity ideas for children of primary-school age. "Science Sparks" started with the aim of inspiring more parents to do science at home with their…

  1. Climate Model Evaluation using New Datasets from the Clouds and the Earth's Radiant Energy System (CERES)

    Science.gov (United States)

    Loeb, Norman G.; Wielicki, Bruce A.; Doelling, David R.

    2008-01-01

    There are some in the science community who believe that the response of the climate system to anthropogenic radiative forcing is unpredictable and that we should therefore "call off the quest". The key limitation in climate predictability is associated with cloud feedback. Narrowing the uncertainty in cloud feedback (and therefore climate sensitivity) requires optimal use of the best available observations to evaluate and improve climate model processes and to constrain climate model simulations over longer time scales. The Clouds and the Earth's Radiant Energy System (CERES) is a satellite-based program that provides global cloud, aerosol and radiative flux observations for improving our understanding of cloud-aerosol-radiation feedbacks in the Earth's climate system. CERES is the successor to the Earth Radiation Budget Experiment (ERBE), which has been widely used to evaluate climate models both at short time scales (e.g., process studies) and at decadal time scales. A CERES instrument flew on the TRMM satellite and captured the dramatic 1998 El Nino, and four other CERES instruments are currently flying aboard the Terra and Aqua platforms. Plans are underway to fly the remaining copy of CERES on the upcoming NPP spacecraft (mid-2010 launch date). Every aspect of CERES represents a significant improvement over ERBE. While both CERES and ERBE measure broadband radiation, CERES calibration is a factor of 2 better than ERBE's. In order to improve the characterization of clouds and aerosols within a CERES footprint, we use coincident higher-resolution imager observations (VIRS, MODIS or VIIRS) to provide a consistent cloud-aerosol-radiation dataset at climate accuracy. Improved radiative fluxes are obtained by using new CERES-derived Angular Distribution Models (ADMs) for converting measured radiances to fluxes. CERES radiative fluxes are a factor of 2 more accurate than ERBE's overall, and the improvement by cloud type and at high latitudes can be as high as a factor of 5 …

  2. A novel dataset for real-life evaluation of facial expression recognition methodologies

    NARCIS (Netherlands)

    Siddiqi, Muhammad Hameed; Ali, Maqbool; Idris, Muhammad; Banos Legran, Oresti; Lee, Sungyoung; Choo, Hyunseung

    2016-01-01

    One limitation seen among most of the previous methods is that they were evaluated under settings that are far from real-life scenarios. The reason is that the existing facial expression recognition (FER) datasets are mostly pose-based and assume a predefined setup. The expressions in these datasets …

  3. Full-Scale Approximations of Spatio-Temporal Covariance Models for Large Datasets

    KAUST Repository

    Zhang, Bohai; Sang, Huiyan; Huang, Jianhua Z.

    2014-01-01

    … of dataset, and application of such models is not feasible for large datasets. This article extends the full-scale approximation (FSA) approach by Sang and Huang (2012) to the spatio-temporal context to reduce computational complexity. A reversible jump Markov …

  4. Gridded precipitation dataset for the Rhine basin made with the genRE interpolation method

    NARCIS (Netherlands)

    Osnabrugge, van B.; Uijlenhoet, R.

    2017-01-01

    A high-resolution (1.2 x 1.2 km) gridded precipitation dataset with an hourly time step that covers the whole Rhine basin for the period 1997-2015, made from gauge data with the genRE interpolation scheme. See "genRE: A method to extend gridded precipitation climatology datasets in near real-time for …"

  5. TrackingNet: A Large-Scale Dataset and Benchmark for Object Tracking in the Wild

    KAUST Repository

    Müller, Matthias; Bibi, Adel Aamer; Giancola, Silvio; Al-Subaihi, Salman; Ghanem, Bernard

    2018-01-01

    Despite the numerous developments in object tracking, further development of current tracking algorithms is limited by small and mostly saturated datasets. As a matter of fact, data-hungry trackers based on deep-learning currently rely on object detection datasets due to the scarcity of dedicated large-scale tracking datasets. In this work, we present TrackingNet, the first large-scale dataset and benchmark for object tracking in the wild. We provide more than 30K videos with more than 14 million dense bounding box annotations. Our dataset covers a wide selection of object classes in broad and diverse context. By releasing such a large-scale dataset, we expect deep trackers to further improve and generalize. In addition, we introduce a new benchmark composed of 500 novel videos, modeled with a distribution similar to our training dataset. By sequestering the annotation of the test set and providing an online evaluation server, we provide a fair benchmark for future development of object trackers. Deep trackers fine-tuned on a fraction of our dataset improve their performance by up to 1.6% on OTB100 and up to 1.7% on TrackingNet Test. We provide an extensive benchmark on TrackingNet by evaluating more than 20 trackers. Our results suggest that object tracking in the wild is far from being solved.
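
    Benchmarks of this kind score trackers frame by frame with overlap measures, the usual one being intersection-over-union (IoU) between predicted and ground-truth boxes; success and precision curves are then built on top of the per-frame scores. A minimal IoU implementation for [x, y, width, height] boxes, independent of TrackingNet's exact evaluation server:

        def iou(box_a, box_b):
            """IoU of two boxes given as [x, y, width, height]."""
            ax1, ay1, ax2, ay2 = box_a[0], box_a[1], box_a[0] + box_a[2], box_a[1] + box_a[3]
            bx1, by1, bx2, by2 = box_b[0], box_b[1], box_b[0] + box_b[2], box_b[1] + box_b[3]
            iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
            ih = max(0.0, min(ay2, by2) - max(ay1, by1))
            inter = iw * ih
            union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
            return inter / union if union > 0 else 0.0

        print(iou([0, 0, 10, 10], [5, 5, 10, 10]))   # 0.142857...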

  6. Omicseq: a web-based search engine for exploring omics datasets

    Science.gov (United States)

    Sun, Xiaobo; Pittard, William S.; Xu, Tianlei; Chen, Li; Zwick, Michael E.; Jiang, Xiaoqian; Wang, Fusheng; Qin, Zhaohui S.

    2017-01-01

    The development and application of high-throughput genomics technologies has resulted in massive quantities of diverse omics data that continue to accumulate rapidly. These rich datasets offer unprecedented and exciting opportunities to address long-standing questions in biomedical research. However, our ability to explore and query the content of diverse omics data is very limited. Existing dataset search tools rely almost exclusively on the metadata. A text-based query for gene name(s) does not work well on datasets wherein the vast majority of their content is numeric. To overcome this barrier, we have developed Omicseq, a novel web-based platform that facilitates the easy interrogation of omics datasets holistically to improve 'findability' of relevant data. The core component of Omicseq is trackRank, a novel algorithm for ranking omics datasets that fully uses the numerical content of the dataset to determine relevance to the query entity. The Omicseq system is supported by a scalable and elastic NoSQL database that hosts a large collection of processed omics datasets. In the front end, a simple, web-based interface allows users to enter queries and instantly receive search results as a list of ranked datasets deemed to be the most relevant. Omicseq is freely available at http://www.omicseq.org. PMID:28402462

  7. Document Questionnaires and Datasets with DDI: A Hands-On Introduction with Colectica

    OpenAIRE

    Iverson, Jeremy; Smith, Dan

    2018-01-01

    This workshop offers a hands-on, practical approach to creating and documenting both surveys and datasets with DDI and Colectica. Participants will build and field a DDI-driven survey using their own questions or samples provided in the workshop. They will then ingest, annotate, and publish DDI dataset descriptions using the collected survey data.

  8. SAR image classification based on CNN in real and simulation datasets

    Science.gov (United States)

    Peng, Lijiang; Liu, Ming; Liu, Xiaohua; Dong, Liquan; Hui, Mei; Zhao, Yuejin

    2018-04-01

    Convolutional neural networks (CNNs) have achieved great success in image classification tasks. Even in the field of synthetic aperture radar automatic target recognition (SAR-ATR), state-of-the-art results have been obtained by learning deep feature representations on the MSTAR benchmark. However, the raw MSTAR data have a shortcoming for training a SAR-ATR model: the backgrounds of the SAR images within each class are highly similar, so a CNN learns feature hierarchies of the backgrounds as well as of the targets. To assess the influence of the background, additional SAR image datasets were created containing simulated SAR images of 10 manufactured targets, such as tanks and fighter aircraft, with backgrounds sampled from the original MSTAR data. These simulation datasets include one in which the backgrounds of each image class correspond to one kind of MSTAR target or clutter background, and one in which each image receives a random background drawn from all MSTAR targets or clutters. In addition, mixed datasets combining MSTAR and simulated images were created for the experiments. The CNN architecture proposed in this paper was trained on all of the datasets mentioned above. The experimental results show that the architecture achieves high performance on all datasets, even when the image backgrounds are miscellaneous, which indicates that it can learn a good representation of the targets despite drastic changes in background.
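
    A CNN of the kind used for 10-class SAR target classification can be sketched in a few lines of PyTorch; the layer sizes, the 64 x 64 chip size, and the class count below are illustrative assumptions, not the architecture proposed in the paper.

        import torch
        import torch.nn as nn

        class SmallCNN(nn.Module):
            def __init__(self, n_classes=10):
                super().__init__()
                self.features = nn.Sequential(
                    nn.Conv2d(1, 16, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
                    nn.Conv2d(16, 32, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
                )
                self.classifier = nn.Linear(32 * 16 * 16, n_classes)

            def forward(self, x):               # x: (batch, 1, 64, 64) SAR chips
                return self.classifier(self.features(x).flatten(1))

        model = SmallCNN()
        logits = model(torch.randn(4, 1, 64, 64))   # four synthetic 64 x 64 chips
        print(logits.shape)                         # torch.Size([4, 10])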

  9. Resolution testing and limitations of geodetic and tsunami datasets for finite fault inversions along subduction zones

    Science.gov (United States)

    Williamson, A.; Newman, A. V.

    2017-12-01

    Finite fault inversions utilizing multiple datasets have become commonplace for large earthquakes when data are available. The mixture of geodetic datasets such as Global Navigation Satellite System (GNSS) and InSAR observations, seismic waveforms and, when applicable, tsunami waveforms from Deep-ocean Assessment and Reporting of Tsunamis (DART) gauges provides slightly different observations that, when incorporated together, lead to a more robust model of the fault slip distribution. The merging of different datasets is of particular importance along subduction zones, where direct observations of seafloor deformation over the rupture area are extremely limited; instead, instrumentation measures related ground motion from tens to hundreds of kilometers away. The distance from the event and the dataset type lead to a variable degree of resolution, affecting the ability to accurately model the spatial distribution of slip. This study analyzes the spatial resolution attained by geodetic and tsunami datasets individually as well as combined. We constrain the importance of the distance between estimated parameters and observed data, and how it varies between land-based and open-ocean datasets. The analysis focuses on accurately scaled subduction zone synthetic models as well as on the relationship between slip and data in recent large subduction zone earthquakes. This study shows that datasets sensitive to seafloor deformation, such as open-ocean tsunami waveforms or seafloor geodetic instrumentation, can provide unique offshore resolution for understanding most large, and particularly tsunamigenic, megathrust earthquake activity. In most environments, we simply lack the capability to resolve static displacements using land-based geodetic observations.
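
    The resolution question can be made concrete with the standard linear forward model d = Gm: the model resolution matrix R = G+G (with G+ the pseudoinverse) shows which fault patches a given acquisition geometry can actually constrain. A toy numpy sketch, with a random matrix standing in for the Green's functions:

        import numpy as np

        rng = np.random.default_rng(2)
        n_patches, n_obs = 20, 12                  # underdetermined, as offshore problems often are
        G = rng.normal(size=(n_obs, n_patches))    # stand-in Green's function matrix

        R = np.linalg.pinv(G) @ G                  # model resolution matrix

        # Diagonal entries near 1 mark well-resolved patches; near 0, poorly resolved ones.
        print(np.round(np.diag(R), 2))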

  11. A Dataset of Three Educational Technology Experiments on Differentiation, Formative Testing and Feedback

    Science.gov (United States)

    Haelermans, Carla; Ghysels, Joris; Prince, Fernao

    2015-01-01

    This paper describes a dataset with data from three individually randomized educational technology experiments on differentiation, formative testing and feedback during one school year for a group of 8th grade students in the Netherlands, using administrative data and the online motivation questionnaire of Boekaerts. The dataset consists of pre-…

  12. Developing predictive imaging biomarkers using whole-brain classifiers: Application to the ABIDE I dataset

    Directory of Open Access Journals (Sweden)

    Swati Rane

    2017-03-01

    We designed a modular machine learning program that uses functional magnetic resonance imaging (fMRI) data to distinguish individuals with autism spectrum disorders from neurodevelopmentally normal individuals. Data were selected from the Autism Brain Imaging Data Exchange (ABIDE) I Preprocessed Dataset.

  14. Communicating Science

    Science.gov (United States)

    Russell, Nicholas

    2009-10-01

    Introduction: what this book is about and why you might want to read it; Prologue: three orphans share a common paternity: professional science communication, popular journalism, and literary fiction are not as separate as they seem; Part I. Professional Science Communication: 1. Spreading the word: the endless struggle to publish professional science; 2. Walk like an Egyptian: the alien feeling of professional science writing; 3. The future's bright? Professional science communication in the age of the internet; 4. Counting the horse's teeth: professional standards in science's barter economy; 5. Separating the wheat from the chaff: peer review on trial; Part II. Science for the Public: What Science Do People Need and How Might They Get It?: 6. The Public Understanding of Science (PUS) movement and its problems; 7. Public engagement with science and technology (PEST): fine principle, difficult practice; 8. Citizen scientists? Democratic input into science policy; 9. Teaching and learning science in schools: implications for popular science communication; Part III. Popular Science Communication: The Press and Broadcasting: 10. What every scientist should know about mass media; 11. What every scientist should know about journalists; 12. The influence of new media; 13. How the media represents science; 14. How should science journalists behave?; Part IV. The Origins of Science in Cultural Context: Five Historic Dramas: 15. A terrible storm in Wittenberg: natural knowledge through sorcery and evil; 16. A terrible storm in the Mediterranean: controlling nature with white magic and religion; 17. Thieving magpies: the subtle art of false projecting; 18. Foolish virtuosi: natural philosophy emerges as a distinct discipline but many cannot take it seriously; 19. Is scientific knowledge 'true' or should it just be 'truthfully' deployed?; Part V. Science in Literature: 20. Science and the Gothic: the three big nineteenth-century monster stories; 21. Science fiction: serious …

  15. Spatially continuous dataset at local scale of Taita Hills in Kenya and Mount Kilimanjaro in Tanzania

    Directory of Open Access Journals (Sweden)

    Sizah Mwalusepo

    2016-09-01

    Climate change is a global concern, requiring spatially continuous datasets and modeling of meteorological variables at the local scale. This data article provides interpolated temperature, rainfall and relative humidity datasets at the local scale along the Taita Hills and Mount Kilimanjaro altitudinal gradients in Kenya and Tanzania, respectively. Temperature and relative humidity were recorded hourly using automatic Onset HOBO data loggers, and rainfall was recorded daily using GENERAL® wireless rain gauges. Thin plate spline (TPS) interpolation was used, with the degree of data smoothing determined by minimizing the generalized cross-validation. The dataset provides information on the status of the current climatic conditions along the two mountainous altitudinal gradients in Kenya and Tanzania, and will thus enhance future research. Keywords: Spatial climate data, Climate change, Modeling, Local scale
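
    The TPS gridding step can be sketched with SciPy's thin-plate-spline RBF interpolator standing in for the authors' implementation; the station coordinates, temperatures, and smoothing value below are synthetic placeholders (the article selects smoothing by generalized cross-validation instead).

        import numpy as np
        from scipy.interpolate import RBFInterpolator

        rng = np.random.default_rng(3)
        stations = rng.uniform(0, 1, size=(30, 2))   # normalised lon/lat of 30 loggers
        temps = 20 + 5 * stations[:, 0] + rng.normal(scale=0.5, size=30)

        tps = RBFInterpolator(stations, temps,
                              kernel="thin_plate_spline", smoothing=1e-3)

        gx, gy = np.meshgrid(np.linspace(0, 1, 50), np.linspace(0, 1, 50))
        grid = tps(np.column_stack([gx.ravel(), gy.ravel()])).reshape(50, 50)
        print(grid.shape)                            # interpolated 50 x 50 temperature field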

  16. PERFORMANCE COMPARISON FOR INTRUSION DETECTION SYSTEM USING NEURAL NETWORK WITH KDD DATASET

    Directory of Open Access Journals (Sweden)

    S. Devaraju

    2014-04-01

    Intrusion detection systems face the challenging task of determining whether a user is normal or an attacker in organizational information systems and the IT industry. An intrusion detection system is an effective method of dealing with such problems in networks, and different classifiers can be used to detect the different kinds of attacks. In this paper, the performance of intrusion detection is compared across several neural network classifiers. In the proposed research, four types of classifiers are used: the Feed Forward Neural Network (FFNN), the Generalized Regression Neural Network (GRNN), the Probabilistic Neural Network (PNN) and the Radial Basis Neural Network (RBNN). The performance on the full-featured KDD Cup 1999 dataset is compared with that on a reduced-feature version of the dataset. MATLAB is used to train and test on the dataset, and the efficiency and false alarm rate are measured. The results show that the reduced-feature dataset performs better than the full-featured dataset.
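
    A minimal version of the reduced-feature experiment, with scikit-learn's MLP as the feed-forward classifier and simple univariate selection standing in for the paper's feature reduction; the data below are synthetic placeholders for KDD Cup 1999 records (which have 41 features).

        import numpy as np
        from sklearn.feature_selection import SelectKBest, f_classif
        from sklearn.neural_network import MLPClassifier
        from sklearn.model_selection import train_test_split

        rng = np.random.default_rng(4)
        X = rng.normal(size=(1000, 41))              # 41 features, as in KDD Cup 1999
        y = (X[:, 0] + X[:, 5] > 0).astype(int)      # toy normal-vs-attack label

        X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
        sel = SelectKBest(f_classif, k=10).fit(X_tr, y_tr)   # reduced feature set

        clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)
        clf.fit(sel.transform(X_tr), y_tr)
        print(clf.score(sel.transform(X_te), y_te))  # accuracy on the held-out split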

  17. A global gridded dataset of daily precipitation going back to 1950, ideal for analysing precipitation extremes

    Science.gov (United States)

    Contractor, S.; Donat, M.; Alexander, L. V.

    2017-12-01

    Reliable observations of precipitation are necessary to determine past changes in precipitation and to validate models, allowing for reliable future projections. Existing gauge-based gridded datasets of daily precipitation and satellite-based observations contain artefacts and have short records, making them unsuitable for analysing precipitation extremes. The largest limiting factor for the gauge-based datasets is a dense and reliable station network. Currently, there are two major archives of global in situ daily rainfall data: the Global Historical Climatology Network (GHCN-Daily), hosted by the National Oceanic and Atmospheric Administration (NOAA), and that of the Global Precipitation Climatology Centre (GPCC), part of the Deutscher Wetterdienst (DWD). We combine the two archives and use automated quality control techniques to create a reliable long-term network of raw station data, which we then interpolate using block kriging to create a global gridded dataset of daily precipitation going back to 1950. We compare our interpolated dataset with existing global gridded data of daily precipitation, NOAA Climate Prediction Center (CPC) Global V1.0 and GPCC Full Data Daily Version 1.0, as well as various regional datasets. We find that our raw station density is much higher than that of other datasets. To avoid artefacts due to station network variability, we provide multiple versions of our dataset based on various completeness criteria, and we provide the standard deviation, kriging error and number of stations for each grid cell and timestep to encourage responsible use of the dataset. Despite our efforts to increase the raw data density, the in situ station network remains sparse in India after the 1960s and in Africa throughout the timespan of the dataset. Our dataset will allow for more reliable global analyses of rainfall, including its extremes, and pave the way for better global precipitation observations with lower and more transparent uncertainties.
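
    The gridding step can be sketched with ordinary kriging from the PyKrige package as a simple stand-in for the block kriging used here; the gauge locations, rainfall values, and variogram model are assumptions for illustration.

        import numpy as np
        from pykrige.ok import OrdinaryKriging

        rng = np.random.default_rng(5)
        lon, lat = rng.uniform(0, 10, 100), rng.uniform(0, 10, 100)
        rain = rng.gamma(2.0, 2.0, 100)                      # synthetic daily totals (mm)

        ok = OrdinaryKriging(lon, lat, rain, variogram_model="spherical")
        grid_lon, grid_lat = np.arange(0, 10, 0.5), np.arange(0, 10, 0.5)
        z, var = ok.execute("grid", grid_lon, grid_lat)      # estimates + kriging variance

        # The kriging variance is what supports per-cell error fields like those
        # distributed with the dataset described above.
        print(z.shape, float(var.mean()))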

  18. Resampling Methods Improve the Predictive Power of Modeling in Class-Imbalanced Datasets

    Directory of Open Access Journals (Sweden)

    Paul H. Lee

    2014-09-01

    In the medical field, many outcome variables are dichotomized, and the two possible values of a dichotomized variable are referred to as classes. A dichotomized dataset is class-imbalanced if it consists mostly of one class, and the performance of common classification models on this type of dataset tends to be suboptimal. To tackle such a problem, resampling methods, including oversampling and undersampling, can be used. This paper illustrates the effect of resampling methods using the National Health and Nutrition Examination Survey (NHANES) 2009-2010 wave. A total of 4677 participants aged ≥20 without self-reported diabetes and with valid blood test results were analyzed. The Classification and Regression Tree (CART) procedure was used to build a classification model for undiagnosed diabetes, where a participant was classified as positive on demonstrating evidence of diabetes according to WHO diabetes criteria. Exposure variables included demographics and socio-economic status. CART models were fitted using a randomly selected 70% of the data (training dataset), and the area under the receiver operating characteristic curve (AUC) was computed on the remaining 30% of the sample (testing dataset). CART models were fitted using the training dataset, the oversampled training dataset, the weighted training dataset, and the undersampled training dataset, and resampling case-to-control ratios of 1:1, 1:2, and 1:4 were examined. The effect of resampling on the performance of extensions of CART (random forests and generalized boosted trees) was also examined. CARTs fitted on the oversampled (AUC = 0.70) and undersampled training data (AUC = 0.74) yielded better classification power than those fitted on the unmodified training data (AUC = 0.65). Resampling could also improve the classification power of random forests and generalized boosted trees. To conclude, applying resampling methods to a class-imbalanced dataset improved the classification power of CART, random forests, and generalized boosted trees.
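
    Both resampling strategies reduce to a few lines with sklearn.utils.resample; the 5%-positive synthetic table below stands in for the NHANES records, and the 1:1 target ratio is just one of the ratios examined in the paper.

        import numpy as np
        from sklearn.utils import resample

        rng = np.random.default_rng(6)
        X = rng.normal(size=(1000, 5))
        y = (rng.uniform(size=1000) < 0.05).astype(int)   # ~5% positive class

        pos, neg = X[y == 1], X[y == 0]

        # Oversample the minority class up to the majority size (1:1 ratio)...
        pos_up = resample(pos, n_samples=len(neg), replace=True, random_state=0)
        X_over = np.vstack([neg, pos_up])
        y_over = np.r_[np.zeros(len(neg)), np.ones(len(pos_up))]

        # ...or undersample the majority class down to the minority size.
        neg_dn = resample(neg, n_samples=len(pos), replace=False, random_state=0)
        X_under = np.vstack([neg_dn, pos])
        y_under = np.r_[np.zeros(len(neg_dn)), np.ones(len(pos))]
        print(y_over.mean(), y_under.mean())              # both 0.5 after rebalancing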

  19. Unified Access Architecture for Large-Scale Scientific Datasets

    Science.gov (United States)

    Karna, Risav

    2014-05-01

    Data-intensive sciences have to deploy diverse large-scale database technologies for data analytics, as scientists now deal with much larger data volumes than ever before. While array databases have bridged many gaps between the needs of data-intensive research fields and DBMS technologies (Zhang 2011), invocation of the other big data tools that accompany these databases is still manual and separate from the database management interface. We identify this as an architectural challenge that will increasingly complicate the user's workflow, owing to the growing number of useful but isolated and niche database tools; such use of data analysis tools in effect leaves the burden of synchronizing results between the external tools and the database management system on the user. To this end, we propose a unified access interface for using big data tools within a large-scale scientific array database, in which database queries themselves embed foreign routines belonging to the big data tools. Such invocation of foreign data manipulation routines inside a database query can be made possible through a user-defined function (UDF). UDFs that allow enough freedom to call modules from another language, and to interface back and forth between the query body and the side-loaded functions, are needed for this purpose. For this research we attempt the coupling of four widely used tools, Hadoop (hadoop1), Matlab (matlab1), R (r1) and ScaLAPACK (scalapack1), with the UDF feature of rasdaman (Baumann 98), an array-based data manager, to investigate this concept. The native array data model used by an array-based data manager provides compact data storage and high-performance operations on ordered data, such as spatial data, temporal data, and matrix-based data for linear algebra operations (scidbusr1). Performance issues arising from coupling tools with different paradigms, niche functionalities, separate processes and output …

  20. HydroSHEDS: A global comprehensive hydrographic dataset

    Science.gov (United States)

    Wickel, B. A.; Lehner, B.; Sindorf, N.

    2007-12-01

    The Hydrological data and maps based on SHuttle Elevation Derivatives at multiple Scales (HydroSHEDS) is an innovative product that, for the first time, provides hydrographic information in a consistent and comprehensive format for regional and global-scale applications. HydroSHEDS offers a suite of geo-referenced data sets, including stream networks, watershed boundaries, drainage directions, and ancillary data layers such as flow accumulations, distances, and river topology information. The goal of developing HydroSHEDS was to generate key data layers to support regional and global watershed analyses, hydrological modeling, and freshwater conservation planning at a quality, resolution and extent that had previously been unachievable. Available resolutions range from 3 arc-second (approx. 90 meters at the equator) to 5 minute (approx. 10 km at the equator) with seamless near-global extent. HydroSHEDS is derived from elevation data of the Shuttle Radar Topography Mission (SRTM) at 3 arc-second resolution. The original SRTM data have been hydrologically conditioned using a sequence of automated procedures. Existing methods of data improvement and newly developed algorithms have been applied, including void filling, filtering, stream burning, and upscaling techniques. Manual corrections were made where necessary. Preliminary quality assessments indicate that the accuracy of HydroSHEDS significantly exceeds that of existing global watershed and river maps. HydroSHEDS was developed by the Conservation Science Program of the World Wildlife Fund (WWF) in partnership with the U.S. Geological Survey (USGS), the International Centre for Tropical Agriculture (CIAT), The Nature Conservancy (TNC), and the Center for Environmental Systems Research (CESR) of the University of Kassel, Germany.

  1. Dataset for an analysis of communicative aspects of finance

    Directory of Open Access Journals (Sweden)

    Natalya Zavyalova

    2017-04-01

    Full Text Available The article describes a step-by-step strategy for designing a universal comprehensive vision of a vast majority of financial research topics. The strategy is focused around the analysis of the retrieval results of the word processing system Serelex which is based on the semantic similarity measure. While designing a research topic, scientists usually employ their individual background. They rely in most cases on their individual assumptions and hypotheses. The strategy, introduced in the article, highlights the method of identifying components of semantic maps which can lead to a better coverage of any scientific topic under analysis. On the example of the research field of finance we show the practical and theoretical value of semantic similarity measurements, i.e., a better coverage of the problems which might be included in the scientific analysis of financial field. At the designing stage of any research scientists are not immune to an insufficient and, thus, erroneous spectrum of problems under analysis. According to the famous maxima of St. Augustine, ‘Fallor ergo sum’, the researchers’ activities are driven along the way from one mistake to another. However, this might not be the case for the 21st century science approach. Our strategy offers an innovative methodology, according to which the number of mistakes at the initial stage of any research may be significantly reduced. The data, obtained, was used in two articles (N. Zavyalova, 2017 [7], (N. Zavyalova, 2015 [8]. The second stage of our experiment was driven towards analyzing the correlation between the language and income level of the respondents. The article contains the information about data processing.

  2. Evaluation Science

    Science.gov (United States)

    Patton, Michael Quinn

    2018-01-01

    Culturally and politically, science is under attack. The core consequence of perceiving and asserting evaluation as science is that doing so enhances our credibility and effectiveness in supporting the importance of science in our world, and brings us together with other scientists to make common cause in supporting and advocating for science. Other…

  3. Science/s.

    Directory of Open Access Journals (Sweden)

    Emmanuelle Tricoire

    2005-03-01

    A forum organised by the European Commission in March was called "Science in Society". Since 2000, the Commission has run an Action Plan designed to promote "science" among the public, so that citizens can make good, informed decisions. The aim is thus to develop reflexivity within society, so that society acts with discernment in a world it is working to make sustainable. ...

  4. Two MODIS Aerosol Products over Ocean on the Terra and Aqua CERES SSF Datasets.

    Science.gov (United States)

    Ignatov, Alexander; Minnis, Patrick; Loeb, Norman; Wielicki, Bruce; Miller, Walter; Sun-Mack, Sunny; Tanré, Didier; Remer, Lorraine; Laszlo, Istvan; Geier, Erika

    2005-04-01

    Understanding the impact of aerosols on the earth's radiation budget and the long-term climate record requires consistent measurements of aerosol properties and radiative fluxes. The Clouds and the Earth's Radiant Energy System (CERES) Science Team combines satellite-based retrievals of aerosols, clouds, and radiative fluxes into Single Scanner Footprint (SSF) datasets from the Terra and Aqua satellites. Over ocean, two aerosol products are derived from the Moderate Resolution Imaging Spectroradiometer (MODIS) using different sampling and aerosol algorithms. The primary, or M, product is taken from the standard multispectral aerosol product developed by the MODIS aerosol group, while a simpler, secondary [Advanced Very High Resolution Radiometer (AVHRR)-like], or A, product is derived by the CERES Science Team using a different cloud-clearing method and a single-channel aerosol algorithm. Two aerosol optical depths (AOD), τA1 and τA2, are derived from MODIS bands 1 (0.644 μm) and 6 (1.632 μm), resembling the AVHRR/3 channels 1 and 3A, respectively. On Aqua the retrievals are made in band 7 (2.119 μm) because of poor-quality data from band 6. The respective Ångström exponents can be derived from the values of τ. The A product serves as a backup for the M product. More importantly, the overlap of these aerosol products is essential for placing the 20+ year heritage AVHRR aerosol record in the context of more advanced aerosol sensors and algorithms such as that used for the M product. This study documents the M and A products, highlighting their CERES SSF specifics. Based on 2 weeks of global Terra data, coincident M and A AODs are found to be strongly correlated in both bands. However, the domains in which the M and A aerosols are available, and the respective τ/α statistics, differ significantly because of sampling discrepancies due to differences in cloud and sun-glint screening. In both aerosol products, correlation is observed between the retrieved…
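
    The Ångström exponent mentioned above has a closed form in terms of the two AODs and their wavelengths; a rough illustration in Python follows (the wavelengths are the Terra band pair quoted in the record, but the τ values are invented for the example):

        import numpy as np

        def angstrom_exponent(tau1, tau2, lam1=0.644, lam2=1.632):
            """Angstrom exponent from AODs tau1, tau2 at wavelengths lam1, lam2 (micrometres)."""
            return -np.log(tau1 / tau2) / np.log(lam1 / lam2)

        # Illustrative values only: tau = 0.15 at 0.644 um and 0.06 at 1.632 um
        print(round(angstrom_exponent(0.15, 0.06), 2))  # ~0.99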

  5. Sensitivity of a numerical wave model on wind re-analysis datasets

    Science.gov (United States)

    Lavidas, George; Venugopal, Vengatesan; Friedrich, Daniel

    2017-03-01

    Wind is the dominant process for wave generation. Detailed evaluation of metocean conditions strengthens our understanding of issues concerning potential offshore applications. However, the scarcity of buoys and the high cost of monitoring systems pose a barrier to properly defining offshore conditions. Through the use of numerical wave models, metocean conditions can be hindcast and forecast, providing reliable characterisations. This study reports the sensitivity of a numerical wave model to wind inputs for the Scottish region. Two re-analysis wind datasets with different spatio-temporal characteristics are used, the ERA-Interim Re-Analysis and the CFSR-NCEP Re-Analysis dataset. Different wind products alter results, affecting the accuracy obtained. The scope of this study is to assess the available wind databases and provide information on the most appropriate wind dataset for the specific region, in temporal, spatial and geographic terms, for wave modelling and offshore applications. Both wind input datasets delivered results from the numerical wave model with good correlation. Wave results from the 1-h dataset have higher peaks and lower biases, at the expense of a higher scatter index. On the other hand, the 6-h dataset has lower scatter but higher biases. The study shows how the wind dataset affects numerical wave modelling performance, and that, depending on location and study needs, different wind inputs should be considered.
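
    The bias/scatter-index trade-off reported above rests on standard hindcast skill metrics; a minimal sketch of common definitions follows (exact formulations vary between wave-modelling studies, so treat these as one reasonable choice):

        import numpy as np

        def validation_stats(model, obs):
            """Common skill metrics of a wave hindcast against buoy observations."""
            model, obs = np.asarray(model, float), np.asarray(obs, float)
            bias = np.mean(model - obs)
            rmse = np.sqrt(np.mean((model - obs) ** 2))
            si = rmse / np.mean(obs)            # scatter index
            r = np.corrcoef(model, obs)[0, 1]   # linear correlation
            return {"bias": bias, "rmse": rmse, "scatter_index": si, "r": r}

        # Synthetic example: significant wave heights (m)
        obs = np.array([1.2, 2.1, 0.8, 3.4, 2.7])
        mod = np.array([1.4, 1.9, 0.9, 3.8, 2.5])
        print(validation_stats(mod, obs))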

  6. Knowledge Mining from Clinical Datasets Using Rough Sets and Backpropagation Neural Network

    Directory of Open Access Journals (Sweden)

    Kindie Biredagn Nahato

    2015-01-01

    The availability of clinical datasets and knowledge-mining methodologies encourages researchers to pursue research in extracting knowledge from clinical datasets. Different data mining techniques have been used for mining rules, and mathematical models have been developed to assist the clinician in decision making. The objective of this research is to build a classifier that will predict the presence or absence of a disease by learning from a minimal set of attributes extracted from the clinical dataset. In this work, a rough-set indiscernibility-relation method combined with a backpropagation neural network (RS-BPNN) is used. The work has two stages. The first stage is the handling of missing values, to obtain a smooth dataset, and the selection of appropriate attributes from the clinical dataset by the indiscernibility-relation method. The second stage is classification using a backpropagation neural network on the selected reducts of the dataset. The classifier has been tested with the hepatitis, Wisconsin breast cancer, and Statlog heart disease datasets obtained from the University of California at Irvine (UCI) machine learning repository. The accuracy obtained from the proposed method is 97.3%, 98.6%, and 90.4% for hepatitis, breast cancer, and heart disease, respectively. The proposed system provides an effective classification model for clinical datasets.
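
    A rough sketch of the two-stage idea, not the paper's implementation: a generic feature selector stands in for the rough-set reduct, and scikit-learn's MLPClassifier stands in for the backpropagation network, run here on the UCI Wisconsin breast cancer data bundled with scikit-learn:

        from sklearn.datasets import load_breast_cancer
        from sklearn.feature_selection import SelectKBest, mutual_info_classif
        from sklearn.model_selection import train_test_split
        from sklearn.neural_network import MLPClassifier
        from sklearn.pipeline import make_pipeline
        from sklearn.preprocessing import StandardScaler

        X, y = load_breast_cancer(return_X_y=True)  # UCI Wisconsin dataset
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

        clf = make_pipeline(
            StandardScaler(),
            SelectKBest(mutual_info_classif, k=10),  # stand-in for the reduct
            MLPClassifier(hidden_layer_sizes=(20,), max_iter=2000, random_state=0),
        )
        clf.fit(X_tr, y_tr)
        print(f"test accuracy: {clf.score(X_te, y_te):.3f}")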

  7. An assessment of differences in gridded precipitation datasets in complex terrain

    Science.gov (United States)

    Henn, Brian; Newman, Andrew J.; Livneh, Ben; Daly, Christopher; Lundquist, Jessica D.

    2018-01-01

    Hydrologic modeling and other geophysical applications are sensitive to precipitation forcing data quality, and there are known challenges in spatially distributing gauge-based precipitation over complex terrain. We conduct a comparison of six high-resolution, daily and monthly gridded precipitation datasets over the Western United States. We compare long-term average spatial patterns and interannual variability of water-year total precipitation, as well as multi-year trends in precipitation across the datasets. We find that the greatest absolute differences among datasets occur in high-elevation areas and in the maritime mountain ranges of the Western United States, while the greatest percent differences among datasets relative to annual total precipitation occur in arid and rain-shadowed areas. Differences between datasets in some high-elevation areas exceed 200 mm yr-1 on average, and relative differences range from 5 to 60% across the Western United States. In areas of high topographic relief, true uncertainties and biases are likely higher than the differences among the datasets; we present evidence of this based on streamflow observations. Precipitation trends in the datasets differ in magnitude and sign at smaller scales, and are sensitive to how temporal inhomogeneities in the underlying precipitation gauge data are handled.

  8. Assessment of NASA's Physiographic and Meteorological Datasets as Input to HSPF and SWAT Hydrological Models

    Science.gov (United States)

    Alacron, Vladimir J.; Nigro, Joseph D.; McAnally, William H.; OHara, Charles G.; Engman, Edwin Ted; Toll, David

    2011-01-01

    This paper documents the use of simulated Moderate Resolution Imaging Spectroradiometer land use/land cover (MODIS-LULC), NASA-LIS generated precipitation and evapotranspiration (ET), and Shuttle Radar Topography Mission (SRTM) datasets (in conjunction with standard land use, topographical and meteorological datasets) as input to hydrological models routinely used by the watershed hydrology modeling community. The study focuses on coastal watersheds on the Mississippi Gulf Coast, although one of the test cases concerns an inland watershed in the northeastern part of the State of Mississippi, USA. The decision support tools (DSTs) into which the NASA datasets were assimilated were the Soil and Water Assessment Tool (SWAT) and the Hydrological Simulation Program FORTRAN (HSPF). These DSTs are endorsed by several US government agencies (EPA, FEMA, USGS) for water resources management strategies. These models use physiographic and meteorological data extensively. Precipitation gages and USGS gage stations in the region were used to calibrate several HSPF and SWAT model applications. Land use and topographical datasets were swapped to assess model output sensitivities. NASA-LIS meteorological data were introduced into the calibrated model applications to simulate watershed hydrology for a period in which no weather data were available (1997-2006). The performance of the NASA datasets in the context of hydrological modeling was assessed through comparison of measured and model-simulated hydrographs. Overall, NASA datasets were as useful as standard land use, topographical, and meteorological datasets. Moreover, NASA datasets supported analyses that the standard datasets could not, e.g., the introduction of land use dynamics into hydrological simulations.

  9. FASTQSim: platform-independent data characterization and in silico read generation for NGS datasets.

    Science.gov (United States)

    Shcherbina, Anna

    2014-08-15

    High-throughput next generation sequencing technologies have enabled rapid characterization of clinical and environmental samples. Consequently, the largest bottleneck to actionable data has become sample processing and bioinformatics analysis, creating a need for accurate and rapid algorithms to process genetic data. Perfectly characterized in silico datasets are a useful tool for evaluating the performance of such algorithms. Whereas real sequenced mixtures of organisms contain background contaminating organisms, in silico samples provide exact ground truth. To be of greatest value for evaluating algorithms, in silico data should mimic actual sequencer data as closely as possible. FASTQSim is a tool that provides the dual functionality of NGS dataset characterization and metagenomic data generation. FASTQSim is sequencing-platform independent, and computes distributions of read length, quality scores, indel rates, single point mutation rates, indel size, and similar statistics for any sequencing platform. To create training or testing datasets, FASTQSim can convert target sequences into in silico reads with the specific error profiles obtained in the characterization step. FASTQSim enables users to assess the quality of NGS datasets. The tool provides information about read length, read quality, repetitive and non-repetitive indel profiles, and single base pair substitutions. FASTQSim allows the user to simulate individual read datasets that can be used as standardized test scenarios for planning sequencing projects or for benchmarking metagenomic software. In this regard, in silico datasets generated with FASTQSim hold several advantages over natural datasets: they are sequencing-platform independent, extremely well characterized, and less expensive to generate. Such datasets are valuable in a number of applications, including the training of assemblers for multiple platforms, benchmarking bioinformatics algorithm performance, and creating challenge…
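
    A toy version of the characterize-then-generate idea described above: draw a read from a reference sequence and apply a substitution-error profile. The function name and rates are invented for illustration; FASTQSim's actual profiles also cover indels, indel sizes and quality scores:

        import random

        def simulate_read(ref, start, length, sub_rate=0.01, seed=None):
            """Extract a read from `ref` and apply random base substitutions
            at `sub_rate`, a toy stand-in for profile-driven read generation."""
            rng = random.Random(seed)
            bases = "ACGT"
            read = list(ref[start:start + length])
            for i, b in enumerate(read):
                if rng.random() < sub_rate:
                    read[i] = rng.choice([x for x in bases if x != b])
            return "".join(read)

        ref = "ACGT" * 50
        print(simulate_read(ref, 10, 36, sub_rate=0.05, seed=1))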

  10. Spatially-explicit estimation of geographical representation in large-scale species distribution datasets.

    Science.gov (United States)

    Kalwij, Jesse M; Robertson, Mark P; Ronk, Argo; Zobel, Martin; Pärtel, Meelis

    2014-01-01

    Much ecological research relies on existing multispecies distribution datasets. Such datasets, however, can vary considerably in quality, extent, resolution or taxonomic coverage. We provide a framework for a spatially-explicit evaluation of geographical representation within large-scale species distribution datasets, using the comparison of an occurrence atlas with a range atlas dataset as a working example. Specifically, we compared occurrence maps for 3773 taxa from the widely-used Atlas Florae Europaeae (AFE) with digitised range maps for 2049 taxa of the lesser-known Atlas of North European Vascular Plants. We calculated the level of agreement at a 50-km spatial resolution using average latitudinal and longitudinal species range, and area of occupancy. Agreement in species distribution was calculated and mapped using the Jaccard similarity index and a reduced major axis (RMA) regression analysis of species richness, between the entire atlases (5221 taxa in total) and between co-occurring species (601 taxa). We found no difference in distribution ranges or in the area-of-occupancy frequency distribution, indicating that the atlases overlapped sufficiently for a valid comparison. The similarity index map showed high levels of agreement for central, western, and northern Europe. The RMA regression confirmed that geographical representation of AFE was low in areas with a sparse data recording history (e.g., Russia, Belarus and Ukraine). For co-occurring species in south-eastern Europe, however, the Atlas of North European Vascular Plants showed remarkably higher richness estimates. Geographical representation of atlas data can be much more heterogeneous than often assumed. The level of agreement between datasets can be used to evaluate geographical representation within datasets. Merging atlases into a single dataset is worthwhile in spite of methodological differences, and helps to fill gaps in our knowledge of species distribution ranges. Species distribution…
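
    The cell-level agreement measure used above is straightforward to compute; a minimal sketch, with hypothetical 50-km grid-cell IDs standing in for the atlas grids:

        def jaccard(cells_a, cells_b):
            """Jaccard similarity between two sets of occupied grid cells."""
            a, b = set(cells_a), set(cells_b)
            return len(a & b) / len(a | b) if (a | b) else float("nan")

        # Hypothetical occupied-cell IDs for one species in each atlas
        afe_cells = {(10, 4), (10, 5), (11, 5)}
        nevp_cells = {(10, 5), (11, 5), (12, 6)}
        print(jaccard(afe_cells, nevp_cells))  # 0.5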

  11. Science Fiction and Science Education.

    Science.gov (United States)

    Cavanaugh, Terence

    2002-01-01

    Uses science fiction films such as "Jurassic Park" or "Anaconda" to teach science concepts while fostering student interest. Advocates science fiction as a teaching tool to improve learning and motivation. Describes how to use science fiction in the classroom with the sample activity Twister. (YDS)

  12. Leveraging High Resolution Topography for Education and Outreach: Updates to OpenTopography to make EarthScope and Other Lidar Datasets more Prominent in Geoscience Education

    Science.gov (United States)

    Kleber, E.; Crosby, C. J.; Arrowsmith, R.; Robinson, S.; Haddad, D. E.

    2013-12-01

    The use of Light Detection and Ranging (lidar)-derived topography has become an indispensable tool in Earth science research. The collection of high-resolution lidar topography from an airborne or terrestrial platform allows landscapes and landforms to be represented at sub-meter resolution and in three dimensions. In addition to its high value for scientific research, lidar-derived topography has tremendous potential as a tool for Earth science education. Recent science education initiatives and a community call for access to research-level data make the time ripe to expose lidar data and derived data products as a teaching tool. High-resolution topographic data support several Disciplinary Core Ideas (DCIs) of the Next Generation Science Standards (NGSS, 2013), present the respective Big Ideas of the new community-driven Earth Science Literacy Initiative (ESLI, 2009), and address a number of National Science Education Standards (NSES, 1996) and Benchmarks for Science Literacy (AAAS, 1993) for undergraduate physical and environmental Earth science classes. The spatial context of lidar data complements concepts such as visualization, place-based learning, inquiry-based teaching and active learning that are essential to teaching in the geosciences. As official host to EarthScope lidar datasets for tectonically active areas in the western United States, the NSF-funded OpenTopography facility provides user-friendly access to a wealth of data that is easily incorporated into Earth science educational materials. OpenTopography (www.opentopography.org), in collaboration with EarthScope, has developed education and outreach activities to foster teacher, student and researcher utilization of lidar data. These educational resources use lidar data coupled with free tools such as Google Earth to provide a means for students and the interested public to visualize and explore Earth's surface in an interactive manner not possible with most other remotely sensed imagery. The…

  13. Climate Prediction Center (CPC) Infra-Red (IR) 0.5 degree Dataset

    Data.gov (United States)

    National Oceanic and Atmospheric Administration, Department of Commerce — Climate Prediction Center 0.5 degree IR dataset was created from all available individual geostationary satellite data which have been merged to form nearly seamless...

  14. USGS HYDRoacoustic dataset in support of the Surface Water Oceanographic Topography satellite mission (HYDRoSWOT)

    Data.gov (United States)

    Department of the Interior — HYDRoSWOT – HYDRoacoustic dataset in support of Surface Water Oceanographic Topography – is a data set that aggregates channel and flow data collected from the USGS...

  15. Vector Nonlinear Time-Series Analysis of Gamma-Ray Burst Datasets on Heterogeneous Clusters

    Directory of Open Access Journals (Sweden)

    Ioana Banicescu

    2005-01-01

    The simultaneous analysis of a number of related datasets using a single statistical model is an important problem in statistical computing. A parameterized statistical model is to be fitted on multiple datasets and tested for goodness of fit within a fixed analytical framework, with the aim of reaching definitive conclusions by analyzing the datasets together. This paper proposes a strategy for the efficient execution of this type of analysis on heterogeneous clusters. Based on partitioning processors into groups for efficient communication and on a dynamic loop scheduling approach for load balancing, the strategy addresses the variability of the computational loads of the datasets, as well as the unpredictable irregularities of the cluster environment. Results from preliminary tests of using this strategy to fit gamma-ray burst time profiles with vector functional coefficient autoregressive models on 64 processors of a general-purpose Linux cluster demonstrate the effectiveness of the strategy.
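
    A schematic of the "one model, many datasets" pattern described above: fit the same parameterized model independently to many profiles in parallel. Here Python's multiprocessing stands in for the paper's processor groups and dynamic loop scheduling, and a polynomial least-squares fit stands in for the vector functional-coefficient autoregressive model:

        from multiprocessing import Pool

        import numpy as np

        def fit_one(dataset):
            """Fit one dataset; placeholder model, not the paper's VFCAR model."""
            t, y = dataset
            coeffs = np.polyfit(t, y, deg=3)
            resid = y - np.polyval(coeffs, t)
            return coeffs, float(np.sum(resid ** 2))  # goodness-of-fit proxy

        if __name__ == "__main__":
            rng = np.random.default_rng(0)
            datasets = [(np.linspace(0, 1, 200), rng.normal(size=200))
                        for _ in range(64)]  # synthetic "burst profiles"
            with Pool() as pool:
                results = pool.map(fit_one, datasets)
            print(min(r[1] for r in results))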

  16. Dataset of Passerine bird communities in a Mediterranean high mountain (Sierra Nevada, Spain)

    Science.gov (United States)

    Pérez-Luque, Antonio Jesús; Barea-Azcón, José Miguel; Álvarez-Ruiz, Lola; Bonet-García, Francisco Javier; Zamora, Regino

    2016-01-01

    In this data paper, we describe a dataset of passerine bird communities in Sierra Nevada, a Mediterranean high mountain located in southern Spain. The dataset includes occurrence data from bird surveys conducted in four representative ecosystem types of Sierra Nevada from 2008 to 2015. For each visit, bird species numbers as well as distance to the transect line were recorded. A total of 27847 occurrence records were compiled, with accompanying measurements of distance to the transect and animal counts. All records are of species in the order Passeriformes. Records of 16 different families and 44 genera were collected. Some of the taxa in the dataset are included in the European Red List. This dataset belongs to the Sierra Nevada Global-Change Observatory (OBSNEV), a long-term research project designed to compile socio-ecological information on the major ecosystem types in order to identify the impacts of global change in this area.

  17. Telephone Interpreter Services (TIS)-Asian and Pacific Islander (API) Language Yearly Dataset

    Data.gov (United States)

    Social Security Administration — This dataset displays our national TIS call volume for over 45 API languages for fiscal year 2011 onward. A fiscal year runs from October through September. We will...

  18. Telephone Interpreter Services (TIS) - Asian and Pacific Islander (API) Language Fiscal Year Quarterly Dataset

    Data.gov (United States)

    Social Security Administration — This dataset displays our quarterly national TIS call volume for over 45 API languages for fiscal year 2013 onward. A fiscal year runs from October through September...

  19. Dataset of Passerine bird communities in a Mediterranean high mountain (Sierra Nevada, Spain).

    Science.gov (United States)

    Pérez-Luque, Antonio Jesús; Barea-Azcón, José Miguel; Álvarez-Ruiz, Lola; Bonet-García, Francisco Javier; Zamora, Regino

    2016-01-01

    In this data paper, we describe a dataset of passerine bird communities in Sierra Nevada, a Mediterranean high mountain located in southern Spain. The dataset includes occurrence data from bird surveys conducted in four representative ecosystem types of Sierra Nevada from 2008 to 2015. For each visit, bird species numbers as well as distance to the transect line were recorded. A total of 27847 occurrence records were compiled, with accompanying measurements of distance to the transect and animal counts. All records are of species in the order Passeriformes. Records of 16 different families and 44 genera were collected. Some of the taxa in the dataset are included in the European Red List. This dataset belongs to the Sierra Nevada Global-Change Observatory (OBSNEV), a long-term research project designed to compile socio-ecological information on the major ecosystem types in order to identify the impacts of global change in this area.

  20. A multimodal dataset for authoring and editing multimedia content: The MAMEM project

    Directory of Open Access Journals (Sweden)

    Spiros Nikolopoulos

    2017-12-01

    We present a dataset that combines multimodal biosignals and eye tracking information gathered under a human-computer interaction framework. The dataset was developed within the MAMEM project, which aims to endow people with motor disabilities with the ability to edit and author multimedia content through mental commands and gaze activity. The dataset includes EEG, eye-tracking, and physiological (GSR and heart rate) signals collected from 34 individuals (18 able-bodied and 16 motor-impaired). Data were collected during interaction with a specifically designed interface for web browsing and multimedia content manipulation, and during imaginary movement tasks. The presented dataset will contribute towards the development and evaluation of modern human-computer interaction systems that would foster the reintegration of people with severe motor impairments into society.

  1. Toxics Release Inventory Chemical Hazard Information Profiles (TRI-CHIP) Dataset

    Data.gov (United States)

    U.S. Environmental Protection Agency — The Toxics Release Inventory (TRI) Chemical Hazard Information Profiles (TRI-CHIP) dataset contains hazard information about the chemicals reported in TRI. Users can...

  2. Integration of geophysical datasets by a conjoint probability tomography approach: application to Italian active volcanic areas

    Directory of Open Access Journals (Sweden)

    D. Patella

    2008-06-01

    We expand the theory of probability tomography to the integration of different geophysical datasets. The aim of the new method is to improve the information quality using a conjoint occurrence probability function designed to highlight the existence of common sources of anomalies. The new method is tested on gravity, magnetic and self-potential datasets collected in the volcanic area of Mt. Vesuvius (Naples), and on gravity and dipole geoelectrical datasets collected in the volcanic area of Mt. Etna (Sicily). The application demonstrates that, from a probabilistic point of view, the integrated analysis can delineate the signature of some important volcanic targets better than the analysis of the tomographic image of each dataset considered separately.

  3. Potential Impacts of Climate Change on World Food Supply: Datasets from a Major Crop Modeling Study

    Data.gov (United States)

    National Aeronautics and Space Administration — Datasets from a Major Crop Modeling Study contain projected country and regional changes in grain crop yields due to global climate change. Equilibrium and transient...

  4. CoVennTree: A new method for the comparative analysis of large datasets

    Directory of Open Access Journals (Sweden)

    Steffen C. Lott

    2015-02-01

    The visualization of massive datasets, such as those resulting from comparative metatranscriptome analyses or the analysis of microbial population structures using ribosomal RNA sequences, is a challenging task. We developed a new method called CoVennTree (Comparative weighted Venn Tree) that simultaneously compares up to three multifarious datasets by aggregating and propagating information from the bottom to the top level, and produces a graphical output in Cytoscape. With the introduction of weighted Venn structures, the contents and relationships of various datasets can be correlated and simultaneously aggregated without losing information. We demonstrate the suitability of this approach using a dataset of 16S rDNA sequences obtained from microbial populations at three different depths of the Gulf of Aqaba in the Red Sea. CoVennTree has been integrated into the Galaxy ToolShed and can be directly downloaded and integrated into a user's Galaxy instance.

  5. Statistical exploration of dataset examining key indicators influencing housing and urban infrastructure investments in megacities

    Directory of Open Access Journals (Sweden)

    Adedeji O. Afolabi

    2018-06-01

    By UN standards, Lagos has attained megacity status, with the attendant challenge of living up to that titanic position; regrettably, it struggles with its present stock of housing and infrastructural facilities to match its new status. A questionnaire instrument was used to gather the dataset, based on a survey of the perceptions of construction professionals residing within the state. The statistical exploration contains data on the state of the housing and urban infrastructure deficit, key indicators spurring government investment to reverse the deficit, and improvement mechanisms to tackle the infrastructural dearth. Descriptive and inferential statistics were used to present the dataset. When analyzed, the dataset can be useful for policy makers, local and international governments, world funding bodies, researchers and infrastructure investors. Keywords: Construction, Housing, Megacities, Population, Urban infrastructures

  6. EPA Office of Water (OW): Fish Consumption Advisories and Fish Tissue Sampling Stations NHDPlus Indexed Datasets

    Data.gov (United States)

    U.S. Environmental Protection Agency — The Fish Consumption Advisories dataset contains information on Fish Advisory events that have been indexed to the EPA Office of Water NHDPlus v2.1 hydrology and...

  7. LenoxKaplan_Role of natural gas in meeting electric sector emissions reduction strategy_dataset

    Data.gov (United States)

    U.S. Environmental Protection Agency — This dataset is for an analysis that used the MARKAL linear optimization model to compare the carbon emissions profiles and system-wide global warming potential of...

  8. National Hydrography Dataset Plus High Resolution (NHDPlus HR) - USGS National Map Downloadable Data Collection

    Data.gov (United States)

    U.S. Geological Survey, Department of the Interior — The High Resolution National Hydrography Dataset Plus (NHDPlus HR) is an integrated set of geospatial data layers, including the best available National Hydrography...

  9. Computational Methods for Large Spatio-temporal Datasets and Functional Data Ranking

    KAUST Repository

    Huang, Huang

    2017-01-01

    …that are both computationally and statistically efficient. We explore the improvement of the approximation theoretically and investigate the performance by simulations. For real applications, we analyze a soil moisture dataset with 2 million measurements…

  10. Accounting for inertia in modal choices: some new evidence using a RP/SP dataset

    DEFF Research Database (Denmark)

    Cherchi, Elisabetta; Manca, Francesco

    2011-01-01

    …effect is stable along the SP experiments. Inertia has been studied more extensively with panel datasets, but few investigations have used RP/SP datasets. In this paper we extend previous work in several ways. We test and compare several ways of measuring inertia, including measures that have been proposed for both short and long RP panel datasets. We also explore new measures of inertia to test for the effect of "learning" (in the sense of acquiring experience or becoming more familiar) along the SP experiment, and we disentangle this effect from the pure inertia effect. A mixed logit model is used that allows us to account for both systematic and random taste variations in the inertia effect and for correlations among RP and SP observations. Finally, we explore the relation between the utility specification (especially in the SP dataset) and the role of inertia in explaining current choices.

  11. MULTI-LABEL ASRS DATASET CLASSIFICATION USING SEMI-SUPERVISED SUBSPACE CLUSTERING

    Data.gov (United States)

    National Aeronautics and Space Administration — MULTI-LABEL ASRS DATASET CLASSIFICATION USING SEMI-SUPERVISED SUBSPACE CLUSTERING MOHAMMAD SALIM AHMED, LATIFUR KHAN, NIKUNJ OZA, AND MANDAVA RAJESWARI Abstract....

  12. Incorporated City Boundaries in Iowa in 2010 as Derived from the Census Places Dataset

    Data.gov (United States)

    Iowa State University GIS Support and Research Facility — Incorporated Cities in Iowa in 2010, as derived from the Census Places dataset. Original abstract: The TIGER/Line Files are shapefiles and related database files...

  13. AFSC/REFM: Seabird food habits dataset of the North Pacific

    Data.gov (United States)

    National Oceanic and Atmospheric Administration, Department of Commerce — The seabird food habits dataset contains information on the stomach contents from seabird specimens that were collected under salvage and scientific collection...

  14. IPCC IS92 Emissions Scenarios (A, B, C, D, E, F) Dataset Version 1.1

    Data.gov (United States)

    National Aeronautics and Space Administration — The Intergovernmental Panel on Climate Change (IPCC) IS92 Emissions Scenarios (A, B, C, D, E, F) Dataset Version 1.1 consists of six global and regional greenhouse...

  15. Dataset for Testing Contamination Source Identification Methods for Water Distribution Networks

    Data.gov (United States)

    U.S. Environmental Protection Agency — This dataset includes the results of a simulation study using the source inversion techniques available in the Water Security Toolkit. The data was created to test...

  16. EPA Office of Water (OW): 2002 Impaired Waters Baseline NHDPlus Indexed Dataset

    Data.gov (United States)

    U.S. Environmental Protection Agency — This dataset consists of geospatial and attribute data identifying the spatial extent of state-reported impaired waters (EPA's Integrated Reporting categories 4a,...

  17. Datasets used in ORD-018902: Bisphenol A alternatives can effectively substitute for estradiol

    Data.gov (United States)

    U.S. Environmental Protection Agency — Gene Expression Omnibus numbers only. This dataset is associated with the following publication: Mesnage, R., A. Phedonos, M. Arno, S. Balu, C. Corton, and M....

  18. Evaluation of Modified Categorical Data Fuzzy Clustering Algorithm on the Wisconsin Breast Cancer Dataset

    Directory of Open Access Journals (Sweden)

    Amir Ahmad

    2016-01-01

    The early diagnosis of breast cancer is an important step in the fight against the disease. Machine learning techniques have shown promise in improving our understanding of the disease. As medical datasets consist of data points which cannot be precisely assigned to a class, fuzzy methods have been useful for studying these datasets. Breast cancer datasets are sometimes described by categorical features, and many fuzzy clustering algorithms have been developed for categorical datasets. However, most of these methods use the Hamming distance to define the distance between two categorical feature values. In this paper, we use a probabilistic distance measure to compute the distance between a pair of categorical feature values. Experiments demonstrate that this distance measure performs better than the Hamming distance on the Wisconsin breast cancer data.
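
    One simple way to realize a probabilistic distance between two categorical values, in contrast to the 0/1 Hamming distance, is the total-variation distance between their conditional co-occurrence profiles over another attribute. This is a simplification for illustration; the paper's measure may differ in detail, and the column names below are invented:

        import numpy as np
        import pandas as pd

        def prob_distance(df, attr, v1, v2, other):
            """Total-variation distance (in [0, 1]) between the distributions
            of `other` conditional on attr == v1 versus attr == v2."""
            p1 = df.loc[df[attr] == v1, other].value_counts(normalize=True)
            p2 = df.loc[df[attr] == v2, other].value_counts(normalize=True)
            support = p1.index.union(p2.index)
            p1 = p1.reindex(support, fill_value=0)
            p2 = p2.reindex(support, fill_value=0)
            return 0.5 * float(np.abs(p1 - p2).sum())

        df = pd.DataFrame({"shape": ["round", "round", "oval", "oval", "round"],
                           "class": ["benign", "benign", "malign", "malign", "malign"]})
        print(prob_distance(df, "shape", "round", "oval", "class"))  # ~0.667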

  19. Multivariate Analysis of Multiple Datasets: a Practical Guide for Chemical Ecology.

    Science.gov (United States)

    Hervé, Maxime R; Nicolè, Florence; Lê Cao, Kim-Anh

    2018-03-01

    Chemical ecology has strong links with metabolomics, the large-scale study of all metabolites detectable in a biological sample. Consequently, chemical ecologists are often challenged by the statistical analysis of such large datasets. This holds especially true when the purpose is to integrate multiple datasets to obtain a holistic view and a better understanding of a biological system under study. The present article provides a comprehensive resource for analyzing such complex datasets using multivariate methods. It ranges from the necessary pre-treatment of data, including data transformations and distance calculations, to the application of both gold-standard and novel multivariate methods for the integration of different omics data. We illustrate the process of analysis, along with detailed interpretations of the results, for six issues representative of the different types of biological questions encountered by chemical ecologists. We provide the necessary knowledge and tools, with reproducible R code and chemical-ecological datasets, to practice and teach multivariate methods.

  20. New fuzzy support vector machine for the class imbalance problem in medical datasets classification.

    Science.gov (United States)

    Gu, Xiaoqing; Ni, Tongguang; Wang, Hongyuan

    2014-01-01

    In medical dataset classification, the support vector machine (SVM) is considered one of the most successful methods. However, most real-world medical datasets contain outliers/noise, and the data often suffer from class imbalance. In this paper, a fuzzy support vector machine (FSVM) for the class imbalance problem (called FSVM-CIP) is presented; it can be seen as a modified class of FSVM obtained by extending manifold regularization and assigning two different misclassification costs to the two classes. The proposed FSVM-CIP can be used to handle the class imbalance problem in the presence of outliers/noise, and enhances the locality maximum margin. Five real-world medical datasets, breast, heart, hepatitis, BUPA liver, and Pima diabetes, from the UCI medical database are employed to illustrate the method presented in this paper. Experimental results on these datasets show that FSVM-CIP outperforms, or is comparable to, existing methods.
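
    A loose stand-in for two of the FSVM-CIP ingredients using scikit-learn: per-class misclassification costs via class_weight, and fuzzy memberships (down-weighting likely outliers) via sample_weight. This is not the paper's exact formulation, which also involves manifold regularization; the membership heuristic below is invented for illustration:

        import numpy as np
        from sklearn.datasets import make_classification
        from sklearn.svm import SVC

        # Imbalanced synthetic data: ~90% majority class, ~10% minority class
        X, y = make_classification(n_samples=400, weights=[0.9, 0.1], random_state=0)

        # Fuzzy membership heuristic: down-weight points far from their class centroid
        centroids = {c: X[y == c].mean(axis=0) for c in np.unique(y)}
        d = np.array([np.linalg.norm(x - centroids[c]) for x, c in zip(X, y)])
        membership = 1.0 - 0.9 * (d - d.min()) / (d.max() - d.min())

        clf = SVC(kernel="rbf", class_weight="balanced")  # two misclassification costs
        clf.fit(X, y, sample_weight=membership)
        print(f"training accuracy: {clf.score(X, y):.3f}")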

  1. New Fuzzy Support Vector Machine for the Class Imbalance Problem in Medical Datasets Classification

    Directory of Open Access Journals (Sweden)

    Xiaoqing Gu

    2014-01-01

    In medical dataset classification, the support vector machine (SVM) is considered one of the most successful methods. However, most real-world medical datasets contain outliers/noise, and the data often suffer from class imbalance. In this paper, a fuzzy support vector machine (FSVM) for the class imbalance problem (called FSVM-CIP) is presented; it can be seen as a modified class of FSVM obtained by extending manifold regularization and assigning two different misclassification costs to the two classes. The proposed FSVM-CIP can be used to handle the class imbalance problem in the presence of outliers/noise, and enhances the locality maximum margin. Five real-world medical datasets, breast, heart, hepatitis, BUPA liver, and Pima diabetes, from the UCI medical database are employed to illustrate the method presented in this paper. Experimental results on these datasets show that FSVM-CIP outperforms, or is comparable to, existing methods.

  2. A Dataset of Aerial Survey Counts of Harbor Seals in Iliamna Lake, Alaska: 1984-2013

    Data.gov (United States)

    National Oceanic and Atmospheric Administration, Department of Commerce — This dataset provides counts of harbor seals from aerial surveys over Iliamna Lake, Alaska, USA. The data have been collated from three previously published sources...

  3. Political Township and Incorporated City Boundaries in Iowa in 2010 as Derived from Census Datasets

    Data.gov (United States)

    Iowa State University GIS Support and Research Facility — Political Township and Incorporated City Boundaries in Iowa in 2010, as Derived from Census Datasets Original Abstract: The TIGER/Line Files are shapefiles and...

  4. Global-scale evaluation of 22 precipitation datasets using gauge observations and hydrological modeling

    Directory of Open Access Journals (Sweden)

    H. E. Beck

    2017-12-01

    We undertook a comprehensive evaluation of 22 gridded, (quasi-)global, (sub-)daily precipitation (P) datasets for the period 2000–2016. Thirteen non-gauge-corrected P datasets were evaluated using daily P gauge observations from 76 086 gauges worldwide. Another nine gauge-corrected datasets were evaluated using hydrological modeling, by calibrating the HBV conceptual model against streamflow records for each of 9053 small to medium-sized (< 50 000 km2) catchments worldwide, and comparing the resulting performance. Marked differences in spatio-temporal patterns and accuracy were found among the datasets. Among the uncorrected P datasets, the satellite- and reanalysis-based MSWEP-ng V1.2 and V2.0 datasets generally showed the best temporal correlations with the gauge observations, followed by the reanalyses (ERA-Interim, JRA-55, and NCEP-CFSR) and the satellite- and reanalysis-based CHIRP V2.0 dataset, the estimates based primarily on passive microwave remote sensing of rainfall (CMORPH V1.0, GSMaP V5/6, and TMPA 3B42RT V7) or near-surface soil moisture (SM2RAIN-ASCAT), and finally, estimates based primarily on thermal infrared imagery (GridSat V1.0, PERSIANN, and PERSIANN-CCS). Two of the three reanalyses (ERA-Interim and JRA-55) unexpectedly obtained lower trend errors than the satellite datasets. Among the corrected P datasets, the ones directly incorporating daily gauge data (CPC Unified, and MSWEP V1.2 and V2.0) generally provided the best calibration scores, although the good performance of the fully gauge-based CPC Unified is unlikely to translate to sparsely gauged or ungauged regions. The next best results were obtained with P estimates directly incorporating temporally coarser gauge data (CHIRPS V2.0, GPCP-1DD V1.2, TMPA 3B42 V7, and WFDEI-CRU), which in turn outperformed the one indirectly incorporating gauge data through another multi-source dataset (PERSIANN-CDR V1R1). Our results highlight large differences in estimation accuracy…

  5. The Lunar Source Disk: Old Lunar Datasets on a New CD-ROM

    Science.gov (United States)

    Hiesinger, H.

    1998-01-01

    A compilation of previously published datasets on CD-ROM is presented. This Lunar Source Disk is intended to be a first step in the improvement/expansion of the Lunar Consortium Disk, in order to create an "image-cube"-like data pool that can be easily accessed and might be useful for a variety of future lunar investigations. All datasets were transformed to a standard map projection that allows direct comparison of different types of information on a pixel-by-pixel basis. Lunar observations have a long history and have been important to mankind for centuries, notably since the work of Plutarch and Galileo. As a consequence of centuries of lunar investigations, knowledge of the characteristics and properties of the Moon has accumulated over time. However, a side effect of this accumulation is that it has become more and more complicated for scientists to review all the datasets obtained through different techniques, to interpret them properly, to recognize their weaknesses and strengths in detail, and to combine them synoptically in geologic interpretations. Such synoptic geologic interpretations are crucial for the study of planetary bodies through remote-sensing data in order to avoid misinterpretation. In addition, many of the modern datasets, derived from Earth-based telescopes as well as from spacecraft missions, are acquired at different geometric and radiometric conditions. These differences make it challenging to compare or combine datasets directly or to extract information from different datasets on a pixel-by-pixel basis. Also, as there is no convention for the presentation of lunar datasets, different authors choose different map projections, depending on the location of the investigated areas and their personal interests. Insufficient or incomplete information on the map parameters used by different authors further complicates the reprojection of these datasets to a standard geometry. The goal of our efforts was to transfer previously published lunar…

  6. EVALUATION OF LAND USE/LAND COVER DATASETS FOR URBAN WATERSHED MODELING

    International Nuclear Information System (INIS)

    S.J. BURIAN; M.J. BROWN; T.N. MCPHERSON

    2001-01-01

    Land use/land cover (LULC) data are a vital component for nonpoint source pollution modeling. Most watershed hydrology and pollutant loading models use, in some capacity, LULC information to generate runoff and pollutant loading estimates. Simple equation methods predict runoff and pollutant loads using runoff coefficients or pollutant export coefficients that are often correlated to LULC type. Complex models use input variables and parameters to represent watershed characteristics and pollutant buildup and washoff rates as a function of LULC type. Whether using simple or complex models, an accurate LULC dataset with an appropriate spatial resolution and level of detail is paramount for reliable predictions. The study presented in this paper compared and evaluated several LULC dataset sources for application in urban environmental modeling. The commonly used USGS LULC datasets have coarser spatial resolution and lower levels of classification than other LULC datasets. In addition, the USGS datasets do not accurately represent the land use in areas that have undergone significant land use change during the past two decades. We performed a watershed modeling analysis of three urban catchments in Los Angeles, California, USA to investigate the relative difference in average annual runoff volumes and total suspended solids (TSS) loads when using the USGS LULC dataset versus using a more detailed and current LULC dataset. When the two LULC datasets were aggregated to the same land use categories, the relative differences in predicted average annual runoff volumes and TSS loads from the three catchments were 8 to 14% and 13 to 40%, respectively. The relative differences did not have a predictable relationship with catchment size.
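
    The "simple equation" approach mentioned above reduces to summing a LULC-dependent runoff coefficient times rainfall times area over the catchment's LULC classes; a minimal sketch, with illustrative (uncalibrated) coefficients and an invented example catchment:

        # Illustrative runoff coefficients by LULC class (not calibrated values)
        RUNOFF_COEFF = {"residential": 0.45, "commercial": 0.80, "open_space": 0.15}

        def annual_runoff_m3(lulc_areas_m2, rainfall_m):
            """Sum C_i * P * A_i over LULC classes; result in cubic metres."""
            return sum(RUNOFF_COEFF[lulc] * rainfall_m * area
                       for lulc, area in lulc_areas_m2.items())

        catchment = {"residential": 2.0e6, "commercial": 5.0e5, "open_space": 1.5e6}
        print(f"{annual_runoff_m3(catchment, 0.38):.0f} m^3/yr")  # 579500 m^3/yr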

  7. RE-Europe, a large-scale dataset for modeling a highly renewable European electricity system

    DEFF Research Database (Denmark)

    Jensen, Tue Vissing; Pinson, Pierre

    2017-01-01

    …we describe a dedicated large-scale dataset for a renewable electric power system. The dataset combines a transmission network model with information on generation and demand. Generation includes conventional generators, with their technical and economic characteristics, as well as weather-driven … to the evaluation, scaling analysis and replicability check of a wealth of proposals in, e.g., market design, network actor coordination and forecasting of renewable power generation.

  8. Do Higher Government Wages Reduce Corruption? Evidence Based on a Novel Dataset

    OpenAIRE

    Le, Van-Ha; de Haan, Jakob; Dietzenbacher, Erik

    2013-01-01

    This paper employs a novel dataset on government wages to investigate the relationship between government remuneration policy and corruption. Our dataset, as derived from national household or labor surveys, is more reliable than the data on government wages as used in previous research. When the relationship between government wages and corruption is modeled to vary with the level of income, we find that the impact of government wages on corruption is strong at relatively low-income levels.

  9. Dataset of Phenology of Mediterranean high-mountain meadows flora (Sierra Nevada, Spain)

    OpenAIRE

    Antonio Jesús Pérez-Luque; Cristina Patricia Sánchez-Rojas; Regino Zamora; Ramón Pérez-Pérez; Francisco Javier Bonet

    2015-01-01

    The Sierra Nevada mountain range (southern Spain) hosts a high number of endemic plant species and is one of the most important biodiversity hotspots in the Mediterranean basin. The high-mountain meadow ecosystems (borreguiles) harbour a large number of endemic and threatened plant species. In this data paper, we describe a dataset of the flora inhabiting this threatened ecosystem in this Mediterranean mountain. The dataset includes occurrence data for flora collected in those ecosystems...

  10. Creating a Regional MODIS Satellite-Driven Net Primary Production Dataset for European Forests

    OpenAIRE

    Neumann, Mathias; Moreno, Adam; Thurnher, Christopher; Mues, Volker; Härkönen, Sanna; Mura, Matteo; Bouriaud, Olivier; Lang, Mait; Cardellini, Giuseppe; Thivolle-Cazat, Alain; Bronisz, Karol; Merganic, Jan; Alberdi, Iciar; Astrup, Rasmus; Mohren, Frits

    2016-01-01

    Net primary production (NPP) is an important ecological metric for studying forest ecosystems and their carbon sequestration, for assessing the potential supply of food or timber and quantifying the impacts of climate change on ecosystems. The global MODIS NPP dataset using the MOD17 algorithm provides valuable information for monitoring NPP at 1-km resolution. Since coarse-resolution global climate data are used, the global dataset may contain uncertainties for Europe. We used a 1-km daily g...

  11. Creating a Regional MODIS Satellite-Driven Net Primary Production Dataset for European Forests

    Directory of Open Access Journals (Sweden)

    Mathias Neumann

    2016-06-01

    Net primary production (NPP) is an important ecological metric for studying forest ecosystems and their carbon sequestration, for assessing the potential supply of food or timber, and for quantifying the impacts of climate change on ecosystems. The global MODIS NPP dataset using the MOD17 algorithm provides valuable information for monitoring NPP at 1-km resolution. Since coarse-resolution global climate data are used, the global dataset may contain uncertainties for Europe. We used a 1-km daily gridded European climate dataset with the MOD17 algorithm to create the regional NPP dataset MODIS EURO. For evaluation of this new dataset, we compare MODIS EURO with terrestrially driven NPP obtained by analyzing and harmonizing national forest inventory (NFI) data from 196,434 plots in 12 European countries, as well as with the global MODIS NPP dataset, for the years 2000 to 2012. Comparing these three NPP datasets, we found that the global MODIS NPP dataset differs from NFI NPP by 26%, while MODIS EURO differs by only 7%. MODIS EURO also agrees with NFI NPP across scales (from continental and regional to country level) and gradients (elevation, location, tree age, dominant species, etc.). The agreement is particularly good for elevation, dominant species or tree height. This suggests that using improved climate data allows the MOD17 algorithm to provide realistic NPP estimates for Europe. Local discrepancies between MODIS EURO and NFI NPP can be related to differences in stand density due to forest management and to the national carbon estimation methods. With this study, we provide a consistent, temporally continuous and spatially explicit productivity dataset for the years 2000 to 2012 at 1-km resolution, which can be used to assess climate change impacts on ecosystems or the potential biomass supply of European forests for an increasing bio-based economy. MODIS EURO data are made freely available at ftp://palantir.boku.ac.at/Public/MODIS_EURO.

  12. Bayesian Method for Building Frequent Landsat-Like NDVI Datasets by Integrating MODIS and Landsat NDVI

    OpenAIRE

    Limin Liao; Jinling Song; Jindi Wang; Zhiqiang Xiao; Jian Wang

    2016-01-01

    Studies related to vegetation dynamics in heterogeneous landscapes often require Normalized Difference Vegetation Index (NDVI) datasets with both high spatial resolution and frequent coverage, which cannot be satisfied by a single sensor due to technical limitations. In this study, we propose a new method called NDVI-Bayesian Spatiotemporal Fusion Model (NDVI-BSFM) for accurately and effectively building frequent high spatial resolution Landsat-like NDVI datasets by integrating Moderate Resol...

  13. Calculation of climatic reference values and its use for automatic outlier detection in meteorological datasets

    Directory of Open Access Journals (Sweden)

    B. Téllez

    2008-04-01

    The climatic reference values for monthly and annual average air temperature and total precipitation in Catalonia – in the northeast of Spain – are calculated using a combination of statistical methods and geostatistical interpolation techniques. In order to estimate the uncertainty of the method, the initial dataset is split into two parts, used respectively for estimation and validation. The resulting maps are then used for automatic outlier detection in meteorological datasets.
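
    The outlier check described above can be as simple as flagging station values that depart from the interpolated climatic reference by more than k standard deviations; a sketch under that assumption (the threshold k and the example values below are illustrative, not the paper's settings):

        import numpy as np

        def flag_outliers(observed, reference, sigma, k=3.0):
            """Boolean mask of observations outside reference +/- k*sigma."""
            observed, reference, sigma = map(np.asarray, (observed, reference, sigma))
            return np.abs(observed - reference) > k * sigma

        obs = np.array([14.2, 13.8, 25.0])   # monthly mean temperature, deg C
        ref = np.array([14.0, 14.1, 14.3])   # interpolated reference values
        sd  = np.array([1.1, 1.0, 1.2])      # local reference spread
        print(flag_outliers(obs, ref, sd))   # [False False  True]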

  14. Self-Reported Juvenile Firesetting: Results from Two National Survey Datasets

    OpenAIRE

    Howell Bowling, Carrie; Merrick, Joav; Omar, Hatim A.

    2013-01-01

    The main purpose of this study was to address gaps in existing research by examining the relationship between academic performance and attention problems with juvenile firesetting. Two datasets from the Achenbach System for Empirically Based Assessment (ASEBA) were used. The Factor Analysis Dataset (N = 975) was utilized and results indicated that adolescents who report lower academic performance are more likely to set fires. Additionally, adolescents who report a poor attitude toward school ...

  15. Self-reported juvenile firesetting: Results from two national survey datasets

    OpenAIRE

    Carrie Howell Bowling; Joav eMerrick; Joav eMerrick; Joav eMerrick; Joav eMerrick; Hatim A Omar

    2013-01-01

    The main purpose of this study was to address gaps in existing research by examining the relationship between academic performance and attention problems with juvenile firesetting. Two datasets from the Achenbach System for Empirically Based Assessment (ASEBA) were used. The Factor Analysis Dataset (N = 975) was utilized and results indicated that adolescents who report lower academic performance are more likely to set fires. Additionally, adolescents who report a poor attitude toward school...

  16. Landslide triggering thresholds for Switzerland based on a new gridded precipitation dataset

    Science.gov (United States)

    Leonarduzzi, Elena; Molnar, Peter; McArdell, Brian W.

    2017-04-01

    In Switzerland, floods are responsible for most of the damage caused by rainfall-triggered natural hazards (89%), followed by landslides (6%, ca. 520 M Euros), as reported in Hilker et al. (2009) for the period 1972-2007. The prediction of landslide occurrence is particularly challenging because of the wide spatial distribution of landslides and the complex interdependence of predisposing and triggering factors. The overall goal of our research is to develop an Early Warning System for landsliding in Switzerland based on hydrological modelling and rainfall forecasts. To achieve this, we first analyzed rainfall triggering thresholds for landslides from a new gridded daily precipitation dataset for Switzerland (RhiresD, MeteoSwiss), combined with landslide events recorded in the Swiss Damage Database (Hilker et al., 2009). The high-resolution gridded precipitation dataset allows us to collocate rainfall and landslides accurately in space, which is an advantage over many previous studies. Each of the 2272 landslides in the database in the period 1972-2012 was assigned to the corresponding 2x2 km precipitation cell. For each of these cells, precipitation events were defined as series of consecutive rainy days, and the following event parameters were computed: duration (day), maximum and mean daily intensity (mm/day), total rainfall depth (mm) and maximum daily intensity divided by Mean Daily Precipitation (MDP). Events were classified as triggering or non-triggering depending on whether a landslide was recorded in the cell during the event. This classification of observations was compared to predictions based on a threshold for each of the parameters. The predictive power of each parameter and the best threshold value were quantified by ROC analysis and statistics such as the AUC and the True Skill Statistic (TSS). Event parameters based on rainfall intensity were found to have similarly high predictive power (TSS=0.54-0.59, AUC=0.85-0.86), while rainfall duration had a…
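
    The threshold analysis described above can be reproduced schematically: sweep a rainfall threshold, classify events as triggering or non-triggering, and score each threshold with the True Skill Statistic (TSS = hit rate minus false-alarm rate). The data below are synthetic, chosen only to make the sweep visible:

        import numpy as np

        def tss(triggering, non_triggering, threshold):
            """True Skill Statistic for a single intensity threshold."""
            hits = np.mean(triggering >= threshold)           # sensitivity
            false_alarms = np.mean(non_triggering >= threshold)
            return hits - false_alarms

        rng = np.random.default_rng(0)
        trig = rng.gamma(4.0, 15.0, size=200)    # mm/day, landslide-triggering events
        non = rng.gamma(2.0, 10.0, size=5000)    # mm/day, non-triggering events

        thresholds = np.linspace(0, 150, 301)
        scores = [tss(trig, non, t) for t in thresholds]
        best = thresholds[int(np.argmax(scores))]
        print(f"best threshold ~ {best:.1f} mm/day, TSS = {max(scores):.2f}")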

  17. Exudate-based diabetic macular edema detection in fundus images using publicly available datasets

    Energy Technology Data Exchange (ETDEWEB)

    Giancardo, Luca [ORNL; Meriaudeau, Fabrice [ORNL; Karnowski, Thomas Paul [ORNL; Li, Yaquin [University of Tennessee, Knoxville (UTK); Garg, Seema [University of North Carolina; Tobin Jr, Kenneth William [ORNL; Chaum, Edward [University of Tennessee, Knoxville (UTK)

    2011-01-01

    Diabetic macular edema (DME) is a common vision-threatening complication of diabetic retinopathy. In a large-scale screening environment, DME can be assessed by detecting exudates (a type of bright lesion) in fundus images. In this work, we introduce a new methodology for the diagnosis of DME using a novel set of features based on colour, wavelet decomposition and automatic lesion segmentation. These features are employed to train a classifier able to automatically diagnose DME through the presence of exudation. We present a new publicly available dataset with ground-truth data containing 169 patients from various ethnic groups and levels of DME. This and two other publicly available datasets are employed to evaluate our algorithm. We are able to achieve diagnosis performance comparable to retina experts on the MESSIDOR dataset (an independently labelled dataset with 1200 images) with cross-dataset testing (e.g., the classifier was trained on an independent dataset and tested on MESSIDOR). Our algorithm obtained an AUC between 0.88 and 0.94 depending on the dataset/features used. Additionally, it does not need ground truth at the lesion level to reject false positives, and it is computationally efficient, generating a diagnosis in an average of 4.4 s per image (9.3 s including optic nerve localization) on a 2.6 GHz platform with an unoptimized Matlab implementation.

  18. A robust post-processing workflow for datasets with motion artifacts in diffusion kurtosis imaging.

    Science.gov (United States)

    Li, Xianjun; Yang, Jian; Gao, Jie; Luo, Xue; Zhou, Zhenyu; Hu, Yajie; Wu, Ed X; Wan, Mingxi

    2014-01-01

    The aim of this study was to develop a robust post-processing workflow for motion-corrupted datasets in diffusion kurtosis imaging (DKI). The proposed workflow consisted of brain extraction, rigid registration, distortion correction, artifact rejection, spatial smoothing and tensor estimation. Rigid registration was utilized to correct misalignments. Motion artifacts were rejected by using the local Pearson correlation coefficient (LPCC). The performance of the LPCC in characterizing relative differences between artifacts and artifact-free images was compared with that of the conventional correlation coefficient in 10 randomly selected DKI datasets. The influence of rejected artifacts, together with the information on gradient directions and b values, on the parameter estimation was investigated using the mean square error (MSE). The variance of the noise was used as the criterion for the MSEs. The clinical practicality of the proposed workflow was evaluated by the image quality and by measurements in regions of interest on 36 DKI datasets, including 18 artifact-free datasets (18 pediatric subjects) and 18 motion-corrupted datasets (15 pediatric subjects and 3 essential tremor patients). The relative difference between artifacts and artifact-free images calculated by the LPCC was larger than that of the conventional correlation coefficient. The workflow improved the image quality and significantly reduced the measurement biases on motion-corrupted datasets. The workflow was reliable in improving the image quality and the measurement precision of the derived parameters on motion-corrupted DKI datasets, and provides an effective post-processing method for clinical applications of DKI in subjects with involuntary movements.
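
    A toy version of the LPCC idea: correlate an acquired image with a reference over local patches and treat a low median local correlation as a sign of motion. The patch size, the 2-D simplification and the synthetic "motion" below are illustrative choices, not the paper's settings:

        import numpy as np

        def local_pearson(img, ref, patch=8):
            """Median Pearson r over non-overlapping patches of two 2-D images."""
            rs = []
            for i in range(0, img.shape[0] - patch + 1, patch):
                for j in range(0, img.shape[1] - patch + 1, patch):
                    a = img[i:i + patch, j:j + patch].ravel()
                    b = ref[i:i + patch, j:j + patch].ravel()
                    if a.std() > 0 and b.std() > 0:
                        rs.append(np.corrcoef(a, b)[0, 1])
            return float(np.median(rs))

        rng = np.random.default_rng(1)
        ref = rng.normal(size=(64, 64))
        clean = ref + 0.1 * rng.normal(size=ref.shape)
        corrupt = np.roll(ref, 9, axis=0) + 0.1 * rng.normal(size=ref.shape)  # "motion"
        print(local_pearson(clean, ref), local_pearson(corrupt, ref))  # high vs low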

  19. Sciences & Nature

    African Journals Online (AJOL)

    Sciences & Nature, the Scientific Journal edited by the University of ... Subjects covered include agronomy, sciences of the earth, environment, biological, ...

  20. Marine Science

    African Journals Online (AJOL)

    Aims and scope: The Western Indian Ocean Journal of Marine Science provides an avenue for the wide dissemination of high ... circulation patterns include the nutrient-rich Somali ...

  1. Dataset of Building and Environment Publication in 2016, A reference method for measuring emissions of SVOCs in small chambers

    Data.gov (United States)

    U.S. Environmental Protection Agency — The data presented in this data file is a product of a journal publication. The dataset contains DEHP air concentrations in the emission test chamber. This dataset...

  2. Sound Science

    Science.gov (United States)

    Sickel, Aaron J.; Lee, Michele H.; Pareja, Enrique M.

    2010-01-01

    How can a teacher simultaneously teach science concepts through inquiry while helping students learn about the nature of science? After pondering this question in their own teaching, the authors developed a 5E learning cycle lesson (Bybee et al. 2006) that concurrently embeds opportunities for fourth-grade students to (a) learn a science concept,…

  3. Web Syndication Approaches for Sharing Primary Data in "Small Science" Domains

    Directory of Open Access Journals (Sweden)

    Eric C Kansa

    2010-06-01

    In some areas of science, sophisticated web services and semantics underlie "cyberinfrastructure". However, in "small science" domains, especially in field sciences such as archaeology, conservation, and public health, datasets often resist standardization. Publishing data in the small sciences should embrace this diversity rather than attempt to corral research into "universal" (domain) standards. A growing ecosystem of increasingly powerful Web-syndication-based approaches for sharing data on the public Web can offer a viable alternative. Atom-feed-based services can be used with scientific collections to identify and create linkages across different datasets, even across disciplinary boundaries, without shared domain standards.
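
    A minimal sketch of the kind of lightweight Atom syndication the article advocates, using only the Python standard library; the record fields and URLs are hypothetical.

      # Expose dataset records as a simple Atom feed (RFC 4287 basics only).
      from xml.etree import ElementTree as ET
      from datetime import datetime, timezone

      ATOM = "http://www.w3.org/2005/Atom"
      ET.register_namespace("", ATOM)

      def atom_feed(feed_id, title, records):
          feed = ET.Element(f"{{{ATOM}}}feed")
          ET.SubElement(feed, f"{{{ATOM}}}id").text = feed_id
          ET.SubElement(feed, f"{{{ATOM}}}title").text = title
          ET.SubElement(feed, f"{{{ATOM}}}updated").text = (
              datetime.now(timezone.utc).isoformat())
          for rec in records:
              entry = ET.SubElement(feed, f"{{{ATOM}}}entry")
              ET.SubElement(entry, f"{{{ATOM}}}id").text = rec["id"]
              ET.SubElement(entry, f"{{{ATOM}}}title").text = rec["title"]
              ET.SubElement(entry, f"{{{ATOM}}}updated").text = rec["updated"]
              ET.SubElement(entry, f"{{{ATOM}}}link", href=rec["href"])
          return ET.tostring(feed, encoding="unicode")

      # Hypothetical records from a small-science collection:
      print(atom_feed("urn:example:finds", "Excavation datasets",
                      [{"id": "urn:example:finds:1",
                        "title": "Trench 4 ceramics",
                        "updated": "2010-06-01T00:00:00Z",
                        "href": "https://example.org/finds/1"}]))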

  4. Data-Driven Decision Support for Radiologists: Re-using the National Lung Screening Trial Dataset for Pulmonary Nodule Management

    OpenAIRE

    Morrison, James J.; Hostetter, Jason; Wang, Kenneth; Siegel, Eliot L.

    2014-01-01

    Real-time mining of large research trial datasets enables development of case-based clinical decision support tools. Several applicable research datasets exist including the National Lung Screening Trial (NLST), a dataset unparalleled in size and scope for studying population-based lung cancer screening. Using these data, a clinical decision support tool was developed which matches patient demographics and lung nodule characteristics to a cohort of similar patients. The NLST dataset was conve...
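
    A minimal sketch of the case-matching idea, assuming nearest-neighbour search over standardised demographic and nodule features; the feature set is illustrative, not the NLST schema.

      # Retrieve the most similar cohort cases for a new patient.
      import numpy as np
      from sklearn.neighbors import NearestNeighbors
      from sklearn.preprocessing import StandardScaler

      # columns: age, pack_years, nodule_diameter_mm, spiculated (0/1)
      cohort = np.array([[62, 30,  8.0, 0], [71, 45, 12.5, 1],
                         [58, 20,  5.0, 0], [66, 50, 15.0, 1]], float)
      scaler = StandardScaler().fit(cohort)
      index = NearestNeighbors(n_neighbors=2).fit(scaler.transform(cohort))

      patient = np.array([[65, 40, 11.0, 1]], float)
      _, idx = index.kneighbors(scaler.transform(patient))
      print("most similar cohort cases:", idx[0])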

  5. Benchmarking of Typical Meteorological Year datasets dedicated to Concentrated-PV systems

    Science.gov (United States)

    Realpe, Ana Maria; Vernay, Christophe; Pitaval, Sébastien; Blanc, Philippe; Wald, Lucien; Lenoir, Camille

    2016-04-01

    Accurate analysis of meteorological and pyranometric data for long-term analysis is the basis of decision-making for banks and investors regarding solar energy conversion systems. This has led to the development of methodologies for the generation of Typical Meteorological Year (TMY) datasets. The most widely used method for solar energy conversion systems was proposed in 1978 by the Sandia Laboratory (Hall et al., 1978), considering a specific weighted combination of different meteorological variables, notably global and diffuse horizontal and direct normal irradiances, air temperature, wind speed and relative humidity. In 2012, a new approach was proposed in the framework of the European FP7 project ENDORSE. It introduced the concept of a "driver", defined by the user as an explicit function of the relevant pyranometric and meteorological variables, to improve the representativeness of TMY datasets with respect to the specific solar energy conversion system of interest. The present study aims at comparing and benchmarking different TMY datasets considering a specific Concentrated-PV (CPV) system as the solar energy conversion system of interest. Using long-term (15+ years) time series of high quality meteorological and pyranometric ground measurements, three types of TMY datasets were generated by the following methods: the Sandia method, a simplified driver with DNI as the only representative variable, and a more sophisticated driver. The latter takes into account the sensitivities of the CPV system with respect to the spectral distribution of the solar irradiance and wind speed. Different TMY datasets from the three methods have been generated considering different numbers of years in the historical dataset, ranging from 5 to 15 years. The comparisons and benchmarking of these TMY datasets are conducted considering the long-term time series of simulated CPV electric production as a reference. The results of this benchmarking clearly show that the Sandia method is not
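
    A minimal sketch of Sandia-style TMY selection: for each calendar month, candidate years are ranked by a weighted sum of Finkelstein-Schafer (FS) statistics comparing each year's empirical CDF with the long-term CDF. The weights below are placeholders, not the published Sandia weights.

      import numpy as np

      def fs_statistic(sample, long_term):
          """Mean absolute CDF difference, evaluated at long-term values."""
          grid = np.sort(long_term)
          cdf_s = np.searchsorted(np.sort(sample), grid,
                                  side="right") / len(sample)
          cdf_l = np.arange(1, len(grid) + 1) / len(grid)
          return np.abs(cdf_s - cdf_l).mean()

      def pick_tmy_year(monthly, weights):
          """monthly: {variable: {year: daily values}}; returns best year."""
          years = next(iter(monthly.values())).keys()
          def score(y):
              return sum(
                  w * fs_statistic(monthly[v][y],
                                   np.concatenate(list(monthly[v].values())))
                  for v, w in weights.items())
          return min(years, key=score)

      # Toy data: 15 years of synthetic daily DNI for one calendar month.
      rng = np.random.default_rng(2)
      monthly = {"dni": {y: rng.gamma(2, 200, 30) for y in range(2000, 2015)}}
      print(pick_tmy_year(monthly, {"dni": 1.0}))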

  6. REM-3D Reference Datasets: Reconciling large and diverse compilations of travel-time observations

    Science.gov (United States)

    Moulik, P.; Lekic, V.; Romanowicz, B. A.

    2017-12-01

    A three-dimensional Reference Earth model (REM-3D) should ideally represent the consensus view of long-wavelength heterogeneity in the Earth's mantle through the joint modeling of large and diverse seismological datasets. This requires reconciliation of datasets obtained using various methodologies and identification of consistent features. The goal of REM-3D datasets is to provide a quality-controlled and comprehensive set of seismic observations that would not only enable construction of REM-3D, but also allow identification of outliers and assist in more detailed studies of heterogeneity. The community response to data solicitation has been enthusiastic with several groups across the world contributing recent measurements of normal modes, (fundamental mode and overtone) surface waves, and body waves. We present results from ongoing work with body and surface wave datasets analyzed in consultation with a Reference Dataset Working Group. We have formulated procedures for reconciling travel-time datasets that include: (1) quality control for salvaging missing metadata; (2) identification of and reasons for discrepant measurements; (3) homogenization of coverage through the construction of summary rays; and (4) inversions of structure at various wavelengths to evaluate inter-dataset consistency. In consultation with the Reference Dataset Working Group, we retrieved the station and earthquake metadata in several legacy compilations and codified several guidelines that would facilitate easy storage and reproducibility. We find strong agreement between the dispersion measurements of fundamental-mode Rayleigh waves, particularly when made using supervised techniques. The agreement deteriorates substantially in surface-wave overtones, for which discrepancies vary with frequency and overtone number. A half-cycle band of discrepancies is attributed to reversed instrument polarities at a limited number of stations, which are not reflected in the instrument response history
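
    A minimal sketch of coverage homogenization through summary rays: picks falling in the same source-receiver geometry cell are collapsed to a median residual. The 5° cell size is an illustrative assumption.

      # Collapse redundant travel-time picks into summary rays.
      from collections import defaultdict
      import statistics

      def summary_rays(picks, cell=5.0):
          """picks: iterable of (src_lat, src_lon, rcv_lat, rcv_lon, residual)."""
          bins = defaultdict(list)
          for slat, slon, rlat, rlon, res in picks:
              key = tuple(round(x / cell) for x in (slat, slon, rlat, rlon))
              bins[key].append(res)
          return {k: statistics.median(v) for k, v in bins.items()}

      picks = [(10.2, 40.1, -5.0, 120.3, 1.2),
               (10.4, 40.3, -5.1, 120.1, 1.6),
               (60.0, -30.0, 5.0, 80.0, -0.4)]
      print(summary_rays(picks))   # the first two picks collapse into one cell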

  7. The Climate Hazards group InfraRed Precipitation with Stations (CHIRPS) dataset and its applications in drought risk management

    Science.gov (United States)

    Shukla, Shraddhanand; Funk, Chris; Peterson, Pete; McNally, Amy; Dinku, Tufa; Barbosa, Humberto; Paredes-Trejo, Franklin; Pedreros, Diego; Husak, Greg

    2017-04-01

    A high quality, long-term, high-resolution precipitation dataset is key to supporting drought-related risk management and food security early warning. Here, we present the Climate Hazards group InfraRed Precipitation with Stations (CHIRPS) v2.0, developed by scientists at the University of California, Santa Barbara and the U.S. Geological Survey Earth Resources Observation and Science Center under the direction of the Famine Early Warning Systems Network (FEWS NET). CHIRPS is a quasi-global precipitation product made available at daily to seasonal time scales with a spatial resolution of 0.05° and a 1981 to near real-time period of record. We begin by describing the three main components of CHIRPS (a high-resolution climatology, time-varying cold cloud duration precipitation estimates, and in situ precipitation estimates) and how they are combined. We then present a validation of this dataset and describe how CHIRPS is being disseminated and used in different applications, such as large-scale hydrologic models and crop water balance models. Validation of CHIRPS has focused on comparisons with precipitation products with global coverage, long periods of record and near real-time availability, such as the CPC-Unified, CFS Reanalysis and ECMWF datasets, and with datasets such as GPCC and GPCP that incorporate high quality in situ data from places such as Uganda, Colombia, and the Sahel. CHIRPS is shown to have low systematic errors (bias) and low mean absolute errors. We find that CHIRPS performance appears quite similar to research quality products like GPCC and GPCP, but with higher resolution and lower latency. We also present results from independent validation studies focused on South America and East Africa. CHIRPS is currently being used to drive the FEWS NET Land Data Assimilation System (FLDAS), which incorporates multiple hydrologic models, and the Water Requirement Satisfaction Index (WRSI), a widely used crop water balance model. The outputs (such as
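
    A minimal sketch of the bias and mean-absolute-error comparison used in such validations, with hypothetical co-located monthly totals standing in for gridded product and gauge data.

      # Systematic error (bias) and MAE of a product against station gauges.
      import numpy as np

      def bias_and_mae(product, gauges):
          product, gauges = np.asarray(product, float), np.asarray(gauges, float)
          ok = ~np.isnan(product) & ~np.isnan(gauges)   # skip missing pairs
          diff = product[ok] - gauges[ok]
          return diff.mean(), np.abs(diff).mean()

      # Hypothetical monthly totals (mm) at co-located points:
      chirps = [120.0, 85.5, np.nan, 40.2]
      gauge  = [115.0, 90.0, 60.0, 38.0]
      print("bias=%.2f mm, MAE=%.2f mm" % bias_and_mae(chirps, gauge))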

  8. P185-M Protein Identification and Validation of Results in Workflows that Integrate over Various Instruments, Datasets, Search Engines

    Science.gov (United States)

    Hufnagel, P.; Glandorf, J.; Körting, G.; Jabs, W.; Schweiger-Hufnagel, U.; Hahner, S.; Lubeck, M.; Suckau, D.

    2007-01-01

    Analysis of complex proteomes often results in long protein lists, but falls short in validating the identification and quantification results for a greater number of proteins. Biological and technical replicates are mandatory, as is the combination of MS data from various workflows (gels, 1D-LC, 2D-LC), instruments (TOF/TOF, trap, qTOF or FTMS), and search engines. We describe a database-driven study that combines two workflows, two mass spectrometers, and four search engines, with protein identification following a decoy database strategy. The sample was a tryptically digested lysate (10,000 cells) of a human colorectal cancer cell line. Data from two LC-MALDI-TOF/TOF runs and a 2D-LC-ESI-trap run using capillary and nano-LC columns were submitted to the proteomics software platform ProteinScape. The combined MALDI data and the ESI data were searched using Mascot (Matrix Science), Phenyx (GeneBio), ProteinSolver (Bruker and Protagen), and Sequest (Thermo) against a decoy database generated from IPI-human in order to obtain one protein list across all workflows and search engines at a defined maximum false-positive rate of 5%. ProteinScape combined the data into one LC-MALDI and one LC-ESI dataset. The initial separate searches from the two combined datasets generated eight independent peptide lists. These were compiled into an integrated protein list using the ProteinExtractor algorithm. An initial evaluation of the generated data led to the identification of approximately 1200 proteins. Result integration on a peptide level allowed discrimination of protein isoforms that would not have been possible with a mere combination of protein lists.
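
    A minimal sketch of decoy-database filtering at a fixed maximum false-positive rate, as in the 5% threshold the study uses; the scores and the stopping rule details are illustrative.

      # Accept hits in descending score order until the decoy fraction
      # among accepted hits would exceed the target rate.
      def filter_at_fdr(hits, max_fdr=0.05):
          """hits: list of (score, is_decoy), higher score = better."""
          hits = sorted(hits, key=lambda h: h[0], reverse=True)
          kept, decoys = [], 0
          for i, (score, is_decoy) in enumerate(hits, start=1):
              decoys += is_decoy
              if decoys / i > max_fdr:
                  break
              kept.append((score, is_decoy))
          return [s for s, d in kept if not d]     # surviving target hits

      hits = [(9.1, False), (8.7, False), (8.2, True), (7.9, False), (7.5, True)]
      print(len(filter_at_fdr(hits)))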

  9. ORBDA: An openEHR benchmark dataset for performance assessment of electronic health record servers.

    Directory of Open Access Journals (Sweden)

    Douglas Teodoro

    The openEHR specifications are designed to support implementation of flexible and interoperable Electronic Health Record (EHR) systems. Despite the increasing number of solutions based on the openEHR specifications, it is difficult to find publicly available healthcare datasets in the openEHR format that can be used to test, compare and validate different data persistence mechanisms for openEHR. To foster research on openEHR servers, we present the openEHR Benchmark Dataset, ORBDA, a very large healthcare benchmark dataset encoded using the openEHR formalism. To construct ORBDA, we extracted and cleaned a de-identified dataset from the Brazilian National Healthcare System (SUS) containing hospitalisation and high complexity procedures information and formalised it using a set of openEHR archetypes and templates. Then, we implemented a tool to enrich the raw relational data and convert it into the openEHR model using the openEHR Java reference model library. The ORBDA dataset is available in composition, versioned composition and EHR openEHR representations in XML and JSON formats. In total, the dataset contains more than 150 million composition records. We describe the dataset and provide means to access it. Additionally, we demonstrate the usage of ORBDA for evaluating inserting throughput and query latency performances of some NoSQL database management systems. We believe that ORBDA is a valuable asset for assessing storage models for openEHR-based information systems during the software engineering process. It may also be a suitable component in future standardised benchmarking of available openEHR storage platforms.

  11. [Research on developing the spectral dataset for Dunhuang typical colors based on color constancy].

    Science.gov (United States)

    Liu, Qiang; Wan, Xiao-Xia; Liu, Zhen; Li, Chan; Liang, Jin-Xing

    2013-11-01

    The present paper aims at developing a method to reasonably set up a typical spectral color dataset for different kinds of Chinese cultural heritage in the color rendering process. The world-famous wall paintings in the Dunhuang Mogao Grottoes, dating from more than 1700 years ago, were taken as the typical case in this research. In order to maintain color constancy during the color rendering workflow of Dunhuang cultural relics, a chromatic-adaptation-based method for developing the spectral dataset of typical colors for these wall paintings was proposed from the viewpoint of human visual perception. With the help and guidance of researchers at the art-research and protection-research institutions of the Dunhuang Academy, and according to the existing research achievements of Dunhuang studies over the past years, 48 typical known Dunhuang pigments were chosen and 240 representative color samples were made, whose reflectance spectra ranging from 360 to 750 nm were acquired by a spectrometer. In order to find the typical colors among the above-mentioned color samples, the original dataset was divided into several subgroups by clustering analysis. The grouping number, together with the most typical samples for each subgroup which made up the initially built typical color dataset, was determined by the Wilcoxon signed-rank test according to the color inconstancy index comprehensively calculated under 6 typical illuminating conditions. Considering the completeness of the gamut of the Dunhuang wall paintings, 8 complementary colors were determined and finally the typical spectral color dataset was built, containing 100 representative spectral colors. The analytical results show that the median color inconstancy index of the built dataset at the 99% confidence level by the Wilcoxon signed-rank test was 3.28 and the 100 colors are distributed uniformly across the whole gamut, which ensures that this dataset can provide a reasonable reference for choosing the color with highest
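
    A minimal sketch of reducing measured reflectance spectra to representative typical colors by clustering and keeping the sample nearest each cluster centre; the paper's color inconstancy index under six illuminants and the Wilcoxon-based choice of group number are not reproduced.

      import numpy as np
      from sklearn.cluster import KMeans

      rng = np.random.default_rng(3)
      spectra = rng.random((240, 40))     # 240 samples x 40 bands (360-750 nm)

      # Cluster spectra and keep the member closest to each centroid.
      km = KMeans(n_clusters=48, n_init=10, random_state=0).fit(spectra)
      typical = []
      for k in range(km.n_clusters):
          members = np.flatnonzero(km.labels_ == k)
          d = np.linalg.norm(spectra[members] - km.cluster_centers_[k], axis=1)
          typical.append(members[np.argmin(d)])
      print(len(typical), "representative spectra selected")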

  12. A procedure to validate and correct the ¹³C chemical shift calibration of RNA datasets

    Energy Technology Data Exchange (ETDEWEB)

    Aeschbacher, Thomas; Schubert, Mario, E-mail: schubert@mol.biol.ethz.ch; Allain, Frederic H.-T., E-mail: allain@mol.biol.ethz.ch [ETH Zuerich, Institute for Molecular Biology and Biophysics (Switzerland)

    2012-02-15

    Chemical shifts reflect the structural environment of a certain nucleus and can be used to extract structural and dynamic information. Proper calibration is indispensable to extract such information from chemical shifts. Whereas a variety of procedures exist to verify the chemical shift calibration for proteins, no such procedure is available for RNAs to date. We present here a procedure to analyze and correct the calibration of ¹³C NMR data of RNAs. Our procedure uses five ¹³C chemical shifts as a reference, each of them found in a narrow shift range in most datasets deposited in the Biological Magnetic Resonance Bank. In 49 datasets we could evaluate the ¹³C calibration and detect errors or inconsistencies in RNA ¹³C chemical shifts based on these chemical shift reference values. More than half of the datasets (27 out of those 49) were found to be improperly referenced or contained inconsistencies. This large inconsistency rate possibly explains why no clear structure-¹³C chemical shift relationship has emerged for RNA so far. We were able to recalibrate or correct 17 datasets, resulting in 39 usable ¹³C datasets. 6 new datasets from our lab were used to verify our method, increasing the database to 45 usable datasets. We can now search for structure-chemical shift relationships with this improved list of ¹³C chemical shift data. This is demonstrated by a clear relationship between ribose ¹³C shifts and the sugar pucker, which can be used to predict a C2′- or C3′-endo conformation of the ribose with high accuracy. The improved quality of the chemical shift data allows statistical analysis with the potential to facilitate assignment procedures, and the extraction of restraints for structure calculations of RNA.
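
    A minimal sketch of a reference-shift calibration check: compare a dataset's mean shifts for narrow-range reference carbons against expected values and derive a constant offset. The reference names and ppm values below are placeholders, not the five references established in the paper.

      import statistics

      REFERENCE_13C = {"C1'": 92.0, "C5 (pyrimidine)": 97.0}  # ppm, illustrative

      def calibration_offset(observed, tolerance=0.5):
          """observed: {reference name: [shifts in this dataset]}.
          Returns a constant correction, or None if the deviations from the
          references disagree (i.e., the dataset is internally inconsistent)."""
          deltas = [statistics.mean(v) - REFERENCE_13C[k]
                    for k, v in observed.items()]
          offset = statistics.mean(deltas)
          spread = max(deltas) - min(deltas)
          return offset if spread < tolerance else None

      print(calibration_offset({"C1'": [92.4, 92.5],
                                "C5 (pyrimidine)": [97.4]}))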

  13. Large Scale Flood Risk Analysis using a New Hyper-resolution Population Dataset

    Science.gov (United States)

    Smith, A.; Neal, J. C.; Bates, P. D.; Quinn, N.; Wing, O.

    2017-12-01

    Here we present the first national-scale flood risk analyses using high-resolution Facebook Connectivity Lab population data and data from a hyper-resolution flood hazard model. In recent years the field of large-scale hydraulic modelling has been transformed by new remotely sensed datasets, improved process representation, highly efficient flow algorithms and increases in computational power. These developments have allowed flood risk analysis to be undertaken in previously unmodelled territories and from continental to global scales. Flood risk analyses are typically conducted via the integration of modelled water depths with an exposure dataset. Over large scales and in data-poor areas, these exposure data typically take the form of a gridded population dataset, estimating population density using remotely sensed data and/or locally available census data. The local nature of flooding dictates that for robust flood risk analysis to be undertaken, both hazard and exposure data should sufficiently resolve local-scale features. Global flood frameworks are enabling flood hazard data to be produced at 90 m resolution, resulting in a mismatch with available population datasets, which are typically more coarsely resolved. Moreover, these exposure data are typically focused on urban areas and struggle to represent rural populations. In this study we integrate a new population dataset with a global flood hazard model. The population dataset was produced by the Connectivity Lab at Facebook, providing gridded population data at 5 m resolution, representing a resolution increase over previous countrywide datasets of multiple orders of magnitude. Flood risk analyses undertaken over a number of developing countries are presented, along with a comparison of flood risk analyses undertaken using pre-existing population datasets.
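
    A minimal sketch of the hazard-exposure integration described: summing gridded population wherever modelled water depth exceeds a threshold, assuming both grids are already aligned on a common raster.

      import numpy as np

      def exposed_population(depth, population, threshold=0.0):
          """depth, population: aligned 2D grids; people in flooded cells."""
          return float(population[depth > threshold].sum())

      # Synthetic hazard and exposure layers standing in for model output:
      rng = np.random.default_rng(4)
      depth = rng.exponential(0.3, (100, 100))       # metres
      pop = rng.poisson(2.0, (100, 100)).astype(float)
      print(exposed_population(depth, pop, threshold=0.5))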

  14. ORBDA: An openEHR benchmark dataset for performance assessment of electronic health record servers.

    Science.gov (United States)

    Teodoro, Douglas; Sundvall, Erik; João Junior, Mario; Ruch, Patrick; Miranda Freire, Sergio

    2018-01-01

    The openEHR specifications are designed to support implementation of flexible and interoperable Electronic Health Record (EHR) systems. Despite the increasing number of solutions based on the openEHR specifications, it is difficult to find publicly available healthcare datasets in the openEHR format that can be used to test, compare and validate different data persistence mechanisms for openEHR. To foster research on openEHR servers, we present the openEHR Benchmark Dataset, ORBDA, a very large healthcare benchmark dataset encoded using the openEHR formalism. To construct ORBDA, we extracted and cleaned a de-identified dataset from the Brazilian National Healthcare System (SUS) containing hospitalisation and high complexity procedures information and formalised it using a set of openEHR archetypes and templates. Then, we implemented a tool to enrich the raw relational data and convert it into the openEHR model using the openEHR Java reference model library. The ORBDA dataset is available in composition, versioned composition and EHR openEHR representations in XML and JSON formats. In total, the dataset contains more than 150 million composition records. We describe the dataset and provide means to access it. Additionally, we demonstrate the usage of ORBDA for evaluating inserting throughput and query latency performances of some NoSQL database management systems. We believe that ORBDA is a valuable asset for assessing storage models for openEHR-based information systems during the software engineering process. It may also be a suitable component in future standardised benchmarking of available openEHR storage platforms.
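
    A minimal sketch of the two benchmark dimensions ORBDA targets, insert throughput and query latency, timed here against a trivial in-memory store standing in for a real openEHR/NoSQL backend so the sketch is self-contained.

      import time

      def benchmark(insert, query, compositions, queries):
          t0 = time.perf_counter()
          for doc in compositions:
              insert(doc)
          throughput = len(compositions) / (time.perf_counter() - t0)
          latencies = []
          for q in queries:
              t0 = time.perf_counter()
              query(q)
              latencies.append(time.perf_counter() - t0)
          return throughput, sum(latencies) / len(latencies)

      # In practice, insert/query would wrap a database client's calls.
      store = {}
      docs = [{"id": i, "body": "composition %d" % i} for i in range(10_000)]
      tp, lat = benchmark(lambda d: store.__setitem__(d["id"], d),
                          lambda q: store.get(q), docs, list(range(100)))
      print("inserts/s=%.0f, mean query latency=%.2e s" % (tp, lat))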

  15. Evolving hard problems: Generating human genetics datasets with a complex etiology

    Directory of Open Access Journals (Sweden)

    Himmelstein Daniel S

    2011-07-01

    Background: A goal of human genetics is to discover genetic factors that influence individuals' susceptibility to common diseases. Most common diseases are thought to result from the joint failure of two or more interacting components instead of single-component failures. This greatly complicates both the task of selecting informative genetic variants and the task of modeling interactions between them. We and others have previously developed algorithms to detect and model the relationships between these genetic factors and disease. Previously these methods have been evaluated with datasets simulated according to pre-defined genetic models. Results: Here we develop and evaluate a model-free evolution strategy to generate datasets which display a complex relationship between individual genotype and disease susceptibility. We show that this model-free approach is capable of generating a diverse array of datasets with distinct gene-disease relationships for an arbitrary interaction order and sample size. We specifically generate eight hundred Pareto fronts, one for each independent run of our algorithm. In each run the predictiveness of single genetic variants and pairs of genetic variants has been minimized, while the predictiveness of third-, fourth-, or fifth-order combinations is maximized. Two hundred runs of the algorithm are further dedicated to creating datasets with predictive fourth- or fifth-order interactions and minimized lower-level effects. Conclusions: This method and the resulting datasets allow the capabilities of novel methods to be tested without pre-specified genetic models. This allows researchers to evaluate which methods will succeed on human genetics problems where the model is not known in advance. We further make the entire Pareto-optimal front of datasets from each run freely available to the community so that novel methods may be rigorously evaluated. These 76,600 datasets are available from http://discovery.dartmouth.edu/model_free_data/.
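
    A minimal sketch of the model-free idea using a (1+1) evolution strategy over a three-locus penetrance table, rewarding interaction-order predictiveness while penalizing single-locus main effects; the paper's multi-objective Pareto-front machinery is not reproduced.

      import itertools
      import numpy as np

      rng = np.random.default_rng(5)
      GENOS = list(itertools.product(range(3), repeat=3))  # 27 genotype combos

      def fitness(table):
          """Full-table risk spread minus the worst single-locus marginal
          spread: high values mean interaction without main effects."""
          high = table.max() - table.min()
          low = 0.0
          for locus in range(3):
              marginals = [np.mean([table[i] for i, g in enumerate(GENOS)
                                    if g[locus] == a]) for a in range(3)]
              low = max(low, max(marginals) - min(marginals))
          return high - low

      table = rng.random(27)                 # penetrance per genotype combo
      for _ in range(2000):                  # (1+1)-ES, Gaussian mutation
          child = np.clip(table + rng.normal(0, 0.05, 27), 0, 1)
          if fitness(child) >= fitness(table):
              table = child
      print("fitness=%.3f" % fitness(table))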

  16. Automatic registration method for multisensor datasets adopted for dimensional measurements on cutting tools

    International Nuclear Information System (INIS)

    Shaw, L; Mehari, F; Weckenmann, A; Ettl, S; Häusler, G

    2013-01-01

    Multisensor systems with optical 3D sensors are frequently employed to capture complete surface information by measuring workpieces from different views. During coarse and fine registration the resulting datasets are afterwards transformed into one common coordinate system. Automatic fine registration methods are well established in dimensional metrology, whereas there is a deficit in automatic coarse registration methods. The advantage of a fully automatic registration procedure is twofold: it enables a fast and contact-free alignment, and it can be applied flexibly to datasets from any kind of optical 3D sensor. In this paper, an algorithm adapted for robust automatic coarse registration is presented. The method was originally developed for the field of object reconstruction and localization. It is based on a segmentation of planes in the datasets to calculate the transformation parameters. The rotation is defined by the normals of three corresponding segmented planes of two overlapping datasets, while the translation is calculated via the intersection point of the segmented planes. First results have shown that the translation is strongly shape-dependent: 3D data of objects with non-orthogonal planar flanks cannot be registered with the current method. In the novel supplement to the algorithm, the translation is additionally calculated via the distance between centroids of corresponding segmented planes, which results in more than one option for the transformation. A newly introduced measure considering the distance between the datasets after coarse registration selects the best possible transformation. Results of the robust automatic registration method are presented on the example of datasets taken from a cutting tool with a fringe-projection system and a focus-variation system. The successful application in dimensional metrology is proven with evaluations of shape parameters based on the registered datasets of a calibrated workpiece.
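
    A minimal sketch of the plane-based coarse registration: rotation from three corresponding segmented-plane normals via SVD (Kabsch) and translation from corresponding plane centroids, as in the supplemented method; plane segmentation itself is assumed already done.

      import numpy as np

      def rotation_from_normals(n_src, n_dst):
          """n_src, n_dst: 3x3 arrays, one unit normal per row."""
          h = n_src.T @ n_dst
          u, _, vt = np.linalg.svd(h)
          d = np.sign(np.linalg.det(vt.T @ u.T))
          return vt.T @ np.diag([1.0, 1.0, d]) @ u.T   # proper rotation

      def coarse_register(normals_src, normals_dst, cent_src, cent_dst):
          r = rotation_from_normals(normals_src, normals_dst)
          t = cent_dst.mean(axis=0) - r @ cent_src.mean(axis=0)
          return r, t

      # Synthetic check: three orthogonal planes, known rotation + shift.
      ns = np.eye(3)
      r_true = np.array([[0, -1, 0], [1, 0, 0], [0, 0, 1]], float)
      nd = ns @ r_true.T
      cs = np.array([[1.0, 0, 0], [0, 2.0, 0], [0, 0, 3.0]])
      cd = cs @ r_true.T + np.array([5.0, 0, 0])
      r, t = coarse_register(ns, nd, cs, cd)
      print(np.allclose(r, r_true), t)       # True [5. 0. 0.]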

  17. Science Teaching in Science Education

    Science.gov (United States)

    Callahan, Brendan E.; Dopico, Eduardo

    2016-01-01

    Reading the interesting article "Discerning selective traditions in science education" by Per Sund, which is published in this issue of "CSSE," allows us to open the discussion on procedures for teaching science today. Clearly there is overlap between the teaching of science and other areas of knowledge. However, we must…

  18. Capabilities: Science Pillars

    Science.gov (United States)

    Los Alamos National Laboratory: Delivering science and technology to protect our nation and promote world stability.

  19. Faces of Science

    Science.gov (United States)

    Los Alamos National Laboratory: Delivering science and technology to protect our nation and promote world stability.

  20. Bradbury Science Museum

    Science.gov (United States)

    Los Alamos National Laboratory: Delivering science and technology to protect our nation and promote world stability.