WorldWideScience

Sample records for fluxnet synthesis dataset

  1. Fluxnet Synthesis Dataset Collaboration Infrastructure

    Energy Technology Data Exchange (ETDEWEB)

    Agarwal, Deborah A. [Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States); Humphrey, Marty [Univ. of Virginia, Charlottesville, VA (United States); van Ingen, Catharine [Microsoft. San Francisco, CA (United States); Beekwilder, Norm [Univ. of Virginia, Charlottesville, VA (United States); Goode, Monte [Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States); Jackson, Keith [Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States); Rodriguez, Matt [Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States); Weber, Robin [Univ. of California, Berkeley, CA (United States)

    2008-02-06

    The Fluxnet synthesis dataset originally compiled for the La Thuile workshop contained approximately 600 site years. Since the workshop, several additional site years have been added and the dataset now contains over 920 site years from over 240 sites. A data refresh update is expected to increase those numbers in the next few months. The ancillary data describing the sites continues to evolve as well. There are on the order of 120 site contacts and 60proposals have been approved to use thedata. These proposals involve around 120 researchers. The size and complexity of the dataset and collaboration has led to a new approach to providing access to the data and collaboration support and the support team attended the workshop and worked closely with the attendees and the Fluxnet project office to define the requirements for the support infrastructure. As a result of this effort, a new website (http://www.fluxdata.org) has been created to provide access to the Fluxnet synthesis dataset. This new web site is based on a scientific data server which enables browsing of the data on-line, data download, and version tracking. We leverage database and data analysis tools such as OLAP data cubes and web reports to enable browser and Excel pivot table access to the data.

  2. Assimilation exceeds respiration sensitivity to drought : A FLUXNET synthesis

    NARCIS (Netherlands)

    Schwalm, Christopher R.; Williams, Christopher A.; Schaefer, Kevin; Arneth, Almut; Bonal, Damien; Buchmann, Nina; Chen, Jiquan; Law, Beverlye; Lindroth, Anders; Luyssaert, Sebastiaan; Reichstein, Markus; Richardson, Andrew D.

    The intensification of the hydrological cycle, with an observed and modeled increase in drought incidence and severity, underscores the need to quantify drought effects on carbon cycling and the terrestrial sink. FLUXNET, a global network of eddy covariance towers, provides dense data streams of

  3. Nitrogen deposition's role in determining forest photosynthetic capacity; a FLUXNET synthesis

    Science.gov (United States)

    Fleischer, K.; Rebel, K.; van der Molen, M.; Erisman, J.; Wassen, M.; Dolman, H.

    2011-12-01

    There is growing evidence that nitrogen (N) deposition stimulates forest growth, as many forest ecosystems are N-limited. However, the significance of N deposition in determining the strength of the present and future terrestrial carbon sink is strongly debated. We investigated and quantified the effect of N deposition on ecosystem photosynthetic capacity (Amax) with the FLUXNET database, including 80 forest sites, covering the major forest types and climates of the world. The relative effect of climate and N deposition on photosynthesis was assessed with regression models. We found a significant positive correlation of Amax and N deposition for evergreen needleleaf forests in our dataset. We further found indications that foliar N and LAI scale positively with N deposition, reflecting the 2 mechanisms at which N is believed to cause an increase in carbon gain. We can support the hypothesis that foliar N is the principal scaling factor for canopy Amax across all forest types. Deciduous forests are less diverse in terms of climate and nutritional conditions for the included sites and these forests exhibited weak to no correlations with the included climate and N predictor variables. Quantifying the effect of N deposition on photosynthetic rates at the canopy level is an essential step for quantifying its contribution to the terrestrial carbon sink and for predicting vegetation response to N fertilization and global change in the future. The approach shows that eddy-covariance measurements of carbon fluxes at the canopy scale allow us to test hypotheses with respect to the expected nitrogen-photosynthesis relationships at the canopy scale.

  4. Here the data: the new FLUXNET collection and the future for model-data integration

    Science.gov (United States)

    Papale, D.; Pastorello, G.; Trotta, C.; Chu, H.; Canfora, E.; Agarwal, D.; Baldocchi, D. D.; Torn, M. S.

    2016-12-01

    Seven years after the release of the LaThuile FLUXNET database, widely used in synthesis activities and model-data fusion exercises, a new FLUXNET collection has been released (FLUXNET 2015 - http://fluxnet.fluxdata.org) with the aim to increase the quality of the measurements and provide high quality standardized data obtained by a new processing pipeline. The new FLUXNET collection includes also sites with timeseries of 20 years of continuous carbon and energy fluxes, opening new opportunities in their use in the context of models parameterization and validation. The main characteristics of the FLUXNET 2015 dataset are the uncertainty quantification, the multiple products (e.g. partitioning in photosynthesis and ecosystem respiration) that allow consistency analysis for each site, and new long term downscaled meteorological data provided with the data. Feedbacks from new users, in particular from the modelling communities, are crucial to further improve the quality of the products and move in the direction of a coherent integration across multi-disciplinary communities. In this presentation, the new FLUXNET2015 dataset will be explained and explored, with particular focus on the meaning of the different products and variables, their potentiality but also their limitations. The future development of the dataset will be discussed, with the role of the regional networks and the ongoing efforts to provide new and advanced services such a near real time data provision and a completely open access policy to high quality standardized measurements of GHGs exchanges and additional ecological quantities.

  5. Data Synthesis and Data Assimilation at Global Change Experiments and Fluxnet Toward Improving Land Process Models

    Energy Technology Data Exchange (ETDEWEB)

    Luo, Yiqi [Univ. of Oklahoma, Norman, OK (United States). Dept. of Microbiology and Plant Biology

    2017-09-12

    The project was conducted during the period from 7/1/2012 to 6/30/2017 with three major tasks: (1) data synthesis and development of data assimilation (DA) techniques to constrain modeled ecosystem feedback to climate change; (2) applications of DA techniques to improve process models at different scales from ecosystem to regions and the globe; and 3) improvements of modeling soil carbon (C) dynamics by land surface models. During this period, we have synthesized published data from soil incubation experiments (e.g., Chen et al., 2016; Xu et al., 2016; Feng et al., 2016), global change experiments (e.g., Li et al., 2013; Shi et al., 2015, 2016; Liang et al., 2016) and fluxnet (e.g., Niu et al., 2012., Xia et al., 2015; Li et al., 2016). These data have been organized into multiple data products and have been used to identify general mechanisms and estimate parameters for model improvement. We used the data sets that we collected and the DA techniques to improve model performance of both ecosystem models and global land models. The objectives are: 1) to improve model simulations of litter and soil carbon storage (e.g., Schädel et al., 2013; Hararuk and Luo, 2014; Hararuk et al., 2014; Liang et al., 2015); 2) to explore the effects of CO2, warming and precipitation on ecosystem processes (e.g., van Groenigen et al., 2014; Shi et al., 2015, 2016; Feng et al., 2017); and 3) to estimate parameters variability in different ecosystems (e.g., Li et al., 2016). We developed a traceability framework, which was based on matrix approaches and decomposed the modeled steady-state terrestrial ecosystem carbon storage capacity into four can trace the difference in ecosystem carbon storage capacity among different biomes to four traceable components: net primary productivity (NPP), baseline C residence times, environmental scalars and climate forcing (Xia et al., 2013). With this framework, we can diagnose the differences in modeled carbon storage across ecosystems

  6. Interaction Support for the Global Fluxnet Data Set

    Science.gov (United States)

    Agarwal, D.; Humphrey, M.; Beekwilder, N.; Goode, M.; Jackson, K.; Weber, R.; van Ingen, C.; Baldocchi, D.

    2008-12-01

    The FLUXNET synthesis data set contains on the order of 960 site-years of sensor data from over 260 sites around the world. This is a living data set; a data update this year should add new site-years from over 200 sites. The data are the ground truth for carbon-climate studies linking models and remote sensing as well as comparative field analyses. Over 65 synthesis teams are using this data to do global and regional scale analyses. The size of the dataset makes browsing the data difficult; for example, a search of the dataset for sites with particular meteorological characteristics would require a download of the complete dataset and then running all of the data through a preliminary analysis. Synthesis studies often need additional non- sensor measurements such as root biomass, soil composition, or fire occurrence; some of these variables require detailed knowledge of the site and the science. The large number of sites makes the assembly, cleaning, and long term curation of the non-sensor data daunting; a virtual conversation between the data providers, data users, and data curators is needed. The large number of sites also makes tracking updates to the site information and communicating with site PIs difficult for synthesis study teams. We have developed a collaborative web portal which enables data browsing on line, orchestrates the data curation virtual conversation, and enables the synthesis team conversation with sites. Behind the portal is an archive database and OLAP data cube for simple data browsing through query. Scientists can download data files, browse data summaries, update metadata and annotate the data through the portal. Synthesis teams can select sites and exchange e-mail with those sites through the portal. The data can also be browsed directly from Excel spreadsheets or MatLab from the scientist desktop; the scientist sees no difference between data "in the cloud" and on the desktop. We believe the portal enables science researchers to

  7. The role of spring and autumn phenological switches on spatiotemporal variation in temperate and boreal forest C balance: A FLUXNET synthesis

    Science.gov (United States)

    Richardson, A. D.; Reichstein, M.; Piao, S.; Ciais, P.; Luyssaert, S.; Stockli, R.; Friedl, M.; Gobron, N.; Fluxnet Site Pis, 21

    2009-04-01

    In temperate and boreal ecosystems, phenological transitions (particularly the timing of spring onset and autumn senescence) are thought to represent a major control on spatial and temporal variation in forest carbon sequestration. To investigate these patterns, we analyzed 153 site-years of data from the FLUXNET ‘La Thuile' database. Eddy covariance measurements of surface-atmosphere exchanges of carbon and water from 21 research sites at latitudes from 36°N to 67°N were used in the synthesis. We defined a range of phenological indicators based on the first (spring) and last (autumn) dates of (1) C source/sink transitions (‘carbon uptake period'); (2) measurable photosynthetic uptake (‘physiologically active period'); (3) relative thresholds for latent heat (evapotranspiration) flux; (4) phenological thresholds derived from a range of remote sensing products (JRC fAPAR, MOD12Q2, and the PROGNOSTIC model with MODIS data assimilation); and (5) a climatological metric based on the date where soil temperature equals mean annual air temperature. We then tested whether site-level flux anomalies were significantly correlated with phenological anomalies across these metrics, and whether the slopes of these relationships (representing the sensitivity to phenological variation) differed between deciduous broadleaf (DBF) and evergreen needleleaf (ENF) forests. Within sites, interannual variation in most phenological metrics was about 5-10 d, compared to 10-30 d across sites. Both spatial and temporal phenological variation were consistently larger at ENF, compared to DBF, sites. Averaged across metrics, phenological variability was roughly comparable in spring and autumn, both across (17 d) and within (9 d) sites. However, patterns of interannual variation in fluxes were less well explained by the derived phenological metrics than were patterns of spatial variation in fluxes. Also, the observed pattern strongly depended on the metric used, with flux-derived metrics

  8. History and Future for the Happy Marriage between the MODIS Land team and Fluxnet

    Science.gov (United States)

    Running, S. W.

    2015-12-01

    When I wrote the proposal to NASA in 1988 for daily global evapotranspiration and gross primary production algorithms for the MODIS sensor, I had no validation plan. Fluxnet probably saved my MODIS career by developing a global network of rigorously calibrated towers measuring water and carbon fluxes over a wide variety of ecosystems that I could not even envision at the time that first proposal was written. However my enthusiasm for Fluxnet was not reciprocated by the Fluxnet community until we began providing 7 x 7 pixel MODIS Land datasets exactly over each of their towers every 8 days, without them having to crawl thru the global datasets and make individual orders. This system, known informally as the MODIS ASCII cutouts, began in 2002 and operates at the Oak Ridge DAAC to this day, cementing a mutually beneficial data interchange between the Fluxnet and remote sensing communities. This talk will briefly discuss the history of MODIS validation with flux towers, and flux spatial scaling with MODIS data. More importantly I will detail the future continuity of global biophysical datasets in the post-MODIS era, and what next generation sensors will provide.

  9. FLUXNET. Database of fluxes, site characteristics, and flux-community information

    Energy Technology Data Exchange (ETDEWEB)

    Olson, R. J. [Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States); Holladay, S. K. [Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States); Cook, R. B. [Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States); Falge, E. [Univ. Bayreuth, Bayreuth (Germany); Baldocchi, D. [Univ. of California, Berkeley, CA (United States); Gu, L. [Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)

    2004-02-28

    FLUXNET is a “network of regional networks” created by international scientists to coordinate regional and global analysis of observations from micrometeorological tower sites. The flux tower sites use eddy covariance methods to measure the exchanges of carbon dioxide (CO2), water vapor, and energy between terrestrial ecosystems and the atmosphere. FLUXNET’S goals are to aid in understanding the mechanisms controlling the exchanges of CO2, water vapor, and energy across a range of time (0.5 hours to annual periods) and space scales. FLUXNET provides an infrastructure for the synthesis and analysis of world-wide, long-term flux data compiled from various regional flux networks. Information compiled by the FLUXNET project is being used to validate remote sensing products associated with the National Aeronautics and Space Administration (NASA) Terra and Aqua satellites. FLUXNET provides access to ground information for validating estimates of net primary productivity, and energy absorption that are being generated by the Moderate Resolution Imaging Spectroradiometer (MODIS) sensors. In addition, this information is also used to develop and validate ecosystem models.

  10. FLUXNET: A Global Network of Eddy-Covariance Flux Towers

    Science.gov (United States)

    Cook, R. B.; Holladay, S. K.; Margle, S. M.; Olsen, L. M.; Gu, L.; Heinsch, F.; Baldocchi, D.

    2003-12-01

    The FLUXNET global network was established to aid in understanding the mechanisms controlling the exchanges of carbon dioxide, water vapor, and energy across a variety of terrestrial ecosystems. Flux tower data are also being used to validate ecosystem model outputs and to provide information for validating remote sensing based products, including surface temperature, reflectance, albedo, vegetation indices, leaf area index, photosynthetically active radiation, and photosynthesis derived from MODIS sensors on the Terra and Aqua satellites. The global FLUXNET database provides consistent and complete flux data to support global carbon cycle science. Currently FLUXNET consists of over 210 sites, with most flux towers operating continuously for 4 years or longer. Gap-filled data are available for 53 sites. The FLUXNET database contains carbon, water vapor, sensible heat, momentum, and radiation flux measurements with associated ancillary and value-added data products. Towers are located in temperate conifer and broadleaf forests, tropical and boreal forests, crops, grasslands, chaparral, wetlands, and tundra on five continents. Selected MODIS Land products in the immediate vicinity of the flux tower are subsetted and posted on the FLUXNET Web site for 169 flux-towers. The MODIS subsets are prepared in ASCII format for 8-day periods for an area 7 x 7 km around the tower.

  11. We did well but we definitely have to do better: four critical points about fluxnet

    Science.gov (United States)

    Kutsch, W. L.

    2014-12-01

    Fluxnet is a real success story of data integration. The scientific outcome is overwhelming. Nevertheless: in a time of methodological consolidation and transfer of the networks to technically more integrated infrastructures, a critical view on its weak points may strengthen the future success and our position within biogeochemical science. Four points should be discussed: We have to select our sites more thoroughly. We need better data curation. We should think about 'forgetting' some of the older datasets. We have responsibility for the results of integration studies. ad 1: We had to learn during the past years that the EC is not applicable in all terrains. Slope and footprint problems are widespread and sites have to be critically scrutinized before being sure that we submit valuable ecological information. This is time consuming and may be frustrating since we have to accept that we had sometimes invested lots of work and money for building a flux tower at a site that is not suitable for the method. Nevertheless, a clear site quality policy should be developed among infrastructures and integrating activities. ad 2: In some cases it has turned out that the information about different steps leading from the raw data to a number in integrated scientific papers has been lost. This is a big challenge to research infrastructures that should develop common rules for data curation to increase trust in integration activities. ad 3: In the first approach Fluxnet left the responsibility for site and data quality to the site PI and accepted more or less all data submitted. Further approaches and in particular long-term infrastructures have to develop strategies to reject (or at least flag) data from sites that are prone by terrain problems. This includes that in future integration studies we should stop using some of the datasets from the 'wild old times' when we did not know better. ad4: We need a strategy to communicate with data users that are far away from practical

  12. Multi-site evaluation of terrestrial evaporation models using FLUXNET data

    KAUST Repository

    Ershadi, Ali; McCabe, Matthew; Evans, Jason P.; Chaney, Nathaniel W.; Wood, Eric F.

    2014-01-01

    We evaluated the performance of four commonly applied land surface evaporation models using a high-quality dataset of selected FLUXNET towers. The models that were examined include an energy balance approach (Surface Energy Balance System; SEBS), a combination-type technique (single-source Penman-Monteith; PM), a complementary method (advection-aridity; AA) and a radiation based approach (modified Priestley-Taylor; PT-JPL). Twenty FLUXNET towers were selected based upon satisfying stringent forcing data requirements and representing a wide range of biomes. These towers encompassed a number of grassland, cropland, shrubland, evergreen needleleaf forest and deciduous broadleaf forest sites. Based on the mean value of the Nash-Sutcliffe efficiency (NSE) and the root mean squared difference (RMSD), the order of overall performance of the models from best to worst were: ensemble mean of models (0.61, 64), PT-JPL (0.59, 66), SEBS (0.42, 84), PM (0.26, 105) and AA (0.18, 105) [statistics stated as (NSE, RMSD in Wm-2)]. Although PT-JPL uses a relatively simple and largely empirical formulation of the evaporative process, the technique showed improved performance compared to PM, possibly due to its partitioning of total evaporation (canopy transpiration, soil evaporation, wet canopy evaporation) and lower uncertainties in the required forcing data. The SEBS model showed low performance over tall and heterogeneous canopies, which was likely a consequence of the effects of the roughness sub-layer parameterization employed in this scheme. However, SEBS performed well overall. Relative to PT-JPL and SEBS, the PM and AA showed low performance over the majority of sites, due to their sensitivity to the parameterization of resistances. Importantly, it should be noted that no single model was consistently best across all biomes. Indeed, this outcome highlights the need for further evaluation of each model's structure and parameterizations to identify sensitivities and their

  13. Multi-site evaluation of terrestrial evaporation models using FLUXNET data

    KAUST Repository

    Ershadi, Ali

    2014-04-01

    We evaluated the performance of four commonly applied land surface evaporation models using a high-quality dataset of selected FLUXNET towers. The models that were examined include an energy balance approach (Surface Energy Balance System; SEBS), a combination-type technique (single-source Penman-Monteith; PM), a complementary method (advection-aridity; AA) and a radiation based approach (modified Priestley-Taylor; PT-JPL). Twenty FLUXNET towers were selected based upon satisfying stringent forcing data requirements and representing a wide range of biomes. These towers encompassed a number of grassland, cropland, shrubland, evergreen needleleaf forest and deciduous broadleaf forest sites. Based on the mean value of the Nash-Sutcliffe efficiency (NSE) and the root mean squared difference (RMSD), the order of overall performance of the models from best to worst were: ensemble mean of models (0.61, 64), PT-JPL (0.59, 66), SEBS (0.42, 84), PM (0.26, 105) and AA (0.18, 105) [statistics stated as (NSE, RMSD in Wm-2)]. Although PT-JPL uses a relatively simple and largely empirical formulation of the evaporative process, the technique showed improved performance compared to PM, possibly due to its partitioning of total evaporation (canopy transpiration, soil evaporation, wet canopy evaporation) and lower uncertainties in the required forcing data. The SEBS model showed low performance over tall and heterogeneous canopies, which was likely a consequence of the effects of the roughness sub-layer parameterization employed in this scheme. However, SEBS performed well overall. Relative to PT-JPL and SEBS, the PM and AA showed low performance over the majority of sites, due to their sensitivity to the parameterization of resistances. Importantly, it should be noted that no single model was consistently best across all biomes. Indeed, this outcome highlights the need for further evaluation of each model\\'s structure and parameterizations to identify sensitivities and their

  14. Allometric growth and allocation in forests: a perspective from FLUXNET.

    Science.gov (United States)

    Wolf, Adam; Field, Christopher B; Berry, Joseph A

    2011-07-01

    To develop a scheme for partitioning the products of photosynthesis toward different biomass components in land-surface models, a database on component mass and net primary productivity (NPP), collected from FLUXNET sites, was examined to determine allometric patterns of allocation. We found that NPP per individual of foliage (Gfol), stem and branches (Gstem), coarse roots (Gcroot) and fine roots (Gfroot) in individual trees is largely explained (r2 = 67-91%) by the magnitude of total NPP per individual (G). Gfol scales with G isometrically, meaning it is a fixed fraction of G ( 25%). Root-shoot trade-offs were manifest as a slow decline in Gfroot, as a fraction of G, from 50% to 25% as stands increased in biomass, with Gstem and Gcroot increasing as a consequence. These results indicate that a functional trade-off between aboveground and belowground allocation is essentially captured by variations in G, which itself is largely governed by stand biomass and only secondarily by site-specific resource availability. We argue that forests are characterized by strong competition for light, observed as a race for individual trees to ascend by increasing partitioning toward wood, rather than by growing more leaves, and that this competition stronglyconstrains the allocational plasticity that trees may be capable of. The residual variation in partitioning was not related to climatic or edaphic factors, nor did plots with nutrient or water additions show a pattern of partitioning distinct from that predicted by G alone. These findings leverage short-term process studies of the terrestrial carbon cycle to improve decade-scale predictions of biomass accumulation in forests. An algorithm for calculating partitioning in land-surface models is presented.

  15. Reconciling leaf physiological traits and canopy flux data: Use of the TRY and FLUXNET databases in the Community Land Model version 4

    Science.gov (United States)

    Bonan, Gordon B.; Oleson, Keith W.; Fisher, Rosie A.; Lasslop, Gitta; Reichstein, Markus

    2012-06-01

    The Community Land Model version 4 overestimates gross primary production (GPP) compared with estimates from FLUXNET eddy covariance towers. The revised model of Bonan et al. (2011) is consistent with FLUXNET, but values for the leaf-level photosynthetic parameterVcmaxthat yield realistic GPP at the canopy-scale are lower than observed in the global synthesis of Kattge et al. (2009), except for tropical broadleaf evergreen trees. We investigate this discrepancy betweenVcmaxand canopy fluxes. A multilayer model with explicit calculation of light absorption and photosynthesis for sunlit and shaded leaves at depths in the canopy gives insight to the scale mismatch between leaf and canopy. We evaluate the model with light-response curves at individual FLUXNET towers and with empirically upscaled annual GPP. Biases in the multilayer canopy with observedVcmaxare similar, or improved, compared with the standard two-leaf canopy and its lowVcmax, though the Amazon is an exception. The difference relates to light absorption by shaded leaves in the two-leaf canopy, and resulting higher photosynthesis when the canopy scaling parameterKn is low, but observationally constrained. Larger Kndecreases shaded leaf photosynthesis and reduces the difference between the two-leaf and multilayer canopies. The low modelVcmaxis diagnosed from nitrogen reduction of GPP in simulations with carbon-nitrogen biogeochemistry. Our results show that the imposed nitrogen reduction compensates for deficiency in the two-leaf canopy that produces high GPP. Leaf trait databases (Vcmax), within-canopy profiles of photosynthetic capacity (Kn), tower fluxes, and empirically upscaled fields provide important complementary information for model evaluation.

  16. Derivation and analysis of cross relations of photosynthesis and respiration across at FLUXNET sites for model improvement

    Science.gov (United States)

    Lasslop, G.; Reichstein, M.; Papale, D.; Richardson, A. D.

    2009-12-01

    The FLUXNET database provides measurements of the net ecosystem exchange (NEE) of carbon across vegetation types and climate regions. To simplify the interpretation in terms of processes the net exchange is frequently split up into the two main components: gross primary production (GPP) and ecosystem respiration (Reco). A strong relation between these two fluxes related derived from eddy covariance data was found across temporal scales and is to be expected as variation in recent photosynthesis is known to be correlated with root respiration; plants use energy from photosynthesis to drive the metabolism. At long time scales, substrate availability (constrained by past productivity) limits the whole-ecosystem respiration. Previous studies exploring this relationship relied on GPP and Reco estimates derived from the same data, this may lead to spurious correlation that must not be interpreted ecologically. In this study we use two estimates derived from disjunct datasets, one based on daytime data, the other on nighttime data and explore the reliability and robustness of this relationship. We find distinct relationship between the two, varying between vegetation types but also across temporal and spatial scales. We also infer that spatial and temporal variability of net ecosystem exchange is driven by GPP in many cases. Exceptions to this rule include for example disturbed sites. We advocate that for model calibration and evaluation not only the fluxes itself but also robust patterns between fluxes that can be extracted from the database, for instance between the flux components, should be considered.

  17. Fluxes all of the time? A primer on the temporal representativeness of FLUXNET

    Science.gov (United States)

    Chu, Housen; Baldocchi, Dennis D.; John, Ranjeet; Wolf, Sebastian; Reichstein, Markus

    2017-02-01

    FLUXNET, the global network of eddy covariance flux towers, provides the largest synthesized data set of CO2, H2O, and energy fluxes. To achieve the ultimate goal of providing flux information "everywhere and all of the time," studies have attempted to address the representativeness issue, i.e., whether measurements taken in a set of given locations and measurement periods can be extrapolated to a space- and time-explicit extent (e.g., terrestrial globe, 1982-2013 climatological baseline). This study focuses on the temporal representativeness of FLUXNET and tests whether site-specific measurement periods are sufficient to capture the natural variability of climatological and biological conditions. FLUXNET is unevenly representative across sites in terms of the measurement lengths and potentials of extrapolation in time. Similarity of driver conditions among years generally enables the extrapolation of flux information beyond measurement periods. Yet such extrapolation potentials are further constrained by site-specific variability of driver conditions. Several driver variables such as air temperature, diurnal temperature range, potential evapotranspiration, and normalized difference vegetation index had detectable trends and/or breakpoints within the baseline period, and flux measurements generally covered similar and biased conditions in those drivers. About 38% and 60% of FLUXNET sites adequately sampled the mean conditions and interannual variability of all driver conditions, respectively. For long-record sites (≥15 years) the percentages increased to 59% and 69%, respectively. However, the justification of temporal representativeness should not rely solely on the lengths of measurements. Whenever possible, site-specific consideration (e.g., trend, breakpoint, and interannual variability in drivers) should be taken into account.

  18. Proteomics dataset

    DEFF Research Database (Denmark)

    Bennike, Tue Bjerg; Carlsen, Thomas Gelsing; Ellingsen, Torkell

    2017-01-01

    The datasets presented in this article are related to the research articles entitled “Neutrophil Extracellular Traps in Ulcerative Colitis: A Proteome Analysis of Intestinal Biopsies” (Bennike et al., 2015 [1]), and “Proteome Analysis of Rheumatoid Arthritis Gut Mucosa” (Bennike et al., 2017 [2])...... been deposited to the ProteomeXchange Consortium via the PRIDE partner repository with the dataset identifiers PXD001608 for ulcerative colitis and control samples, and PXD003082 for rheumatoid arthritis samples....

  19. Data Assimilation at FLUXNET to Improve Models towards Ecological Forecasting (Invited)

    Science.gov (United States)

    Luo, Y.

    2009-12-01

    Dramatically increased volumes of data from observational and experimental networks such as FLUXNET call for transformation of ecological research to increase its emphasis on quantitative forecasting. Ecological forecasting will also meet the societal need to develop better strategies for natural resource management in a world of ongoing global change. Traditionally, ecological forecasting has been based on process-based models, informed by data in largely ad hoc ways. Although most ecological models incorporate some representation of mechanistic processes, today’s ecological models are generally not adequate to quantify real-world dynamics and provide reliable forecasts with accompanying estimates of uncertainty. A key tool to improve ecological forecasting is data assimilation, which uses data to inform initial conditions and to help constrain a model during simulation to yield results that approximate reality as closely as possible. In an era with dramatically increased availability of data from observational and experimental networks, data assimilation is a key technique that helps convert the raw data into ecologically meaningful products so as to accelerate our understanding of ecological processes, test ecological theory, forecast changes in ecological services, and better serve the society. This talk will use examples to illustrate how data from FLUXNET have been assimilated with process-based models to improve estimates of model parameters and state variables; to quantify uncertainties in ecological forecasting arising from observations, models and their interactions; and to evaluate information contributions of data and model toward short- and long-term forecasting of ecosystem responses to global change.

  20. Deriving global parameter estimates for the Noah land surface model using FLUXNET and machine learning

    Science.gov (United States)

    Chaney, Nathaniel W.; Herman, Jonathan D.; Ek, Michael B.; Wood, Eric F.

    2016-11-01

    With their origins in numerical weather prediction and climate modeling, land surface models aim to accurately partition the surface energy balance. An overlooked challenge in these schemes is the role of model parameter uncertainty, particularly at unmonitored sites. This study provides global parameter estimates for the Noah land surface model using 85 eddy covariance sites in the global FLUXNET network. The at-site parameters are first calibrated using a Latin Hypercube-based ensemble of the most sensitive parameters, determined by the Sobol method, to be the minimum stomatal resistance (rs,min), the Zilitinkevich empirical constant (Czil), and the bare soil evaporation exponent (fxexp). Calibration leads to an increase in the mean Kling-Gupta Efficiency performance metric from 0.54 to 0.71. These calibrated parameter sets are then related to local environmental characteristics using the Extra-Trees machine learning algorithm. The fitted Extra-Trees model is used to map the optimal parameter sets over the globe at a 5 km spatial resolution. The leave-one-out cross validation of the mapped parameters using the Noah land surface model suggests that there is the potential to skillfully relate calibrated model parameter sets to local environmental characteristics. The results demonstrate the potential to use FLUXNET to tune the parameterizations of surface fluxes in land surface models and to provide improved parameter estimates over the globe.

  1. Correcting surface solar radiation of two data assimilation systems against FLUXNET observations in North America

    Science.gov (United States)

    Zhao, Lei; Lee, Xuhui; Liu, Shoudong

    2013-09-01

    Solar radiation at the Earth's surface is an important driver of meteorological and ecological processes. The objective of this study is to evaluate the accuracy of the reanalysis solar radiation produced by NARR (North American Regional Reanalysis) and MERRA (Modern-Era Retrospective Analysis for Research and Applications) against the FLUXNET measurements in North America. We found that both assimilation systems systematically overestimated the surface solar radiation flux on the monthly and annual scale, with an average bias error of +37.2 Wm-2 for NARR and of +20.2 Wm-2 for MERRA. The bias errors were larger under cloudy skies than under clear skies. A postreanalysis algorithm consisting of empirical relationships between model bias, a clearness index, and site elevation was proposed to correct the model errors. Results show that the algorithm can remove the systematic bias errors for both FLUXNET calibration sites (sites used to establish the algorithm) and independent validation sites. After correction, the average annual mean bias errors were reduced to +1.3 Wm-2 for NARR and +2.7 Wm-2 for MERRA. Applying the correction algorithm to the global domain of MERRA brought the global mean surface incoming shortwave radiation down by 17.3 W m-2 to 175.5 W m-2. Under the constraint of the energy balance, other radiation and energy balance terms at the Earth's surface, estimated from independent global data products, also support the need for a downward adjustment of the MERRA surface solar radiation.

  2. Proteomics dataset

    DEFF Research Database (Denmark)

    Bennike, Tue Bjerg; Carlsen, Thomas Gelsing; Ellingsen, Torkell

    2017-01-01

    patients (Morgan et al., 2012; Abraham and Medzhitov, 2011; Bennike, 2014) [8–10. Therefore, we characterized the proteome of colon mucosa biopsies from 10 inflammatory bowel disease ulcerative colitis (UC) patients, 11 gastrointestinal healthy rheumatoid arthritis (RA) patients, and 10 controls. We...... been deposited to the ProteomeXchange Consortium via the PRIDE partner repository with the dataset identifiers PXD001608 for ulcerative colitis and control samples, and PXD003082 for rheumatoid arthritis samples....

  3. A comprehensive evaluation of two MODIS evapotranspiration products over the conterminous United States: using point and gridded FLUXNET and water balance ET

    Science.gov (United States)

    Velpuri, Naga M.; Senay, Gabriel B.; Singh, Ramesh K.; Bohms, Stefanie; Verdin, James P.

    2013-01-01

    Remote sensing datasets are increasingly being used to provide spatially explicit large scale evapotranspiration (ET) estimates. Extensive evaluation of such large scale estimates is necessary before they can be used in various applications. In this study, two monthly MODIS 1 km ET products, MODIS global ET (MOD16) and Operational Simplified Surface Energy Balance (SSEBop) ET, are validated over the conterminous United States at both point and basin scales. Point scale validation was performed using eddy covariance FLUXNET ET (FLET) data (2001–2007) aggregated by year, land cover, elevation and climate zone. Basin scale validation was performed using annual gridded FLUXNET ET (GFET) and annual basin water balance ET (WBET) data aggregated by various hydrologic unit code (HUC) levels. Point scale validation using monthly data aggregated by years revealed that the MOD16 ET and SSEBop ET products showed overall comparable annual accuracies. For most land cover types, both ET products showed comparable results. However, SSEBop showed higher performance for Grassland and Forest classes; MOD16 showed improved performance in the Woody Savanna class. Accuracy of both the ET products was also found to be comparable over different climate zones. However, SSEBop data showed higher skill score across the climate zones covering the western United States. Validation results at different HUC levels over 2000–2011 using GFET as a reference indicate higher accuracies for MOD16 ET data. MOD16, SSEBop and GFET data were validated against WBET (2000–2009), and results indicate that both MOD16 and SSEBop ET matched the accuracies of the global GFET dataset at different HUC levels. Our results indicate that both MODIS ET products effectively reproduced basin scale ET response (up to 25% uncertainty) compared to CONUS-wide point-based ET response (up to 50–60% uncertainty) illustrating the reliability of MODIS ET products for basin-scale ET estimation. Results from this research

  4. Ecosystem carbon and radiative fluxes: a global synthesis based on the FLUXNET network.

    Science.gov (United States)

    Cescatti, A.

    2009-04-01

    Solar radiation is the most important environmental factor driving the temporal and spatial variability of the gross primary productivity (GPP) in terrestrial ecosystems. At the ecosystem scale, the light use efficiency (LUE) depends not only on radiation quantity but also on radiation "quality" both in terms of spectral composition and angular distribution. The day-to-day variations in LUE are largely determined by changes in the ratio of diffuse to total radiation. The relative importance of the concurrent variation in total incoming radiation and in LUE is essential to estimate the sign and the magnitude of the GPP sensitivity to radiation. Despite the scientific relevance of this issue, a global assessment on the sensitivity of GPP to the variations of Phar is still missing. Such an analysis is needed to improve our understanding of the current and future impacts of aerosols and cloud cover on the spatio-temporal variability of GPP. The current availability of ecosystem carbon fluxes, together with separate measurements of incoming direct and diffuse Phar at a large number of flux sites, offers the unique opportunity to extend the previous investigation, both in terms of ecosystem, spatial and climate coverage, and to address questions about the internal (e.g. leaf area index, canopy structure) and external (e.g. cloudiness, covarying meteorology) factors affecting the ecosystem sensitivity to radiation geometry. For this purpose half-hourly measurements of carbon fluxes and radiation have been analyzed at about 220 flux sites for a total of about 660 site-years. This analysis demonstrates that the sensitivity of GPP to incoming radiation varies across the different plant functional types and is correlated with the leaf area index and the local climatology. In particular, the sensitivity of GPP to changes in incoming diffuse light maximizes for the broadleaved forests of the Northern Hemisphere.

  5. Global parameterization and validation of a two-leaf light use efficiency model for predicting gross primary production across FLUXNET sites

    Czech Academy of Sciences Publication Activity Database

    Zhou, Y.; Wu, X.; Weiming, J.; Chen, J.; Wang, S.; Wang, H.; Wenping, Y.; Black, T. A.; Jassal, R.; Ibrom, A.; Han, S.; Yan, J.; Margolis, H.; Roupsard, O.; Li, Y.; Zhao, F.; Kiely, G.; Starr, G.; Pavelka, Marian; Montagnani, L.; Wohlfahrt, G.; D'Odorico, P.; Cook, D.; Altaf Arain, M.; Bonal, D.; Beringer, J.; Blanken, P. D.; Loubet, B.; Leclerc, M. Y.; Matteucci, G.; Nagy, Z.; Olejnik, Janusz; U., K. T. P.; Varlagin, A.

    2016-01-01

    Roč. 36, č. 7 (2016), s. 2743-2760 ISSN 2169-8953 Institutional support: RVO:67179843 Keywords : global parametrization * predicting model * FlUXNET Subject RIV: EH - Ecology, Behaviour Impact factor: 3.395, year: 2016

  6. Intercomparison of MODIS Albedo Retrievals and In Situ Measurements Across the Global FLUXNET Network

    Science.gov (United States)

    Cescatti, Alessandro; Marcolla, Barbara; Vannan, Suresh K. Santhana; Pan, Jerry Yun; Roman, Miguel O.; Yang, Xiaoyuan; Ciais, Philippe; Cook, Robert B.; Law, Beverly E.; Matteucci, Girogio; hide

    2012-01-01

    Surface albedo is a key parameter in the Earth's energy balance since it affects the amount of solar radiation directly absorbed at the planet surface. Its variability in time and space can be globally retrieved through the use of remote sensing products. To evaluate and improve the quality of satellite retrievals, careful intercomparisons with in situ measurements of surface albedo are crucial. For this purpose we compared MODIS albedo retrievals with surface measurements taken at 53 FLUXNET sites that met strict conditions of land cover homogeneity. A good agreement between mean yearly values of satellite retrievals and in situ measurements was found (R(exp 2)= 0.82). The mismatch is correlated to the spatial heterogeneity of surface albedo, stressing the relevance of land cover homogeneity when comparing point to pixel data. When the seasonal patterns of MODIS albedo is considered for different plant functional types, the match with surface observation is extremely good at all forest sites. On the contrary, in non-forest sites satellite retrievals underestimate in situ measurements across the seasonal cycle. The mismatch observed at grasslands and croplands sites is likely due to the extreme fragmentation of these landscapes, as confirmed by geostatistical attributes derived from high resolution scenes.

  7. Seasonal variation of photosynthetic model parameters and leaf area index from global Fluxnet eddy covariance data

    Science.gov (United States)

    Groenendijk, M.; Dolman, A. J.; Ammann, C.; Arneth, A.; Cescatti, A.; Dragoni, D.; Gash, J. H. C.; Gianelle, D.; Gioli, B.; Kiely, G.; Knohl, A.; Law, B. E.; Lund, M.; Marcolla, B.; van der Molen, M. K.; Montagnani, L.; Moors, E.; Richardson, A. D.; Roupsard, O.; Verbeeck, H.; Wohlfahrt, G.

    2011-12-01

    Global vegetation models require the photosynthetic parameters, maximum carboxylation capacity (Vcm), and quantum yield (α) to parameterize their plant functional types (PFTs). The purpose of this work is to determine how much the scaling of the parameters from leaf to ecosystem level through a seasonally varying leaf area index (LAI) explains the parameter variation within and between PFTs. Using Fluxnet data, we simulate a seasonally variable LAIF for a large range of sites, comparable to the LAIM derived from MODIS. There are discrepancies when LAIF reach zero levels and LAIM still provides a small positive value. We find that temperature is the most common constraint for LAIF in 55% of the simulations, while global radiation and vapor pressure deficit are the key constraints for 18% and 27% of the simulations, respectively, while large differences in this forcing still exist when looking at specific PFTs. Despite these differences, the annual photosynthesis simulations are comparable when using LAIF or LAIM (r2 = 0.89). We investigated further the seasonal variation of ecosystem-scale parameters derived with LAIF. Vcm has the largest seasonal variation. This holds for all vegetation types and climates. The parameter α is less variable. By including ecosystem-scale parameter seasonality we can explain a considerable part of the ecosystem-scale parameter variation between PFTs. The remaining unexplained leaf-scale PFT variation still needs further work, including elucidating the precise role of leaf and soil level nitrogen.

  8. Analytical treatment of the relationships between soil heat flux/net radiation ratio and vegetation indices

    International Nuclear Information System (INIS)

    Kustas, W.P.; Daughtry, C.S.T.; Oevelen, P.J. van

    1993-01-01

    Relationships between leaf area index (LAI) and midday soil heat flux/net radiation ratio (G/R n ) and two more commonly used vegetation indices (VIs) were used to analytically derive formulas describing the relationship between G/R n and VI. Use of VI for estimating G/R n may be useful in operational remote sensing models that evaluate the spatial variation in the surface energy balance over large areas. While previous experimental data have shown that linear equations can adequately describe the relationship between G/Rn and VI, this analytical treatment indicated that nonlinear relationships are more appropriate. Data over bare soil and soybeans under a range of canopy cover conditions from a humid climate and data collected over bare soil, alfalfa, and cotton fields in an arid climate were used to evaluate model formulations derived for LAI and G/R n , LAI and VI, and VI and G/R n . In general, equations describing LAI-G/R n and LAI-VI relationships agreed with the data and supported the analytical result of a nonlinear relationship between VI and G/R n . With the simple ratio (NIR/Red) as the VI, the nonlinear relationship with G/R n was confirmed qualitatively. But with the normalized difference vegetation index (NDVI), a nonlinear relationship did not appear to fit the data. (author)

  9. The Synthesis Approach to Analysing Educational Design Dataset: Application of Three Scaffolds to a Learning by Design Task for Postgraduate Education Students

    Science.gov (United States)

    Thompson, Kate; Carvalho, Lucila; Aditomo, Anindito; Dimitriadis, Yannis; Dyke, Gregory; Evans, Michael A.; Khosronejad, Maryam; Martinez-Maldonado, Roberto; Reimann, Peter; Wardak, Dewa

    2015-01-01

    The aims of the Synthesis and Scaffolding Project were to understand: the role of specific scaffolds in relation to the activity of learners, and the activity of learners during a collaborative design task from multiple perspectives, through the collection and analysis of multiple streams of data and the adoption of a synthesis approach to the…

  10. Evaluation of Daytime Evaporative Fraction from MODIS TOA Radiances Using FLUXNET Observations

    Directory of Open Access Journals (Sweden)

    Jian Peng

    2014-06-01

    Full Text Available In recent decades, the land surface temperature/vegetation index (LST/NDVI feature space has been widely used to estimate actual evapotranspiration (ETa or evaporative fraction (EF, defined as the ratio of latent heat flux to surface available energy. Traditionally, it is essential to pre-process satellite top of atmosphere (TOA radiances to obtain LST before estimating EF. However, pre-processing TOA radiances is a cumbersome task including corrections for atmospheric, adjacency and directional effects. Based on the contextual relationship between LST and NDVI, some studies proposed the direct use of TOA radiances instead of satellite retrieved LST products to estimate EF, and found that use of TOA radiances is applicable in some regional studies. The purpose of the present study is to test the robustness of the TOA radiances based EF estimation scheme over different climatic and surface conditions. Flux measurements from 16 FLUXNET (a global network of eddy covariance towers sites were used to validate the Moderate Resolution Imaging Spectro radiometer (MODIS TOA radiances estimated daytime EF. It is found that the EF estimates perform well across a wide variety of climate and biome types—Grasslands, crops, cropland/natural vegetation mosaic, closed shrublands, mixed forest, deciduous broadleaf forest, and savannas. The overall mean bias error (BIAS, mean absolute difference (MAD, root mean square difference (RMSD and correlation coefficient (R values for all the sites are 0.018, 0.147, 0.178 and 0.590, respectively, which are comparable with published results in the literature. We conclude that the direct use of measured TOA radiances instead of LST to estimate daytime EF can avoid complex atmospheric corrections associated with the satellite derived products, and would facilitate the relevant applications where minimum pre-processing is important.

  11. Water-stress-induced breakdown of carbon-water relations: indicators from diurnal FLUXNET patterns

    Science.gov (United States)

    Nelson, Jacob A.; Carvalhais, Nuno; Migliavacca, Mirco; Reichstein, Markus; Jung, Martin

    2018-04-01

    Understanding of terrestrial carbon and water cycles is currently hampered by an uncertainty in how to capture the large variety of plant responses to drought. In FLUXNET, the global network of CO2 and H2O flux observations, many sites do not uniformly report the ancillary variables needed to study drought response physiology. To this end, we outline two data-driven indicators based on diurnal energy, water, and carbon flux patterns derived directly from the eddy covariance data and based on theorized physiological responses to hydraulic and non-stomatal limitations. Hydraulic limitations (i.e. intra-plant limitations on water movement) are proxied using the relative diurnal centroid (CET*), which measures the degree to which the flux of evapotranspiration (ET) is shifted toward the morning. Non-stomatal limitations (e.g. inhibitions of biochemical reactions, RuBisCO activity, and/or mesophyll conductance) are characterized by the Diurnal Water-Carbon Index (DWCI), which measures the degree of coupling between ET and gross primary productivity (GPP) within each day. As a proof of concept we show the response of the metrics at six European sites during the 2003 heat wave event, showing a varied response of morning shifts and decoupling. Globally, we found indications of hydraulic limitations in the form of significantly high frequencies of morning-shifted days in dry/Mediterranean climates and savanna/evergreen plant functional types (PFTs), whereas high frequencies of decoupling were dominated by dry climates and grassland/savanna PFTs indicating a prevalence of non-stomatal limitations in these ecosystems. Overall, both the diurnal centroid and DWCI were associated with high net radiation and low latent energy typical of drought. Using three water use efficiency (WUE) models, we found the mean differences between expected and observed WUE to be -0.09 to 0.44 µmol mmol-1 and -0.29 to -0.40 µmol mmol-1 for decoupled and morning-shifted days, respectively, compared

  12. EPA Nanorelease Dataset

    Data.gov (United States)

    U.S. Environmental Protection Agency — EPA Nanorelease Dataset. This dataset is associated with the following publication: Wohlleben, W., C. Kingston, J. Carter, E. Sahle-Demessie, S. Vazquez-Campos, B....

  13. A data-driven analysis of energy balance closure across FLUXNET research sites: The role of landscape scale heterogeneity

    DEFF Research Database (Denmark)

    Stoy, Paul C.; Mauder, Matthias; Foken, Thomas

    2013-01-01

    approached 1. These results suggest that landscape-level heterogeneity in vegetation and topography cannot be ignored as a contributor to incomplete energy balance closure at the flux network level, although net radiation measurements, biological energy assimilation, unmeasured storage terms......The energy balance at most surface-atmosphere flux research sites remains unclosed. The mechanisms underlying the discrepancy between measured energy inputs and outputs across the global FLUXNET tower network are still under debate. Recent reviews have identified exchange processes and turbulent...... motions at large spatial and temporal scales in heterogeneous landscapes as the primary cause of the lack of energy balance closure at some intensively-researched sites, while unmeasured storage terms cannot be ruled out as a dominant contributor to the lack of energy balance closure at many other sites...

  14. FLUXNET: A new tool to study the temporal and spatial variability of ecosystem-scale carbon dioxide, water vapor, and energy flux densities

    DEFF Research Database (Denmark)

    Baldocchi, D.; Falge, E.; Gu, L.

    2001-01-01

    FLUXNET is a global network of micrometeorological flux measurement site's that measure the exchanges of carbon dioxide, water vapor, and energy between the biosphere and atmosphere. At present over 140 sites are operating on a long-term and continuous basis. Vegetation under study includes...... of annual ecosystem carbon and water balances, to quantify the response of stand-scale carbon dioxide and water vapor flux densities to controlling biotic and abiotic factors, and to validate a hierarchy of soil-plant-atmosphere trace gas exchange models. Findings so far include 1) net CO2 exchange......, it provides infrastructure for compiling, archiving, and distributing carbon, water, and energy flux measurement, and meteorological, plant, and soil data to the science community. (Data and site information are available online at the FLUXNET Web site, http://www-eosdis.oml.gov/FLUXNTET/.) Second...

  15. Net ecosystem productivity of temperate and boreal forests after clearcutting a Fluxnet-Canada measurement and modelling synthesis

    Energy Technology Data Exchange (ETDEWEB)

    Grant, R. F. (Dept. of Renewable Resources, Univ. of Alberta, Edmonton, (Canada)), e-mail: robert.grant@ales.ualberta.ca; Barr, A. G. (Climate Research Branch, Meteorological Service of Canada, Saskatoon (Canada)); Black, T. A. (Faculty of Land and Food Systems, Univ. of British Columbia, Vancouver BC, (Canada)); Margolis, H. A. (Faculte de Foresterie et de Geomatique, Pavillon Abitibi-Price, Universite Laval, Quebec (Canada)); McCaughey, J. H. (Dept. of Geography, Queen' s Univ., Kingston (Canada)); Trofymow, J. A. (Canadian Forest Service, Pacific Forestry Centre, Victoria (Canada))

    2010-11-15

    Clearcutting strongly affects subsequent forest net ecosystem productivity (NEP). Hypotheses for ecological controls on NEP in the ecosystem model ecosys were tested with CO{sub 2} fluxes measured by eddy covariance (EC) in three post clearcut conifer chronosequences in different ecological zones across Canada. In the model, microbial colonization of postharvest fine and woody debris drove heterotrophic respiration (Rh), and hence decomposition, microbial growth, N mineralization and asymbiotic N{sub 2} fixation. These processes controlled root N uptake, and thereby CO{sub 2} fixation in regrowing vegetation. Interactions among soil and plant processes allowed the model to simulate hourly CO{sub 2} fluxes and annual NEP within the uncertainty of EC measurements from 2003 to 2007 over forest stands from 1 to 80 yr of age in all three chronosequences without site- or species-specific parameterization. The model was then used to study the impacts of increasing harvest removals on subsequent C stocks at one of the chronosequence sites. Model results indicated that increasing harvest removals would hasten recovery of NEP during the first 30 yr after clearcutting, but would reduce ecosystem C stocks by about 15% of the increased removals at the end of an 80-yr harvest cycle

  16. Aaron Journal article datasets

    Data.gov (United States)

    U.S. Environmental Protection Agency — All figures used in the journal article are in netCDF format. This dataset is associated with the following publication: Sims, A., K. Alapaty , and S. Raman....

  17. Integrated Surface Dataset (Global)

    Data.gov (United States)

    National Oceanic and Atmospheric Administration, Department of Commerce — The Integrated Surface (ISD) Dataset (ISD) is composed of worldwide surface weather observations from over 35,000 stations, though the best spatial coverage is...

  18. Control Measure Dataset

    Data.gov (United States)

    U.S. Environmental Protection Agency — The EPA Control Measure Dataset is a collection of documents describing air pollution control available to regulated facilities for the control and abatement of air...

  19. National Hydrography Dataset (NHD)

    Data.gov (United States)

    Kansas Data Access and Support Center — The National Hydrography Dataset (NHD) is a feature-based database that interconnects and uniquely identifies the stream segments or reaches that comprise the...

  20. Market Squid Ecology Dataset

    Data.gov (United States)

    National Oceanic and Atmospheric Administration, Department of Commerce — This dataset contains ecological information collected on the major adult spawning and juvenile habitats of market squid off California and the US Pacific Northwest....

  1. Tables and figure datasets

    Data.gov (United States)

    U.S. Environmental Protection Agency — Soil and air concentrations of asbestos in Sumas study. This dataset is associated with the following publication: Wroble, J., T. Frederick, A. Frame, and D....

  2. Isfahan MISP Dataset.

    Science.gov (United States)

    Kashefpur, Masoud; Kafieh, Rahele; Jorjandi, Sahar; Golmohammadi, Hadis; Khodabande, Zahra; Abbasi, Mohammadreza; Teifuri, Nilufar; Fakharzadeh, Ali Akbar; Kashefpoor, Maryam; Rabbani, Hossein

    2017-01-01

    An online depository was introduced to share clinical ground truth with the public and provide open access for researchers to evaluate their computer-aided algorithms. PHP was used for web programming and MySQL for database managing. The website was entitled "biosigdata.com." It was a fast, secure, and easy-to-use online database for medical signals and images. Freely registered users could download the datasets and could also share their own supplementary materials while maintaining their privacies (citation and fee). Commenting was also available for all datasets, and automatic sitemap and semi-automatic SEO indexing have been set for the site. A comprehensive list of available websites for medical datasets is also presented as a Supplementary (http://journalonweb.com/tempaccess/4800.584.JMSS_55_16I3253.pdf).

  3. Mridangam stroke dataset

    OpenAIRE

    CompMusic

    2014-01-01

    The audio examples were recorded from a professional Carnatic percussionist in a semi-anechoic studio conditions by Akshay Anantapadmanabhan using SM-58 microphones and an H4n ZOOM recorder. The audio was sampled at 44.1 kHz and stored as 16 bit wav files. The dataset can be used for training models for each Mridangam stroke. /n/nA detailed description of the Mridangam and its strokes can be found in the paper below. A part of the dataset was used in the following paper. /nAkshay Anantapadman...

  4. The GTZAN dataset

    DEFF Research Database (Denmark)

    Sturm, Bob L.

    2013-01-01

    The GTZAN dataset appears in at least 100 published works, and is the most-used public dataset for evaluation in machine listening research for music genre recognition (MGR). Our recent work, however, shows GTZAN has several faults (repetitions, mislabelings, and distortions), which challenge...... of GTZAN, and provide a catalog of its faults. We review how GTZAN has been used in MGR research, and find few indications that its faults have been known and considered. Finally, we rigorously study the effects of its faults on evaluating five different MGR systems. The lesson is not to banish GTZAN...

  5. Dataset - Adviesregel PPL 2010

    NARCIS (Netherlands)

    Evert, van F.K.; Schans, van der D.A.; Geel, van W.C.A.; Slabbekoorn, J.J.; Booij, R.; Jukema, J.N.; Meurs, E.J.J.; Uenk, D.

    2011-01-01

    This dataset contains experimental data from a number of field experiments with potato in The Netherlands (Van Evert et al., 2011). The data are presented as an SQL dump of a PostgreSQL database (version 8.4.4). An outline of the entity-relationship diagram of the database is given in an

  6. National Elevation Dataset

    Science.gov (United States)

    ,

    2002-01-01

    The National Elevation Dataset (NED) is a new raster product assembled by the U.S. Geological Survey. NED is designed to provide National elevation data in a seamless form with a consistent datum, elevation unit, and projection. Data corrections were made in the NED assembly process to minimize artifacts, perform edge matching, and fill sliver areas of missing data. NED has a resolution of one arc-second (approximately 30 meters) for the conterminous United States, Hawaii, Puerto Rico and the island territories and a resolution of two arc-seconds for Alaska. NED data sources have a variety of elevation units, horizontal datums, and map projections. In the NED assembly process the elevation values are converted to decimal meters as a consistent unit of measure, NAD83 is consistently used as horizontal datum, and all the data are recast in a geographic projection. Older DEM's produced by methods that are now obsolete have been filtered during the NED assembly process to minimize artifacts that are commonly found in data produced by these methods. Artifact removal greatly improves the quality of the slope, shaded-relief, and synthetic drainage information that can be derived from the elevation data. Figure 2 illustrates the results of this artifact removal filtering. NED processing also includes steps to adjust values where adjacent DEM's do not match well, and to fill sliver areas of missing data between DEM's. These processing steps ensure that NED has no void areas and artificial discontinuities have been minimized. The artifact removal filtering process does not eliminate all of the artifacts. In areas where the only available DEM is produced by older methods, then "striping" may still occur.

  7. NP-PAH Interaction Dataset

    Data.gov (United States)

    U.S. Environmental Protection Agency — Dataset presents concentrations of organic pollutants, such as polyaromatic hydrocarbon compounds, in water samples. Water samples of known volume and concentration...

  8. Editorial: Datasets for Learning Analytics

    NARCIS (Netherlands)

    Dietze, Stefan; George, Siemens; Davide, Taibi; Drachsler, Hendrik

    2018-01-01

    The European LinkedUp and LACE (Learning Analytics Community Exchange) project have been responsible for setting up a series of data challenges at the LAK conferences 2013 and 2014 around the LAK dataset. The LAK datasets consists of a rich collection of full text publications in the domain of

  9. Open University Learning Analytics dataset.

    Science.gov (United States)

    Kuzilek, Jakub; Hlosta, Martin; Zdrahal, Zdenek

    2017-11-28

    Learning Analytics focuses on the collection and analysis of learners' data to improve their learning experience by providing informed guidance and to optimise learning materials. To support the research in this area we have developed a dataset, containing data from courses presented at the Open University (OU). What makes the dataset unique is the fact that it contains demographic data together with aggregated clickstream data of students' interactions in the Virtual Learning Environment (VLE). This enables the analysis of student behaviour, represented by their actions. The dataset contains the information about 22 courses, 32,593 students, their assessment results, and logs of their interactions with the VLE represented by daily summaries of student clicks (10,655,280 entries). The dataset is freely available at https://analyse.kmi.open.ac.uk/open_dataset under a CC-BY 4.0 license.

  10. Turkey Run Landfill Emissions Dataset

    Data.gov (United States)

    U.S. Environmental Protection Agency — landfill emissions measurements for the Turkey run landfill in Georgia. This dataset is associated with the following publication: De la Cruz, F., R. Green, G....

  11. Dataset of NRDA emission data

    Data.gov (United States)

    U.S. Environmental Protection Agency — Emissions data from open air oil burns. This dataset is associated with the following publication: Gullett, B., J. Aurell, A. Holder, B. Mitchell, D. Greenwell, M....

  12. Chemical product and function dataset

    Data.gov (United States)

    U.S. Environmental Protection Agency — Merged product weight fraction and chemical function data. This dataset is associated with the following publication: Isaacs , K., M. Goldsmith, P. Egeghy , K....

  13. Global parameterization and validation of a two-leaf light use efficiency model for predicting gross primary production across FLUXNET sites

    DEFF Research Database (Denmark)

    Zhou, Yanlian; Wu, Xiaocui; Ju, Weimin

    2015-01-01

    Light use efficiency (LUE) models are widely used to simulate gross primary production (GPP). However, the treatment of the plant canopy as a big leaf by these models can introduce large uncertainties in simulated GPP. Recently, a two-leaf light use efficiency (TL-LUE) model was developed...... to simulate GPP separately for sunlit and shaded leaves and has been shown to outperform the big-leaf MOD17 model at six FLUX sites in China. In this study we investigated the performance of the TL-LUE model for a wider range of biomes. For this we optimized the parameters and tested the TL-LUE model using...... data from 98 FLUXNET sites which are distributed across the globe. The results showed that the TL-LUE model performed in general better than the MOD17 model in simulating 8 day GPP. Optimized maximum light use efficiency of shaded leaves (epsilon(msh)) was 2.63 to 4.59 times that of sunlit leaves...

  14. The NOAA Dataset Identifier Project

    Science.gov (United States)

    de la Beaujardiere, J.; Mccullough, H.; Casey, K. S.

    2013-12-01

    The US National Oceanic and Atmospheric Administration (NOAA) initiated a project in 2013 to assign persistent identifiers to datasets archived at NOAA and to create informational landing pages about those datasets. The goals of this project are to enable the citation of datasets used in products and results in order to help provide credit to data producers, to support traceability and reproducibility, and to enable tracking of data usage and impact. A secondary goal is to encourage the submission of datasets for long-term preservation, because only archived datasets will be eligible for a NOAA-issued identifier. A team was formed with representatives from the National Geophysical, Oceanographic, and Climatic Data Centers (NGDC, NODC, NCDC) to resolve questions including which identifier scheme to use (answer: Digital Object Identifier - DOI), whether or not to embed semantics in identifiers (no), the level of granularity at which to assign identifiers (as coarsely as reasonable), how to handle ongoing time-series data (do not break into chunks), creation mechanism for the landing page (stylesheet from formal metadata record preferred), and others. Decisions made and implementation experience gained will inform the writing of a Data Citation Procedural Directive to be issued by the Environmental Data Management Committee in 2014. Several identifiers have been issued as of July 2013, with more on the way. NOAA is now reporting the number as a metric to federal Open Government initiatives. This paper will provide further details and status of the project.

  15. The Harvard organic photovoltaic dataset.

    Science.gov (United States)

    Lopez, Steven A; Pyzer-Knapp, Edward O; Simm, Gregor N; Lutzow, Trevor; Li, Kewei; Seress, Laszlo R; Hachmann, Johannes; Aspuru-Guzik, Alán

    2016-09-27

    The Harvard Organic Photovoltaic Dataset (HOPV15) presented in this work is a collation of experimental photovoltaic data from the literature, and corresponding quantum-chemical calculations performed over a range of conformers, each with quantum chemical results using a variety of density functionals and basis sets. It is anticipated that this dataset will be of use in both relating electronic structure calculations to experimental observations through the generation of calibration schemes, as well as for the creation of new semi-empirical methods and the benchmarking of current and future model chemistries for organic electronic applications.

  16. The Harvard organic photovoltaic dataset

    Science.gov (United States)

    Lopez, Steven A.; Pyzer-Knapp, Edward O.; Simm, Gregor N.; Lutzow, Trevor; Li, Kewei; Seress, Laszlo R.; Hachmann, Johannes; Aspuru-Guzik, Alán

    2016-01-01

    The Harvard Organic Photovoltaic Dataset (HOPV15) presented in this work is a collation of experimental photovoltaic data from the literature, and corresponding quantum-chemical calculations performed over a range of conformers, each with quantum chemical results using a variety of density functionals and basis sets. It is anticipated that this dataset will be of use in both relating electronic structure calculations to experimental observations through the generation of calibration schemes, as well as for the creation of new semi-empirical methods and the benchmarking of current and future model chemistries for organic electronic applications. PMID:27676312

  17. Querying Large Biological Network Datasets

    Science.gov (United States)

    Gulsoy, Gunhan

    2013-01-01

    New experimental methods has resulted in increasing amount of genetic interaction data to be generated every day. Biological networks are used to store genetic interaction data gathered. Increasing amount of data available requires fast large scale analysis methods. Therefore, we address the problem of querying large biological network datasets.…

  18. Estimation of the soil heat flux/net radiation ratio based on spectral vegetation indexes in high-latitude Arctic areas

    International Nuclear Information System (INIS)

    Jacobsen, A.; Hansen, B.U.

    1999-01-01

    The vegetation communities in the Arctic environment are very sensitive to even minor climatic variations and therefore the estimation of surface energy fluxes from high-latitude vegetated areas is an important subject to be pursued. This study was carried out in July-August and used micro meteorological data, spectral reflectance signatures, and vegetation biomass to establish the relation between the soil heat flux/net radiation (G / Rn) ratio and spectral vegetation indices (SVIs). Continuous measurements of soil temperature and soil heat flux were used to calculate the surface ground heat flux by use of conventional methods, and the relation to surface temperature was investigated. Twenty-seven locations were established, and six samples per location, including the measurement of the surface temperature and net radiation to establish the G/Rn ratio and simultaneous spectral reflectance signatures and wet biomass estimates, were registered. To obtain regional reliability, the locations were chosen in order to represent the different Arctic vegetation communities in the study area; ranging from dry tundra vegetation communities (fell fields and dry dwarf scrubs) to moist/wet tundra vegetation communities (snowbeds, grasslands and fens). Spectral vegetation indices, including the simple ratio vegetation index (RVI) and the normalized difference vegetation index (NDVI), were calculated. A comparison of SVIs to biomass proved that RVI gave the best linear expression, and NDVI the best exponential expression. A comparison of SVIs and the surface energy flux ratio G / Rn proved that NDVI gave the best linear expression. SPOT HRV images from July 1989 and 1992 were used to map NDVI and G / Rn at a regional scale. (author)

  19. Global parameterization and validation of a two-leaf light use efficiency model for predicting gross primary production across FLUXNET sites: TL-LUE Parameterization and Validation

    Energy Technology Data Exchange (ETDEWEB)

    Zhou, Yanlian [Jiangsu Provincial Key Laboratory of Geographic Information Science and Technology, School of Geographic and Oceanographic Sciences, Nanjing University, Nanjing China; Joint Center for Global Change Studies, Beijing China; Wu, Xiaocui [International Institute for Earth System Sciences, Nanjing University, Nanjing China; Joint Center for Global Change Studies, Beijing China; Ju, Weimin [International Institute for Earth System Sciences, Nanjing University, Nanjing China; Jiangsu Center for Collaborative Innovation in Geographic Information Resource Development and Application, Nanjing China; Chen, Jing M. [International Institute for Earth System Sciences, Nanjing University, Nanjing China; Joint Center for Global Change Studies, Beijing China; Wang, Shaoqiang [Key Laboratory of Ecosystem Network Observation and Modeling, Institute of Geographic Sciences and Natural Resources Research, Chinese Academy of Science, Beijing China; Wang, Huimin [Key Laboratory of Ecosystem Network Observation and Modeling, Institute of Geographic Sciences and Natural Resources Research, Chinese Academy of Science, Beijing China; Yuan, Wenping [State Key Laboratory of Earth Surface Processes and Resource Ecology, Future Earth Research Institute, Beijing Normal University, Beijing China; Andrew Black, T. [Faculty of Land and Food Systems, University of British Columbia, Vancouver British Columbia Canada; Jassal, Rachhpal [Faculty of Land and Food Systems, University of British Columbia, Vancouver British Columbia Canada; Ibrom, Andreas [Department of Environmental Engineering, Technical University of Denmark (DTU), Kgs. Lyngby Denmark; Han, Shijie [Institute of Applied Ecology, Chinese Academy of Sciences, Shenyang China; Yan, Junhua [South China Botanical Garden, Chinese Academy of Sciences, Guangzhou China; Margolis, Hank [Centre for Forest Studies, Faculty of Forestry, Geography and Geomatics, Laval University, Quebec City Quebec Canada; Roupsard, Olivier [CIRAD-Persyst, UMR Ecologie Fonctionnelle and Biogéochimie des Sols et Agroécosystèmes, SupAgro-CIRAD-INRA-IRD, Montpellier France; CATIE (Tropical Agricultural Centre for Research and Higher Education), Turrialba Costa Rica; Li, Yingnian [Northwest Institute of Plateau Biology, Chinese Academy of Sciences, Xining China; Zhao, Fenghua [Key Laboratory of Ecosystem Network Observation and Modeling, Institute of Geographic Sciences and Natural Resources Research, Chinese Academy of Science, Beijing China; Kiely, Gerard [Environmental Research Institute, Civil and Environmental Engineering Department, University College Cork, Cork Ireland; Starr, Gregory [Department of Biological Sciences, University of Alabama, Tuscaloosa Alabama USA; Pavelka, Marian [Laboratory of Plants Ecological Physiology, Institute of Systems Biology and Ecology AS CR, Prague Czech Republic; Montagnani, Leonardo [Forest Services, Autonomous Province of Bolzano, Bolzano Italy; Faculty of Sciences and Technology, Free University of Bolzano, Bolzano Italy; Wohlfahrt, Georg [Institute for Ecology, University of Innsbruck, Innsbruck Austria; European Academy of Bolzano, Bolzano Italy; D' Odorico, Petra [Grassland Sciences Group, Institute of Agricultural Sciences, ETH Zurich Switzerland; Cook, David [Atmospheric and Climate Research Program, Environmental Science Division, Argonne National Laboratory, Argonne Illinois USA; Arain, M. Altaf [McMaster Centre for Climate Change and School of Geography and Earth Sciences, McMaster University, Hamilton Ontario Canada; Bonal, Damien [INRA Nancy, UMR EEF, Champenoux France; Beringer, Jason [School of Earth and Environment, The University of Western Australia, Crawley Australia; Blanken, Peter D. [Department of Geography, University of Colorado Boulder, Boulder Colorado USA; Loubet, Benjamin [UMR ECOSYS, INRA, AgroParisTech, Université Paris-Saclay, Thiverval-Grignon France; Leclerc, Monique Y. [Department of Crop and Soil Sciences, College of Agricultural and Environmental Sciences, University of Georgia, Athens Georgia USA; Matteucci, Giorgio [Viea San Camillo Ed LellisViterbo, University of Tuscia, Viterbo Italy; Nagy, Zoltan [MTA-SZIE Plant Ecology Research Group, Szent Istvan University, Godollo Hungary; Olejnik, Janusz [Meteorology Department, Poznan University of Life Sciences, Poznan Poland; Department of Matter and Energy Fluxes, Global Change Research Center, Brno Czech Republic; Paw U, Kyaw Tha [Department of Land, Air and Water Resources, University of California, Davis California USA; Joint Program on the Science and Policy of Global Change, Massachusetts Institute of Technology, Cambridge USA; Varlagin, Andrej [A.N. Severtsov Institute of Ecology and Evolution, Russian Academy of Sciences, Moscow Russia

    2016-04-06

    Light use efficiency (LUE) models are widely used to simulate gross primary production (GPP). However, the treatment of the plant canopy as a big leaf by these models can introduce large uncertainties in simulated GPP. Recently, a two-leaf light use efficiency (TL-LUE) model was developed to simulate GPP separately for sunlit and shaded leaves and has been shown to outperform the big-leaf MOD17 model at 6 FLUX sites in China. In this study we investigated the performance of the TL-LUE model for a wider range of biomes. For this we optimized the parameters and tested the TL-LUE model using data from 98 FLUXNET sites which are distributed across the globe. The results showed that the TL-LUE model performed in general better than the MOD17 model in simulating 8-day GPP. Optimized maximum light use efficiency of shaded leaves (εmsh) was 2.63 to 4.59 times that of sunlit leaves (εmsu). Generally, the relationships of εmsh and εmsu with εmax were well described by linear equations, indicating the existence of general patterns across biomes. GPP simulated by the TL-LUE model was much less sensitive to biases in the photosynthetically active radiation (PAR) input than the MOD17 model. The results of this study suggest that the proposed TL-LUE model has the potential for simulating regional and global GPP of terrestrial ecosystems and it is more robust with regard to usual biases in input data than existing approaches which neglect the bi-modal within-canopy distribution of PAR.

  20. CERC Dataset (Full Hadza Data)

    DEFF Research Database (Denmark)

    2016-01-01

    The dataset includes demographic, behavioral, and religiosity data from eight different populations from around the world. The samples were drawn from: (1) Coastal and (2) Inland Tanna, Vanuatu; (3) Hadzaland, Tanzania; (4) Lovu, Fiji; (5) Pointe aux Piment, Mauritius; (6) Pesqueiro, Brazil; (7......) Kyzyl, Tyva Republic; and (8) Yasawa, Fiji. Related publication: Purzycki, et al. (2016). Moralistic Gods, Supernatural Punishment and the Expansion of Human Sociality. Nature, 530(7590): 327-330....

  1. Viking Seismometer PDS Archive Dataset

    Science.gov (United States)

    Lorenz, R. D.

    2016-12-01

    The Viking Lander 2 seismometer operated successfully for over 500 Sols on the Martian surface, recording at least one likely candidate Marsquake. The Viking mission, in an era when data handling hardware (both on board and on the ground) was limited in capability, predated modern planetary data archiving, and ad-hoc repositories of the data, and the very low-level record at NSSDC, were neither convenient to process nor well-known. In an effort supported by the NASA Mars Data Analysis Program, we have converted the bulk of the Viking dataset (namely the 49,000 and 270,000 records made in High- and Event- modes at 20 and 1 Hz respectively) into a simple ASCII table format. Additionally, since wind-generated lander motion is a major component of the signal, contemporaneous meteorological data are included in summary records to facilitate correlation. These datasets are being archived at the PDS Geosciences Node. In addition to brief instrument and dataset descriptions, the archive includes code snippets in the freely-available language 'R' to demonstrate plotting and analysis. Further, we present examples of lander-generated noise, associated with the sampler arm, instrument dumps and other mechanical operations.

  2. PHYSICS PERFORMANCE AND DATASET (PPD)

    CERN Multimedia

    L. Silvestris

    2013-01-01

    The first part of the Long Shutdown period has been dedicated to the preparation of the samples for the analysis targeting the summer conferences. In particular, the 8 TeV data acquired in 2012, including most of the “parked datasets”, have been reconstructed profiting from improved alignment and calibration conditions for all the sub-detectors. A careful planning of the resources was essential in order to deliver the datasets well in time to the analysts, and to schedule the update of all the conditions and calibrations needed at the analysis level. The newly reprocessed data have undergone detailed scrutiny by the Dataset Certification team allowing to recover some of the data for analysis usage and further improving the certification efficiency, which is now at 91% of the recorded luminosity. With the aim of delivering a consistent dataset for 2011 and 2012, both in terms of conditions and release (53X), the PPD team is now working to set up a data re-reconstruction and a new MC pro...

  3. RARD: The Related-Article Recommendation Dataset

    OpenAIRE

    Beel, Joeran; Carevic, Zeljko; Schaible, Johann; Neusch, Gabor

    2017-01-01

    Recommender-system datasets are used for recommender-system evaluations, training machine-learning algorithms, and exploring user behavior. While there are many datasets for recommender systems in the domains of movies, books, and music, there are rather few datasets from research-paper recommender systems. In this paper, we introduce RARD, the Related-Article Recommendation Dataset, from the digital library Sowiport and the recommendation-as-a-service provider Mr. DLib. The dataset contains ...

  4. Passive Containment DataSet

    Science.gov (United States)

    This data is for Figures 6 and 7 in the journal article. The data also includes the two EPANET input files used for the analysis described in the paper, one for the looped system and one for the block system.This dataset is associated with the following publication:Grayman, W., R. Murray , and D. Savic. Redesign of Water Distribution Systems for Passive Containment of Contamination. JOURNAL OF THE AMERICAN WATER WORKS ASSOCIATION. American Water Works Association, Denver, CO, USA, 108(7): 381-391, (2016).

  5. The CMS dataset bookkeeping service

    Science.gov (United States)

    Afaq, A.; Dolgert, A.; Guo, Y.; Jones, C.; Kosyakov, S.; Kuznetsov, V.; Lueking, L.; Riley, D.; Sekhri, V.

    2008-07-01

    The CMS Dataset Bookkeeping Service (DBS) has been developed to catalog all CMS event data from Monte Carlo and Detector sources. It provides the ability to identify MC or trigger source, track data provenance, construct datasets for analysis, and discover interesting data. CMS requires processing and analysis activities at various service levels and the DBS system provides support for localized processing or private analysis, as well as global access for CMS users at large. Catalog entries can be moved among the various service levels with a simple set of migration tools, thus forming a loose federation of databases. DBS is available to CMS users via a Python API, Command Line, and a Discovery web page interfaces. The system is built as a multi-tier web application with Java servlets running under Tomcat, with connections via JDBC to Oracle or MySQL database backends. Clients connect to the service through HTTP or HTTPS with authentication provided by GRID certificates and authorization through VOMS. DBS is an integral part of the overall CMS Data Management and Workflow Management systems.

  6. The CMS dataset bookkeeping service

    Energy Technology Data Exchange (ETDEWEB)

    Afaq, A; Guo, Y; Kosyakov, S; Lueking, L; Sekhri, V [Fermilab, Batavia, Illinois 60510 (United States); Dolgert, A; Jones, C; Kuznetsov, V; Riley, D [Cornell University, Ithaca, New York 14850 (United States)

    2008-07-15

    The CMS Dataset Bookkeeping Service (DBS) has been developed to catalog all CMS event data from Monte Carlo and Detector sources. It provides the ability to identify MC or trigger source, track data provenance, construct datasets for analysis, and discover interesting data. CMS requires processing and analysis activities at various service levels and the DBS system provides support for localized processing or private analysis, as well as global access for CMS users at large. Catalog entries can be moved among the various service levels with a simple set of migration tools, thus forming a loose federation of databases. DBS is available to CMS users via a Python API, Command Line, and a Discovery web page interfaces. The system is built as a multi-tier web application with Java servlets running under Tomcat, with connections via JDBC to Oracle or MySQL database backends. Clients connect to the service through HTTP or HTTPS with authentication provided by GRID certificates and authorization through VOMS. DBS is an integral part of the overall CMS Data Management and Workflow Management systems.

  7. The CMS dataset bookkeeping service

    International Nuclear Information System (INIS)

    Afaq, A; Guo, Y; Kosyakov, S; Lueking, L; Sekhri, V; Dolgert, A; Jones, C; Kuznetsov, V; Riley, D

    2008-01-01

    The CMS Dataset Bookkeeping Service (DBS) has been developed to catalog all CMS event data from Monte Carlo and Detector sources. It provides the ability to identify MC or trigger source, track data provenance, construct datasets for analysis, and discover interesting data. CMS requires processing and analysis activities at various service levels and the DBS system provides support for localized processing or private analysis, as well as global access for CMS users at large. Catalog entries can be moved among the various service levels with a simple set of migration tools, thus forming a loose federation of databases. DBS is available to CMS users via a Python API, Command Line, and a Discovery web page interfaces. The system is built as a multi-tier web application with Java servlets running under Tomcat, with connections via JDBC to Oracle or MySQL database backends. Clients connect to the service through HTTP or HTTPS with authentication provided by GRID certificates and authorization through VOMS. DBS is an integral part of the overall CMS Data Management and Workflow Management systems

  8. The CMS dataset bookkeeping service

    International Nuclear Information System (INIS)

    Afaq, Anzar; Dolgert, Andrew; Guo, Yuyi; Jones, Chris; Kosyakov, Sergey; Kuznetsov, Valentin; Lueking, Lee; Riley, Dan; Sekhri, Vijay

    2007-01-01

    The CMS Dataset Bookkeeping Service (DBS) has been developed to catalog all CMS event data from Monte Carlo and Detector sources. It provides the ability to identify MC or trigger source, track data provenance, construct datasets for analysis, and discover interesting data. CMS requires processing and analysis activities at various service levels and the DBS system provides support for localized processing or private analysis, as well as global access for CMS users at large. Catalog entries can be moved among the various service levels with a simple set of migration tools, thus forming a loose federation of databases. DBS is available to CMS users via a Python API, Command Line, and a Discovery web page interfaces. The system is built as a multi-tier web application with Java servlets running under Tomcat, with connections via JDBC to Oracle or MySQL database backends. Clients connect to the service through HTTP or HTTPS with authentication provided by GRID certificates and authorization through VOMS. DBS is an integral part of the overall CMS Data Management and Workflow Management systems

  9. 2008 TIGER/Line Nationwide Dataset

    Data.gov (United States)

    California Natural Resource Agency — This dataset contains a nationwide build of the 2008 TIGER/Line datasets from the US Census Bureau downloaded in April 2009. The TIGER/Line Shapefiles are an extract...

  10. Satellite-Based Precipitation Datasets

    Science.gov (United States)

    Munchak, S. J.; Huffman, G. J.

    2017-12-01

    Of the possible sources of precipitation data, those based on satellites provide the greatest spatial coverage. There is a wide selection of datasets, algorithms, and versions from which to choose, which can be confusing to non-specialists wishing to use the data. The International Precipitation Working Group (IPWG) maintains tables of the major publicly available, long-term, quasi-global precipitation data sets (http://www.isac.cnr.it/ ipwg/data/datasets.html), and this talk briefly reviews the various categories. As examples, NASA provides two sets of quasi-global precipitation data sets: the older Tropical Rainfall Measuring Mission (TRMM) Multi-satellite Precipitation Analysis (TMPA) and current Integrated Multi-satellitE Retrievals for Global Precipitation Measurement (GPM) mission (IMERG). Both provide near-real-time and post-real-time products that are uniformly gridded in space and time. The TMPA products are 3-hourly 0.25°x0.25° on the latitude band 50°N-S for about 16 years, while the IMERG products are half-hourly 0.1°x0.1° on 60°N-S for over 3 years (with plans to go to 16+ years in Spring 2018). In addition to the precipitation estimates, each data set provides fields of other variables, such as the satellite sensor providing estimates and estimated random error. The discussion concludes with advice about determining suitability for use, the necessity of being clear about product names and versions, and the need for continued support for satellite- and surface-based observation.

  11. PHYSICS PERFORMANCE AND DATASET (PPD)

    CERN Multimedia

    L. Silvestris

    2012-01-01

      Introduction The first part of the year presented an important test for the new Physics Performance and Dataset (PPD) group (cf. its mandate: http://cern.ch/go/8f77). The activity was focused on the validation of the new releases meant for the Monte Carlo (MC) production and the data-processing in 2012 (CMSSW 50X and 52X), and on the preparation of the 2012 operations. In view of the Chamonix meeting, the PPD and physics groups worked to understand the impact of the higher pile-up scenario on some of the flagship Higgs analyses to better quantify the impact of the high luminosity on the CMS physics potential. A task force is working on the optimisation of the reconstruction algorithms and on the code to cope with the performance requirements imposed by the higher event occupancy as foreseen for 2012. Concerning the preparation for the analysis of the new data, a new MC production has been prepared. The new samples, simulated at 8 TeV, are already being produced and the digitisation and recons...

  12. Pattern Analysis On Banking Dataset

    Directory of Open Access Journals (Sweden)

    Amritpal Singh

    2015-06-01

    Full Text Available Abstract Everyday refinement and development of technology has led to an increase in the competition between the Tech companies and their going out of way to crack the system andbreak down. Thus providing Data mining a strategically and security-wise important area for many business organizations including banking sector. It allows the analyzes of important information in the data warehouse and assists the banks to look for obscure patterns in a group and discover unknown relationship in the data.Banking systems needs to process ample amount of data on daily basis related to customer information their credit card details limit and collateral details transaction details risk profiles Anti Money Laundering related information trade finance data. Thousands of decisionsbased on the related data are taken in a bank daily. This paper analyzes the banking dataset in the weka environment for the detection of interesting patterns based on its applications ofcustomer acquisition customer retention management and marketing and management of risk fraudulence detections.

  13. PHYSICS PERFORMANCE AND DATASET (PPD)

    CERN Multimedia

    L. Silvestris

    2013-01-01

    The PPD activities, in the first part of 2013, have been focused mostly on the final physics validation and preparation for the data reprocessing of the full 8 TeV datasets with the latest calibrations. These samples will be the basis for the preliminary results for summer 2013 but most importantly for the final publications on the 8 TeV Run 1 data. The reprocessing involves also the reconstruction of a significant fraction of “parked data” that will allow CMS to perform a whole new set of precision analyses and searches. In this way the CMSSW release 53X is becoming the legacy release for the 8 TeV Run 1 data. The regular operation activities have included taking care of the prolonged proton-proton data taking and the run with proton-lead collisions that ended in February. The DQM and Data Certification team has deployed a continuous effort to promptly certify the quality of the data. The luminosity-weighted certification efficiency (requiring all sub-detectors to be certified as usab...

  14. The Geometry of Finite Equilibrium Datasets

    DEFF Research Database (Denmark)

    Balasko, Yves; Tvede, Mich

    We investigate the geometry of finite datasets defined by equilibrium prices, income distributions, and total resources. We show that the equilibrium condition imposes no restrictions if total resources are collinear, a property that is robust to small perturbations. We also show that the set...... of equilibrium datasets is pathconnected when the equilibrium condition does impose restrictions on datasets, as for example when total resources are widely non collinear....

  15. IPCC Socio-Economic Baseline Dataset

    Data.gov (United States)

    National Aeronautics and Space Administration — The Intergovernmental Panel on Climate Change (IPCC) Socio-Economic Baseline Dataset consists of population, human development, economic, water resources, land...

  16. Veterans Affairs Suicide Prevention Synthetic Dataset

    Data.gov (United States)

    Department of Veterans Affairs — The VA's Veteran Health Administration, in support of the Open Data Initiative, is providing the Veterans Affairs Suicide Prevention Synthetic Dataset (VASPSD). The...

  17. Nanoparticle-organic pollutant interaction dataset

    Data.gov (United States)

    U.S. Environmental Protection Agency — Dataset presents concentrations of organic pollutants, such as polyaromatic hydrocarbon compounds, in water samples. Water samples of known volume and concentration...

  18. An Annotated Dataset of 14 Meat Images

    DEFF Research Database (Denmark)

    Stegmann, Mikkel Bille

    2002-01-01

    This note describes a dataset consisting of 14 annotated images of meat. Points of correspondence are placed on each image. As such, the dataset can be readily used for building statistical models of shape. Further, format specifications and terms of use are given.......This note describes a dataset consisting of 14 annotated images of meat. Points of correspondence are placed on each image. As such, the dataset can be readily used for building statistical models of shape. Further, format specifications and terms of use are given....

  19. SIMADL: Simulated Activities of Daily Living Dataset

    Directory of Open Access Journals (Sweden)

    Talal Alshammari

    2018-04-01

    Full Text Available With the realisation of the Internet of Things (IoT paradigm, the analysis of the Activities of Daily Living (ADLs, in a smart home environment, is becoming an active research domain. The existence of representative datasets is a key requirement to advance the research in smart home design. Such datasets are an integral part of the visualisation of new smart home concepts as well as the validation and evaluation of emerging machine learning models. Machine learning techniques that can learn ADLs from sensor readings are used to classify, predict and detect anomalous patterns. Such techniques require data that represent relevant smart home scenarios, for training, testing and validation. However, the development of such machine learning techniques is limited by the lack of real smart home datasets, due to the excessive cost of building real smart homes. This paper provides two datasets for classification and anomaly detection. The datasets are generated using OpenSHS, (Open Smart Home Simulator, which is a simulation software for dataset generation. OpenSHS records the daily activities of a participant within a virtual environment. Seven participants simulated their ADLs for different contexts, e.g., weekdays, weekends, mornings and evenings. Eighty-four files in total were generated, representing approximately 63 days worth of activities. Forty-two files of classification of ADLs were simulated in the classification dataset and the other forty-two files are for anomaly detection problems in which anomalous patterns were simulated and injected into the anomaly detection dataset.

  20. ASSISTments Dataset from Multiple Randomized Controlled Experiments

    Science.gov (United States)

    Selent, Douglas; Patikorn, Thanaporn; Heffernan, Neil

    2016-01-01

    In this paper, we present a dataset consisting of data generated from 22 previously and currently running randomized controlled experiments inside the ASSISTments online learning platform. This dataset provides data mining opportunities for researchers to analyze ASSISTments data in a convenient format across multiple experiments at the same time.…

  1. Synthetic and Empirical Capsicum Annuum Image Dataset

    NARCIS (Netherlands)

    Barth, R.

    2016-01-01

    This dataset consists of per-pixel annotated synthetic (10500) and empirical images (50) of Capsicum annuum, also known as sweet or bell pepper, situated in a commercial greenhouse. Furthermore, the source models to generate the synthetic images are included. The aim of the datasets are to

  2. Design of an audio advertisement dataset

    Science.gov (United States)

    Fu, Yutao; Liu, Jihong; Zhang, Qi; Geng, Yuting

    2015-12-01

    Since more and more advertisements swarm into radios, it is necessary to establish an audio advertising dataset which could be used to analyze and classify the advertisement. A method of how to establish a complete audio advertising dataset is presented in this paper. The dataset is divided into four different kinds of advertisements. Each advertisement's sample is given in *.wav file format, and annotated with a txt file which contains its file name, sampling frequency, channel number, broadcasting time and its class. The classifying rationality of the advertisements in this dataset is proved by clustering the different advertisements based on Principal Component Analysis (PCA). The experimental results show that this audio advertisement dataset offers a reliable set of samples for correlative audio advertisement experimental studies.

  3. The Kinetics Human Action Video Dataset

    OpenAIRE

    Kay, Will; Carreira, Joao; Simonyan, Karen; Zhang, Brian; Hillier, Chloe; Vijayanarasimhan, Sudheendra; Viola, Fabio; Green, Tim; Back, Trevor; Natsev, Paul; Suleyman, Mustafa; Zisserman, Andrew

    2017-01-01

    We describe the DeepMind Kinetics human action video dataset. The dataset contains 400 human action classes, with at least 400 video clips for each action. Each clip lasts around 10s and is taken from a different YouTube video. The actions are human focussed and cover a broad range of classes including human-object interactions such as playing instruments, as well as human-human interactions such as shaking hands. We describe the statistics of the dataset, how it was collected, and give some ...

  4. BASE MAP DATASET, LOS ANGELES COUNTY, CALIFORNIA

    Data.gov (United States)

    Federal Emergency Management Agency, Department of Homeland Security — FEMA Framework Basemap datasets comprise six of the seven FGDC themes of geospatial data that are used by most GIS applications (Note: the seventh framework theme,...

  5. BASE MAP DATASET, CHEROKEE COUNTY, SOUTH CAROLINA

    Data.gov (United States)

    Federal Emergency Management Agency, Department of Homeland Security — FEMA Framework Basemap datasets comprise six of the seven FGDC themes of geospatial data that are used by most GIS applications (Note: the seventh framework theme,...

  6. SIAM 2007 Text Mining Competition dataset

    Data.gov (United States)

    National Aeronautics and Space Administration — Subject Area: Text Mining Description: This is the dataset used for the SIAM 2007 Text Mining competition. This competition focused on developing text mining...

  7. Harvard Aging Brain Study : Dataset and accessibility

    NARCIS (Netherlands)

    Dagley, Alexander; LaPoint, Molly; Huijbers, Willem; Hedden, Trey; McLaren, Donald G.; Chatwal, Jasmeer P.; Papp, Kathryn V.; Amariglio, Rebecca E.; Blacker, Deborah; Rentz, Dorene M.; Johnson, Keith A.; Sperling, Reisa A.; Schultz, Aaron P.

    2017-01-01

    The Harvard Aging Brain Study is sharing its data with the global research community. The longitudinal dataset consists of a 284-subject cohort with the following modalities acquired: demographics, clinical assessment, comprehensive neuropsychological testing, clinical biomarkers, and neuroimaging.

  8. BASE MAP DATASET, HONOLULU COUNTY, HAWAII, USA

    Data.gov (United States)

    Federal Emergency Management Agency, Department of Homeland Security — FEMA Framework Basemap datasets comprise six of the seven FGDC themes of geospatial data that are used by most GIS applications (Note: the seventh framework theme,...

  9. BASE MAP DATASET, EDGEFIELD COUNTY, SOUTH CAROLINA

    Data.gov (United States)

    Federal Emergency Management Agency, Department of Homeland Security — FEMA Framework Basemap datasets comprise six of the seven FGDC themes of geospatial data that are used by most GIS applications (Note: the seventh framework theme,...

  10. Simulation of Smart Home Activity Datasets

    Directory of Open Access Journals (Sweden)

    Jonathan Synnott

    2015-06-01

    Full Text Available A globally ageing population is resulting in an increased prevalence of chronic conditions which affect older adults. Such conditions require long-term care and management to maximize quality of life, placing an increasing strain on healthcare resources. Intelligent environments such as smart homes facilitate long-term monitoring of activities in the home through the use of sensor technology. Access to sensor datasets is necessary for the development of novel activity monitoring and recognition approaches. Access to such datasets is limited due to issues such as sensor cost, availability and deployment time. The use of simulated environments and sensors may address these issues and facilitate the generation of comprehensive datasets. This paper provides a review of existing approaches for the generation of simulated smart home activity datasets, including model-based approaches and interactive approaches which implement virtual sensors, environments and avatars. The paper also provides recommendation for future work in intelligent environment simulation.

  11. Simulation of Smart Home Activity Datasets.

    Science.gov (United States)

    Synnott, Jonathan; Nugent, Chris; Jeffers, Paul

    2015-06-16

    A globally ageing population is resulting in an increased prevalence of chronic conditions which affect older adults. Such conditions require long-term care and management to maximize quality of life, placing an increasing strain on healthcare resources. Intelligent environments such as smart homes facilitate long-term monitoring of activities in the home through the use of sensor technology. Access to sensor datasets is necessary for the development of novel activity monitoring and recognition approaches. Access to such datasets is limited due to issues such as sensor cost, availability and deployment time. The use of simulated environments and sensors may address these issues and facilitate the generation of comprehensive datasets. This paper provides a review of existing approaches for the generation of simulated smart home activity datasets, including model-based approaches and interactive approaches which implement virtual sensors, environments and avatars. The paper also provides recommendation for future work in intelligent environment simulation.

  12. Environmental Dataset Gateway (EDG) REST Interface

    Data.gov (United States)

    U.S. Environmental Protection Agency — Use the Environmental Dataset Gateway (EDG) to find and access EPA's environmental resources. Many options are available for easily reusing EDG content in other...

  13. BASE MAP DATASET, INYO COUNTY, OKLAHOMA

    Data.gov (United States)

    Federal Emergency Management Agency, Department of Homeland Security — FEMA Framework Basemap datasets comprise six of the seven FGDC themes of geospatial data that are used by most GIS applications (Note: the seventh framework theme,...

  14. BASE MAP DATASET, JACKSON COUNTY, OKLAHOMA

    Data.gov (United States)

    Federal Emergency Management Agency, Department of Homeland Security — FEMA Framework Basemap datasets comprise six of the seven FGDC themes of geospatial data that are used by most GIS applications (Note: the seventh framework theme,...

  15. BASE MAP DATASET, SANTA CRIZ COUNTY, CALIFORNIA

    Data.gov (United States)

    Federal Emergency Management Agency, Department of Homeland Security — FEMA Framework Basemap datasets comprise six of the seven FGDC themes of geospatial data that are used by most GIS applications (Note: the seventh framework theme,...

  16. Climate Prediction Center IR 4km Dataset

    Data.gov (United States)

    National Oceanic and Atmospheric Administration, Department of Commerce — CPC IR 4km dataset was created from all available individual geostationary satellite data which have been merged to form nearly seamless global (60N-60S) IR...

  17. BASE MAP DATASET, MAYES COUNTY, OKLAHOMA, USA

    Data.gov (United States)

    Federal Emergency Management Agency, Department of Homeland Security — FEMA Framework Basemap datasets comprise six of the seven FGDC themes of geospatial data that are used by most GIS applications: cadastral, geodetic control,...

  18. BASE MAP DATASET, KINGFISHER COUNTY, OKLAHOMA

    Data.gov (United States)

    Federal Emergency Management Agency, Department of Homeland Security — FEMA Framework Basemap datasets comprise six of the seven FGDC themes of geospatial data that are used by most GIS applications (Note: the seventh framework theme,...

  19. Comparison of recent SnIa datasets

    International Nuclear Information System (INIS)

    Sanchez, J.C. Bueno; Perivolaropoulos, L.; Nesseris, S.

    2009-01-01

    We rank the six latest Type Ia supernova (SnIa) datasets (Constitution (C), Union (U), ESSENCE (Davis) (E), Gold06 (G), SNLS 1yr (S) and SDSS-II (D)) in the context of the Chevalier-Polarski-Linder (CPL) parametrization w(a) = w 0 +w 1 (1−a), according to their Figure of Merit (FoM), their consistency with the cosmological constant (ΛCDM), their consistency with standard rulers (Cosmic Microwave Background (CMB) and Baryon Acoustic Oscillations (BAO)) and their mutual consistency. We find a significant improvement of the FoM (defined as the inverse area of the 95.4% parameter contour) with the number of SnIa of these datasets ((C) highest FoM, (U), (G), (D), (E), (S) lowest FoM). Standard rulers (CMB+BAO) have a better FoM by about a factor of 3, compared to the highest FoM SnIa dataset (C). We also find that the ranking sequence based on consistency with ΛCDM is identical with the corresponding ranking based on consistency with standard rulers ((S) most consistent, (D), (C), (E), (U), (G) least consistent). The ranking sequence of the datasets however changes when we consider the consistency with an expansion history corresponding to evolving dark energy (w 0 ,w 1 ) = (−1.4,2) crossing the phantom divide line w = −1 (it is practically reversed to (G), (U), (E), (S), (D), (C)). The SALT2 and MLCS2k2 fitters are also compared and some peculiar features of the SDSS-II dataset when standardized with the MLCS2k2 fitter are pointed out. Finally, we construct a statistic to estimate the internal consistency of a collection of SnIa datasets. We find that even though there is good consistency among most samples taken from the above datasets, this consistency decreases significantly when the Gold06 (G) dataset is included in the sample

  20. Comparison of Shallow Survey 2012 Multibeam Datasets

    Science.gov (United States)

    Ramirez, T. M.

    2012-12-01

    The purpose of the Shallow Survey common dataset is a comparison of the different technologies utilized for data acquisition in the shallow survey marine environment. The common dataset consists of a series of surveys conducted over a common area of seabed using a variety of systems. It provides equipment manufacturers the opportunity to showcase their latest systems while giving hydrographic researchers and scientists a chance to test their latest algorithms on the dataset so that rigorous comparisons can be made. Five companies collected data for the Common Dataset in the Wellington Harbor area in New Zealand between May 2010 and May 2011; including Kongsberg, Reson, R2Sonic, GeoAcoustics, and Applied Acoustics. The Wellington harbor and surrounding coastal area was selected since it has a number of well-defined features, including the HMNZS South Seas and HMNZS Wellington wrecks, an armored seawall constructed of Tetrapods and Akmons, aquifers, wharves and marinas. The seabed inside the harbor basin is largely fine-grained sediment, with gravel and reefs around the coast. The area outside the harbor on the southern coast is an active environment, with moving sand and exposed reefs. A marine reserve is also in this area. For consistency between datasets, the coastal research vessel R/V Ikatere and crew were used for all surveys conducted for the common dataset. Using Triton's Perspective processing software multibeam datasets collected for the Shallow Survey were processed for detail analysis. Datasets from each sonar manufacturer were processed using the CUBE algorithm developed by the Center for Coastal and Ocean Mapping/Joint Hydrographic Center (CCOM/JHC). Each dataset was gridded at 0.5 and 1.0 meter resolutions for cross comparison and compliance with International Hydrographic Organization (IHO) requirements. Detailed comparisons were made of equipment specifications (transmit frequency, number of beams, beam width), data density, total uncertainty, and

  1. 3DSEM: A 3D microscopy dataset

    Directory of Open Access Journals (Sweden)

    Ahmad P. Tafti

    2016-03-01

    Full Text Available The Scanning Electron Microscope (SEM as a 2D imaging instrument has been widely used in many scientific disciplines including biological, mechanical, and materials sciences to determine the surface attributes of microscopic objects. However the SEM micrographs still remain 2D images. To effectively measure and visualize the surface properties, we need to truly restore the 3D shape model from 2D SEM images. Having 3D surfaces would provide anatomic shape of micro-samples which allows for quantitative measurements and informative visualization of the specimens being investigated. The 3DSEM is a dataset for 3D microscopy vision which is freely available at [1] for any academic, educational, and research purposes. The dataset includes both 2D images and 3D reconstructed surfaces of several real microscopic samples. Keywords: 3D microscopy dataset, 3D microscopy vision, 3D SEM surface reconstruction, Scanning Electron Microscope (SEM

  2. Data Mining for Imbalanced Datasets: An Overview

    Science.gov (United States)

    Chawla, Nitesh V.

    A dataset is imbalanced if the classification categories are not approximately equally represented. Recent years brought increased interest in applying machine learning techniques to difficult "real-world" problems, many of which are characterized by imbalanced data. Additionally the distribution of the testing data may differ from that of the training data, and the true misclassification costs may be unknown at learning time. Predictive accuracy, a popular choice for evaluating performance of a classifier, might not be appropriate when the data is imbalanced and/or the costs of different errors vary markedly. In this Chapter, we discuss some of the sampling techniques used for balancing the datasets, and the performance measures more appropriate for mining imbalanced datasets.

  3. Genomics dataset of unidentified disclosed isolates

    Directory of Open Access Journals (Sweden)

    Bhagwan N. Rekadwad

    2016-09-01

    Full Text Available Analysis of DNA sequences is necessary for higher hierarchical classification of the organisms. It gives clues about the characteristics of organisms and their taxonomic position. This dataset is chosen to find complexities in the unidentified DNA in the disclosed patents. A total of 17 unidentified DNA sequences were thoroughly analyzed. The quick response codes were generated. AT/GC content of the DNA sequences analysis was carried out. The QR is helpful for quick identification of isolates. AT/GC content is helpful for studying their stability at different temperatures. Additionally, a dataset on cleavage code and enzyme code studied under the restriction digestion study, which helpful for performing studies using short DNA sequences was reported. The dataset disclosed here is the new revelatory data for exploration of unique DNA sequences for evaluation, identification, comparison and analysis. Keywords: BioLABs, Blunt ends, Genomics, NEB cutter, Restriction digestion, Short DNA sequences, Sticky ends

  4. Harvard Aging Brain Study: Dataset and accessibility.

    Science.gov (United States)

    Dagley, Alexander; LaPoint, Molly; Huijbers, Willem; Hedden, Trey; McLaren, Donald G; Chatwal, Jasmeer P; Papp, Kathryn V; Amariglio, Rebecca E; Blacker, Deborah; Rentz, Dorene M; Johnson, Keith A; Sperling, Reisa A; Schultz, Aaron P

    2017-01-01

    The Harvard Aging Brain Study is sharing its data with the global research community. The longitudinal dataset consists of a 284-subject cohort with the following modalities acquired: demographics, clinical assessment, comprehensive neuropsychological testing, clinical biomarkers, and neuroimaging. To promote more extensive analyses, imaging data was designed to be compatible with other publicly available datasets. A cloud-based system enables access to interested researchers with blinded data available contingent upon completion of a data usage agreement and administrative approval. Data collection is ongoing and currently in its fifth year. Copyright © 2015 Elsevier Inc. All rights reserved.

  5. Random Coefficient Logit Model for Large Datasets

    NARCIS (Netherlands)

    C. Hernández-Mireles (Carlos); D. Fok (Dennis)

    2010-01-01

    textabstractWe present an approach for analyzing market shares and products price elasticities based on large datasets containing aggregate sales data for many products, several markets and for relatively long time periods. We consider the recently proposed Bayesian approach of Jiang et al [Jiang,

  6. Thesaurus Dataset of Educational Technology in Chinese

    Science.gov (United States)

    Wu, Linjing; Liu, Qingtang; Zhao, Gang; Huang, Huan; Huang, Tao

    2015-01-01

    The thesaurus dataset of educational technology is a knowledge description of educational technology in Chinese. The aims of this thesaurus were to collect the subject terms in the domain of educational technology, facilitate the standardization of terminology and promote the communication between Chinese researchers and scholars from various…

  7. Sharing Video Datasets in Design Research

    DEFF Research Database (Denmark)

    Christensen, Bo; Abildgaard, Sille Julie Jøhnk

    2017-01-01

    This paper examines how design researchers, design practitioners and design education can benefit from sharing a dataset. We present the Design Thinking Research Symposium 11 (DTRS11) as an exemplary project that implied sharing video data of design processes and design activity in natural settings...... with a large group of fellow academics from the international community of Design Thinking Research, for the purpose of facilitating research collaboration and communication within the field of Design and Design Thinking. This approach emphasizes the social and collaborative aspects of design research, where...... a multitude of appropriate perspectives and methods may be utilized in analyzing and discussing the singular dataset. The shared data is, from this perspective, understood as a design object in itself, which facilitates new ways of working, collaborating, studying, learning and educating within the expanding...

  8. Automatic processing of multimodal tomography datasets.

    Science.gov (United States)

    Parsons, Aaron D; Price, Stephen W T; Wadeson, Nicola; Basham, Mark; Beale, Andrew M; Ashton, Alun W; Mosselmans, J Frederick W; Quinn, Paul D

    2017-01-01

    With the development of fourth-generation high-brightness synchrotrons on the horizon, the already large volume of data that will be collected on imaging and mapping beamlines is set to increase by orders of magnitude. As such, an easy and accessible way of dealing with such large datasets as quickly as possible is required in order to be able to address the core scientific problems during the experimental data collection. Savu is an accessible and flexible big data processing framework that is able to deal with both the variety and the volume of data of multimodal and multidimensional scientific datasets output such as those from chemical tomography experiments on the I18 microfocus scanning beamline at Diamond Light Source.

  9. Interpolation of diffusion weighted imaging datasets

    DEFF Research Database (Denmark)

    Dyrby, Tim B; Lundell, Henrik; Burke, Mark W

    2014-01-01

    anatomical details and signal-to-noise-ratio for reliable fibre reconstruction. We assessed the potential benefits of interpolating DWI datasets to a higher image resolution before fibre reconstruction using a diffusion tensor model. Simulations of straight and curved crossing tracts smaller than or equal......Diffusion weighted imaging (DWI) is used to study white-matter fibre organisation, orientation and structural connectivity by means of fibre reconstruction algorithms and tractography. For clinical settings, limited scan time compromises the possibilities to achieve high image resolution for finer...... interpolation methods fail to disentangle fine anatomical details if PVE is too pronounced in the original data. As for validation we used ex-vivo DWI datasets acquired at various image resolutions as well as Nissl-stained sections. Increasing the image resolution by a factor of eight yielded finer geometrical...

  10. Data assimilation and model evaluation experiment datasets

    Science.gov (United States)

    Lai, Chung-Cheng A.; Qian, Wen; Glenn, Scott M.

    1994-01-01

    The Institute for Naval Oceanography, in cooperation with Naval Research Laboratories and universities, executed the Data Assimilation and Model Evaluation Experiment (DAMEE) for the Gulf Stream region during fiscal years 1991-1993. Enormous effort has gone into the preparation of several high-quality and consistent datasets for model initialization and verification. This paper describes the preparation process, the temporal and spatial scopes, the contents, the structure, etc., of these datasets. The goal of DAMEE and the need of data for the four phases of experiment are briefly stated. The preparation of DAMEE datasets consisted of a series of processes: (1) collection of observational data; (2) analysis and interpretation; (3) interpolation using the Optimum Thermal Interpolation System package; (4) quality control and re-analysis; and (5) data archiving and software documentation. The data products from these processes included a time series of 3D fields of temperature and salinity, 2D fields of surface dynamic height and mixed-layer depth, analysis of the Gulf Stream and rings system, and bathythermograph profiles. To date, these are the most detailed and high-quality data for mesoscale ocean modeling, data assimilation, and forecasting research. Feedback from ocean modeling groups who tested this data was incorporated into its refinement. Suggestions for DAMEE data usages include (1) ocean modeling and data assimilation studies, (2) diagnosis and theoretical studies, and (3) comparisons with locally detailed observations.

  11. A hybrid organic-inorganic perovskite dataset

    Science.gov (United States)

    Kim, Chiho; Huan, Tran Doan; Krishnan, Sridevi; Ramprasad, Rampi

    2017-05-01

    Hybrid organic-inorganic perovskites (HOIPs) have been attracting a great deal of attention due to their versatility of electronic properties and fabrication methods. We prepare a dataset of 1,346 HOIPs, which features 16 organic cations, 3 group-IV cations and 4 halide anions. Using a combination of an atomic structure search method and density functional theory calculations, the optimized structures, the bandgap, the dielectric constant, and the relative energies of the HOIPs are uniformly prepared and validated by comparing with relevant experimental and/or theoretical data. We make the dataset available at Dryad Digital Repository, NoMaD Repository, and Khazana Repository (http://khazana.uconn.edu/), hoping that it could be useful for future data-mining efforts that can explore possible structure-property relationships and phenomenological models. Progressive extension of the dataset is expected as new organic cations become appropriate within the HOIP framework, and as additional properties are calculated for the new compounds found.

  12. Quantifying uncertainty in observational rainfall datasets

    Science.gov (United States)

    Lennard, Chris; Dosio, Alessandro; Nikulin, Grigory; Pinto, Izidine; Seid, Hussen

    2015-04-01

    The CO-ordinated Regional Downscaling Experiment (CORDEX) has to date seen the publication of at least ten journal papers that examine the African domain during 2012 and 2013. Five of these papers consider Africa generally (Nikulin et al. 2012, Kim et al. 2013, Hernandes-Dias et al. 2013, Laprise et al. 2013, Panitz et al. 2013) and five have regional foci: Tramblay et al. (2013) on Northern Africa, Mariotti et al. (2014) and Gbobaniyi el al. (2013) on West Africa, Endris et al. (2013) on East Africa and Kalagnoumou et al. (2013) on southern Africa. There also are a further three papers that the authors know about under review. These papers all use an observed rainfall and/or temperature data to evaluate/validate the regional model output and often proceed to assess projected changes in these variables due to climate change in the context of these observations. The most popular reference rainfall data used are the CRU, GPCP, GPCC, TRMM and UDEL datasets. However, as Kalagnoumou et al. (2013) point out there are many other rainfall datasets available for consideration, for example, CMORPH, FEWS, TAMSAT & RIANNAA, TAMORA and the WATCH & WATCH-DEI data. They, with others (Nikulin et al. 2012, Sylla et al. 2012) show that the observed datasets can have a very wide spread at a particular space-time coordinate. As more ground, space and reanalysis-based rainfall products become available, all which use different methods to produce precipitation data, the selection of reference data is becoming an important factor in model evaluation. A number of factors can contribute to a uncertainty in terms of the reliability and validity of the datasets such as radiance conversion algorithims, the quantity and quality of available station data, interpolation techniques and blending methods used to combine satellite and guage based products. However, to date no comprehensive study has been performed to evaluate the uncertainty in these observational datasets. We assess 18 gridded

  13. Development of a SPARK Training Dataset

    Energy Technology Data Exchange (ETDEWEB)

    Sayre, Amanda M. [Pacific Northwest National Lab. (PNNL), Richland, WA (United States); Olson, Jarrod R. [Pacific Northwest National Lab. (PNNL), Richland, WA (United States)

    2015-03-01

    In its first five years, the National Nuclear Security Administration’s (NNSA) Next Generation Safeguards Initiative (NGSI) sponsored more than 400 undergraduate, graduate, and post-doctoral students in internships and research positions (Wyse 2012). In the past seven years, the NGSI program has, and continues to produce a large body of scientific, technical, and policy work in targeted core safeguards capabilities and human capital development activities. Not only does the NGSI program carry out activities across multiple disciplines, but also across all U.S. Department of Energy (DOE)/NNSA locations in the United States. However, products are not readily shared among disciplines and across locations, nor are they archived in a comprehensive library. Rather, knowledge of NGSI-produced literature is localized to the researchers, clients, and internal laboratory/facility publication systems such as the Electronic Records and Information Capture Architecture (ERICA) at the Pacific Northwest National Laboratory (PNNL). There is also no incorporated way of analyzing existing NGSI literature to determine whether the larger NGSI program is achieving its core safeguards capabilities and activities. A complete library of NGSI literature could prove beneficial to a cohesive, sustainable, and more economical NGSI program. The Safeguards Platform for Automated Retrieval of Knowledge (SPARK) has been developed to be a knowledge storage, retrieval, and analysis capability to capture safeguards knowledge to exist beyond the lifespan of NGSI. During the development process, it was necessary to build a SPARK training dataset (a corpus of documents) for initial entry into the system and for demonstration purposes. We manipulated these data to gain new information about the breadth of NGSI publications, and they evaluated the science-policy interface at PNNL as a practical demonstration of SPARK’s intended analysis capability. The analysis demonstration sought to answer the

  14. Development of a SPARK Training Dataset

    International Nuclear Information System (INIS)

    Sayre, Amanda M.; Olson, Jarrod R.

    2015-01-01

    In its first five years, the National Nuclear Security Administration's (NNSA) Next Generation Safeguards Initiative (NGSI) sponsored more than 400 undergraduate, graduate, and post-doctoral students in internships and research positions (Wyse 2012). In the past seven years, the NGSI program has, and continues to produce a large body of scientific, technical, and policy work in targeted core safeguards capabilities and human capital development activities. Not only does the NGSI program carry out activities across multiple disciplines, but also across all U.S. Department of Energy (DOE)/NNSA locations in the United States. However, products are not readily shared among disciplines and across locations, nor are they archived in a comprehensive library. Rather, knowledge of NGSI-produced literature is localized to the researchers, clients, and internal laboratory/facility publication systems such as the Electronic Records and Information Capture Architecture (ERICA) at the Pacific Northwest National Laboratory (PNNL). There is also no incorporated way of analyzing existing NGSI literature to determine whether the larger NGSI program is achieving its core safeguards capabilities and activities. A complete library of NGSI literature could prove beneficial to a cohesive, sustainable, and more economical NGSI program. The Safeguards Platform for Automated Retrieval of Knowledge (SPARK) has been developed to be a knowledge storage, retrieval, and analysis capability to capture safeguards knowledge to exist beyond the lifespan of NGSI. During the development process, it was necessary to build a SPARK training dataset (a corpus of documents) for initial entry into the system and for demonstration purposes. We manipulated these data to gain new information about the breadth of NGSI publications, and they evaluated the science-policy interface at PNNL as a practical demonstration of SPARK's intended analysis capability. The analysis demonstration sought to answer

  15. Developing a Data-Set for Stereopsis

    Directory of Open Access Journals (Sweden)

    D.W Hunter

    2014-08-01

    Full Text Available Current research on binocular stereopsis in humans and non-human primates has been limited by a lack of available data-sets. Current data-sets fall into two categories; stereo-image sets with vergence but no ranging information (Hibbard, 2008, Vision Research, 48(12, 1427-1439 or combinations of depth information with binocular images and video taken from cameras in fixed fronto-parallel configurations exhibiting neither vergence or focus effects (Hirschmuller & Scharstein, 2007, IEEE Conf. Computer Vision and Pattern Recognition. The techniques for generating depth information are also imperfect. Depth information is normally inaccurate or simply missing near edges and on partially occluded surfaces. For many areas of vision research these are the most interesting parts of the image (Goutcher, Hunter, Hibbard, 2013, i-Perception, 4(7, 484; Scarfe & Hibbard, 2013, Vision Research. Using state-of-the-art open-source ray-tracing software (PBRT as a back-end, our intention is to release a set of tools that will allow researchers in this field to generate artificial binocular stereoscopic data-sets. Although not as realistic as photographs, computer generated images have significant advantages in terms of control over the final output and ground-truth information about scene depth is easily calculated at all points in the scene, even partially occluded areas. While individual researchers have been developing similar stimuli by hand for many decades, we hope that our software will greatly reduce the time and difficulty of creating naturalistic binocular stimuli. Our intension in making this presentation is to elicit feedback from the vision community about what sort of features would be desirable in such software.

  16. Quality Controlling CMIP datasets at GFDL

    Science.gov (United States)

    Horowitz, L. W.; Radhakrishnan, A.; Balaji, V.; Adcroft, A.; Krasting, J. P.; Nikonov, S.; Mason, E. E.; Schweitzer, R.; Nadeau, D.

    2017-12-01

    As GFDL makes the switch from model development to production in light of the Climate Model Intercomparison Project (CMIP), GFDL's efforts are shifted to testing and more importantly establishing guidelines and protocols for Quality Controlling and semi-automated data publishing. Every CMIP cycle introduces key challenges and the upcoming CMIP6 is no exception. The new CMIP experimental design comprises of multiple MIPs facilitating research in different focus areas. This paradigm has implications not only for the groups that develop the models and conduct the runs, but also for the groups that monitor, analyze and quality control the datasets before data publishing, before their knowledge makes its way into reports like the IPCC (Intergovernmental Panel on Climate Change) Assessment Reports. In this talk, we discuss some of the paths taken at GFDL to quality control the CMIP-ready datasets including: Jupyter notebooks, PrePARE, LAMP (Linux, Apache, MySQL, PHP/Python/Perl): technology-driven tracker system to monitor the status of experiments qualitatively and quantitatively, provide additional metadata and analysis services along with some in-built controlled-vocabulary validations in the workflow. In addition to this, we also discuss the integration of community-based model evaluation software (ESMValTool, PCMDI Metrics Package, and ILAMB) as part of our CMIP6 workflow.

  17. Integrated remotely sensed datasets for disaster management

    Science.gov (United States)

    McCarthy, Timothy; Farrell, Ronan; Curtis, Andrew; Fotheringham, A. Stewart

    2008-10-01

    Video imagery can be acquired from aerial, terrestrial and marine based platforms and has been exploited for a range of remote sensing applications over the past two decades. Examples include coastal surveys using aerial video, routecorridor infrastructures surveys using vehicle mounted video cameras, aerial surveys over forestry and agriculture, underwater habitat mapping and disaster management. Many of these video systems are based on interlaced, television standards such as North America's NTSC and European SECAM and PAL television systems that are then recorded using various video formats. This technology has recently being employed as a front-line, remote sensing technology for damage assessment post-disaster. This paper traces the development of spatial video as a remote sensing tool from the early 1980s to the present day. The background to a new spatial-video research initiative based at National University of Ireland, Maynooth, (NUIM) is described. New improvements are proposed and include; low-cost encoders, easy to use software decoders, timing issues and interoperability. These developments will enable specialists and non-specialists collect, process and integrate these datasets within minimal support. This integrated approach will enable decision makers to access relevant remotely sensed datasets quickly and so, carry out rapid damage assessment during and post-disaster.

  18. Strontium removal jar test dataset for all figures and tables.

    Data.gov (United States)

    U.S. Environmental Protection Agency — The datasets where used to generate data to demonstrate strontium removal under various water quality and treatment conditions. This dataset is associated with the...

  19. Predicting dataset popularity for the CMS experiment

    CERN Document Server

    INSPIRE-00005122; Li, Ting; Giommi, Luca; Bonacorsi, Daniele; Wildish, Tony

    2016-01-01

    The CMS experiment at the LHC accelerator at CERN relies on its computing infrastructure to stay at the frontier of High Energy Physics, searching for new phenomena and making discoveries. Even though computing plays a significant role in physics analysis we rarely use its data to predict the system behavior itself. A basic information about computing resources, user activities and site utilization can be really useful for improving the throughput of the system and its management. In this paper, we discuss a first CMS analysis of dataset popularity based on CMS meta-data which can be used as a model for dynamic data placement and provide the foundation of data-driven approach for the CMS computing infrastructure.

  20. Predicting dataset popularity for the CMS experiment

    International Nuclear Information System (INIS)

    Kuznetsov, V.; Li, T.; Giommi, L.; Bonacorsi, D.; Wildish, T.

    2016-01-01

    The CMS experiment at the LHC accelerator at CERN relies on its computing infrastructure to stay at the frontier of High Energy Physics, searching for new phenomena and making discoveries. Even though computing plays a significant role in physics analysis we rarely use its data to predict the system behavior itself. A basic information about computing resources, user activities and site utilization can be really useful for improving the throughput of the system and its management. In this paper, we discuss a first CMS analysis of dataset popularity based on CMS meta-data which can be used as a model for dynamic data placement and provide the foundation of data-driven approach for the CMS computing infrastructure. (paper)

  1. Internationally coordinated glacier monitoring: strategy and datasets

    Science.gov (United States)

    Hoelzle, Martin; Armstrong, Richard; Fetterer, Florence; Gärtner-Roer, Isabelle; Haeberli, Wilfried; Kääb, Andreas; Kargel, Jeff; Nussbaumer, Samuel; Paul, Frank; Raup, Bruce; Zemp, Michael

    2014-05-01

    (c) the Randolph Glacier Inventory (RGI), a new and globally complete digital dataset of outlines from about 180,000 glaciers with some meta-information, which has been used for many applications relating to the IPCC AR5 report. Concerning glacier changes, a database (Fluctuations of Glaciers) exists containing information about mass balance, front variations including past reconstructed time series, geodetic changes and special events. Annual mass balance reporting contains information for about 125 glaciers with a subset of 37 glaciers with continuous observational series since 1980 or earlier. Front variation observations of around 1800 glaciers are available from most of the mountain ranges world-wide. This database was recently updated with 26 glaciers having an unprecedented dataset of length changes from from reconstructions of well-dated historical evidence going back as far as the 16th century. Geodetic observations of about 430 glaciers are available. The database is completed by a dataset containing information on special events including glacier surges, glacier lake outbursts, ice avalanches, eruptions of ice-clad volcanoes, etc. related to about 200 glaciers. A special database of glacier photographs contains 13,000 pictures from around 500 glaciers, some of them dating back to the 19th century. A key challenge is to combine and extend the traditional observations with fast evolving datasets from new technologies.

  2. MIPS bacterial genomes functional annotation benchmark dataset.

    Science.gov (United States)

    Tetko, Igor V; Brauner, Barbara; Dunger-Kaltenbach, Irmtraud; Frishman, Goar; Montrone, Corinna; Fobo, Gisela; Ruepp, Andreas; Antonov, Alexey V; Surmeli, Dimitrij; Mewes, Hans-Wernen

    2005-05-15

    Any development of new methods for automatic functional annotation of proteins according to their sequences requires high-quality data (as benchmark) as well as tedious preparatory work to generate sequence parameters required as input data for the machine learning methods. Different program settings and incompatible protocols make a comparison of the analyzed methods difficult. The MIPS Bacterial Functional Annotation Benchmark dataset (MIPS-BFAB) is a new, high-quality resource comprising four bacterial genomes manually annotated according to the MIPS functional catalogue (FunCat). These resources include precalculated sequence parameters, such as sequence similarity scores, InterPro domain composition and other parameters that could be used to develop and benchmark methods for functional annotation of bacterial protein sequences. These data are provided in XML format and can be used by scientists who are not necessarily experts in genome annotation. BFAB is available at http://mips.gsf.de/proj/bfab

  3. 2006 Fynmeet sea clutter measurement trial: Datasets

    CSIR Research Space (South Africa)

    Herselman, PLR

    2007-09-06

    Full Text Available -011............................................................................................................................................................................................. 25 iii Dataset CAD14-001 0 5 10 15 20 25 30 35 10 20 30 40 50 60 70 80 90 R an ge G at e # Time [s] A bs ol ut e R an ge [m ] RCS [dBm2] vs. time and range for f1 = 9.000 GHz - CAD14-001 2400 2600 2800... 40 10 20 30 40 50 60 70 80 90 R an ge G at e # Time [s] A bs ol ut e R an ge [m ] RCS [dBm2] vs. time and range for f1 = 9.000 GHz - CAD14-002 2400 2600 2800 3000 3200 3400 3600 -30 -25 -20 -15 -10 -5 0 5 10...

  4. A new bed elevation dataset for Greenland

    Directory of Open Access Journals (Sweden)

    J. L. Bamber

    2013-03-01

    Full Text Available We present a new bed elevation dataset for Greenland derived from a combination of multiple airborne ice thickness surveys undertaken between the 1970s and 2012. Around 420 000 line kilometres of airborne data were used, with roughly 70% of this having been collected since the year 2000, when the last comprehensive compilation was undertaken. The airborne data were combined with satellite-derived elevations for non-glaciated terrain to produce a consistent bed digital elevation model (DEM over the entire island including across the glaciated–ice free boundary. The DEM was extended to the continental margin with the aid of bathymetric data, primarily from a compilation for the Arctic. Ice thickness was determined where an ice shelf exists from a combination of surface elevation and radar soundings. The across-track spacing between flight lines warranted interpolation at 1 km postings for significant sectors of the ice sheet. Grids of ice surface elevation, error estimates for the DEM, ice thickness and data sampling density were also produced alongside a mask of land/ocean/grounded ice/floating ice. Errors in bed elevation range from a minimum of ±10 m to about ±300 m, as a function of distance from an observation and local topographic variability. A comparison with the compilation published in 2001 highlights the improvement in resolution afforded by the new datasets, particularly along the ice sheet margin, where ice velocity is highest and changes in ice dynamics most marked. We estimate that the volume of ice included in our land-ice mask would raise mean sea level by 7.36 m, excluding any solid earth effects that would take place during ice sheet decay.

  5. Wind Integration National Dataset Toolkit | Grid Modernization | NREL

    Science.gov (United States)

    Integration National Dataset Toolkit Wind Integration National Dataset Toolkit The Wind Integration National Dataset (WIND) Toolkit is an update and expansion of the Eastern Wind Integration Data Set and Western Wind Integration Data Set. It supports the next generation of wind integration studies. WIND

  6. Solar Integration National Dataset Toolkit | Grid Modernization | NREL

    Science.gov (United States)

    Solar Integration National Dataset Toolkit Solar Integration National Dataset Toolkit NREL is working on a Solar Integration National Dataset (SIND) Toolkit to enable researchers to perform U.S . regional solar generation integration studies. It will provide modeled, coherent subhourly solar power data

  7. Technical note: An inorganic water chemistry dataset (1972–2011 ...

    African Journals Online (AJOL)

    A national dataset of inorganic chemical data of surface waters (rivers, lakes, and dams) in South Africa is presented and made freely available. The dataset comprises more than 500 000 complete water analyses from 1972 up to 2011, collected from more than 2 000 sample monitoring stations in South Africa. The dataset ...

  8. QSAR ligand dataset for modelling mutagenicity, genotoxicity, and rodent carcinogenicity

    Directory of Open Access Journals (Sweden)

    Davy Guan

    2018-04-01

    Full Text Available Five datasets were constructed from ligand and bioassay result data from the literature. These datasets include bioassay results from the Ames mutagenicity assay, Greenscreen GADD-45a-GFP assay, Syrian Hamster Embryo (SHE assay, and 2 year rat carcinogenicity assay results. These datasets provide information about chemical mutagenicity, genotoxicity and carcinogenicity.

  9. Statistical segmentation of multidimensional brain datasets

    Science.gov (United States)

    Desco, Manuel; Gispert, Juan D.; Reig, Santiago; Santos, Andres; Pascau, Javier; Malpica, Norberto; Garcia-Barreno, Pedro

    2001-07-01

    This paper presents an automatic segmentation procedure for MRI neuroimages that overcomes part of the problems involved in multidimensional clustering techniques like partial volume effects (PVE), processing speed and difficulty of incorporating a priori knowledge. The method is a three-stage procedure: 1) Exclusion of background and skull voxels using threshold-based region growing techniques with fully automated seed selection. 2) Expectation Maximization algorithms are used to estimate the probability density function (PDF) of the remaining pixels, which are assumed to be mixtures of gaussians. These pixels can then be classified into cerebrospinal fluid (CSF), white matter and grey matter. Using this procedure, our method takes advantage of using the full covariance matrix (instead of the diagonal) for the joint PDF estimation. On the other hand, logistic discrimination techniques are more robust against violation of multi-gaussian assumptions. 3) A priori knowledge is added using Markov Random Field techniques. The algorithm has been tested with a dataset of 30 brain MRI studies (co-registered T1 and T2 MRI). Our method was compared with clustering techniques and with template-based statistical segmentation, using manual segmentation as a gold-standard. Our results were more robust and closer to the gold-standard.

  10. ASSESSING SMALL SAMPLE WAR-GAMING DATASETS

    Directory of Open Access Journals (Sweden)

    W. J. HURLEY

    2013-10-01

    Full Text Available One of the fundamental problems faced by military planners is the assessment of changes to force structure. An example is whether to replace an existing capability with an enhanced system. This can be done directly with a comparison of measures such as accuracy, lethality, survivability, etc. However this approach does not allow an assessment of the force multiplier effects of the proposed change. To gauge these effects, planners often turn to war-gaming. For many war-gaming experiments, it is expensive, both in terms of time and dollars, to generate a large number of sample observations. This puts a premium on the statistical methodology used to examine these small datasets. In this paper we compare the power of three tests to assess population differences: the Wald-Wolfowitz test, the Mann-Whitney U test, and re-sampling. We employ a series of Monte Carlo simulation experiments. Not unexpectedly, we find that the Mann-Whitney test performs better than the Wald-Wolfowitz test. Resampling is judged to perform slightly better than the Mann-Whitney test.

  11. The Dataset of Countries at Risk of Electoral Violence

    OpenAIRE

    Birch, Sarah; Muchlinski, David

    2017-01-01

    Electoral violence is increasingly affecting elections around the world, yet researchers have been limited by a paucity of granular data on this phenomenon. This paper introduces and describes a new dataset of electoral violence – the Dataset of Countries at Risk of Electoral Violence (CREV) – that provides measures of 10 different types of electoral violence across 642 elections held around the globe between 1995 and 2013. The paper provides a detailed account of how and why the dataset was ...

  12. Norwegian Hydrological Reference Dataset for Climate Change Studies

    Energy Technology Data Exchange (ETDEWEB)

    Magnussen, Inger Helene; Killingland, Magnus; Spilde, Dag

    2012-07-01

    Based on the Norwegian hydrological measurement network, NVE has selected a Hydrological Reference Dataset for studies of hydrological change. The dataset meets international standards with high data quality. It is suitable for monitoring and studying the effects of climate change on the hydrosphere and cryosphere in Norway. The dataset includes streamflow, groundwater, snow, glacier mass balance and length change, lake ice and water temperature in rivers and lakes.(Author)

  13. Public Availability to ECS Collected Datasets

    Science.gov (United States)

    Henderson, J. F.; Warnken, R.; McLean, S. J.; Lim, E.; Varner, J. D.

    2013-12-01

    Coastal nations have spent considerable resources exploring the limits of their extended continental shelf (ECS) beyond 200 nm. Although these studies are funded to fulfill requirements of the UN Convention on the Law of the Sea, the investments are producing new data sets in frontier areas of Earth's oceans that will be used to understand, explore, and manage the seafloor and sub-seafloor for decades to come. Although many of these datasets are considered proprietary until a nation's potential ECS has become 'final and binding' an increasing amount of data are being released and utilized by the public. Data sets include multibeam, seismic reflection/refraction, bottom sampling, and geophysical data. The U.S. ECS Project, a multi-agency collaboration whose mission is to establish the full extent of the continental shelf of the United States consistent with international law, relies heavily on data and accurate, standard metadata. The United States has made it a priority to make available to the public all data collected with ECS-funding as quickly as possible. The National Oceanic and Atmospheric Administration's (NOAA) National Geophysical Data Center (NGDC) supports this objective by partnering with academia and other federal government mapping agencies to archive, inventory, and deliver marine mapping data in a coordinated, consistent manner. This includes ensuring quality, standard metadata and developing and maintaining data delivery capabilities built on modern digital data archives. Other countries, such as Ireland, have submitted their ECS data for public availability and many others have made pledges to participate in the future. The data services provided by NGDC support the U.S. ECS effort as well as many developing nation's ECS effort through the U.N. Environmental Program. Modern discovery, visualization, and delivery of scientific data and derived products that span national and international sources of data ensure the greatest re-use of data and

  14. BIA Indian Lands Dataset (Indian Lands of the United States)

    Data.gov (United States)

    Federal Geographic Data Committee — The American Indian Reservations / Federally Recognized Tribal Entities dataset depicts feature location, selected demographics and other associated data for the 561...

  15. Framework for Interactive Parallel Dataset Analysis on the Grid

    Energy Technology Data Exchange (ETDEWEB)

    Alexander, David A.; Ananthan, Balamurali; /Tech-X Corp.; Johnson, Tony; Serbo, Victor; /SLAC

    2007-01-10

    We present a framework for use at a typical Grid site to facilitate custom interactive parallel dataset analysis targeting terabyte-scale datasets of the type typically produced by large multi-institutional science experiments. We summarize the needs for interactive analysis and show a prototype solution that satisfies those needs. The solution consists of desktop client tool and a set of Web Services that allow scientists to sign onto a Grid site, compose analysis script code to carry out physics analysis on datasets, distribute the code and datasets to worker nodes, collect the results back to the client, and to construct professional-quality visualizations of the results.

  16. Socioeconomic Data and Applications Center (SEDAC) Treaty Status Dataset

    Data.gov (United States)

    National Aeronautics and Space Administration — The Socioeconomic Data and Application Center (SEDAC) Treaty Status Dataset contains comprehensive treaty information for multilateral environmental agreements,...

  17. An Analysis of the GTZAN Music Genre Dataset

    DEFF Research Database (Denmark)

    Sturm, Bob L.

    2012-01-01

    Most research in automatic music genre recognition has used the dataset assembled by Tzanetakis et al. in 2001. The composition and integrity of this dataset, however, has never been formally analyzed. For the first time, we provide an analysis of its composition, and create a machine...

  18. Really big data: Processing and analysis of large datasets

    Science.gov (United States)

    Modern animal breeding datasets are large and getting larger, due in part to the recent availability of DNA data for many animals. Computational methods for efficiently storing and analyzing those data are under development. The amount of storage space required for such datasets is increasing rapidl...

  19. An Annotated Dataset of 14 Cardiac MR Images

    DEFF Research Database (Denmark)

    Stegmann, Mikkel Bille

    2002-01-01

    This note describes a dataset consisting of 14 annotated cardiac MR images. Points of correspondence are placed on each image at the left ventricle (LV). As such, the dataset can be readily used for building statistical models of shape. Further, format specifications and terms of use are given....

  20. A New Outlier Detection Method for Multidimensional Datasets

    KAUST Repository

    Abdel Messih, Mario A.

    2012-07-01

    This study develops a novel hybrid method for outlier detection (HMOD) that combines the idea of distance based and density based methods. The proposed method has two main advantages over most of the other outlier detection methods. The first advantage is that it works well on both dense and sparse datasets. The second advantage is that, unlike most other outlier detection methods that require careful parameter setting and prior knowledge of the data, HMOD is not very sensitive to small changes in parameter values within certain parameter ranges. The only required parameter to set is the number of nearest neighbors. In addition, we made a fully parallelized implementation of HMOD that made it very efficient in applications. Moreover, we proposed a new way of using the outlier detection for redundancy reduction in datasets where the confidence level that evaluates how accurate the less redundant dataset can be used to represent the original dataset can be specified by users. HMOD is evaluated on synthetic datasets (dense and mixed “dense and sparse”) and a bioinformatics problem of redundancy reduction of dataset of position weight matrices (PWMs) of transcription factor binding sites. In addition, in the process of assessing the performance of our redundancy reduction method, we developed a simple tool that can be used to evaluate the confidence level of reduced dataset representing the original dataset. The evaluation of the results shows that our method can be used in a wide range of problems.

  1. ATLAS File and Dataset Metadata Collection and Use

    CERN Document Server

    Albrand, S; The ATLAS collaboration; Lambert, F; Gallas, E J

    2012-01-01

    The ATLAS Metadata Interface (“AMI”) was designed as a generic cataloguing system, and as such it has found many uses in the experiment including software release management, tracking of reconstructed event sizes and control of dataset nomenclature. The primary use of AMI is to provide a catalogue of datasets (file collections) which is searchable using physics criteria. In this paper we discuss the various mechanisms used for filling the AMI dataset and file catalogues. By correlating information from different sources we can derive aggregate information which is important for physics analysis; for example the total number of events contained in dataset, and possible reasons for missing events such as a lost file. Finally we will describe some specialized interfaces which were developed for the Data Preparation and reprocessing coordinators. These interfaces manipulate information from both the dataset domain held in AMI, and the run-indexed information held in the ATLAS COMA application (Conditions and ...

  2. A dataset on tail risk of commodities markets.

    Science.gov (United States)

    Powell, Robert J; Vo, Duc H; Pham, Thach N; Singh, Abhay K

    2017-12-01

    This article contains the datasets related to the research article "The long and short of commodity tails and their relationship to Asian equity markets"(Powell et al., 2017) [1]. The datasets contain the daily prices (and price movements) of 24 different commodities decomposed from the S&P GSCI index and the daily prices (and price movements) of three share market indices including World, Asia, and South East Asia for the period 2004-2015. Then, the dataset is divided into annual periods, showing the worst 5% of price movements for each year. The datasets are convenient to examine the tail risk of different commodities as measured by Conditional Value at Risk (CVaR) as well as their changes over periods. The datasets can also be used to investigate the association between commodity markets and share markets.

  3. Discovery and Reuse of Open Datasets: An Exploratory Study

    Directory of Open Access Journals (Sweden)

    Sara

    2016-07-01

    Full Text Available Objective: This article analyzes twenty cited or downloaded datasets and the repositories that house them, in order to produce insights that can be used by academic libraries to encourage discovery and reuse of research data in institutional repositories. Methods: Using Thomson Reuters’ Data Citation Index and repository download statistics, we identified twenty cited/downloaded datasets. We documented the characteristics of the cited/downloaded datasets and their corresponding repositories in a self-designed rubric. The rubric includes six major categories: basic information; funding agency and journal information; linking and sharing; factors to encourage reuse; repository characteristics; and data description. Results: Our small-scale study suggests that cited/downloaded datasets generally comply with basic recommendations for facilitating reuse: data are documented well; formatted for use with a variety of software; and shared in established, open access repositories. Three significant factors also appear to contribute to dataset discovery: publishing in discipline-specific repositories; indexing in more than one location on the web; and using persistent identifiers. The cited/downloaded datasets in our analysis came from a few specific disciplines, and tended to be funded by agencies with data publication mandates. Conclusions: The results of this exploratory research provide insights that can inform academic librarians as they work to encourage discovery and reuse of institutional datasets. Our analysis also suggests areas in which academic librarians can target open data advocacy in their communities in order to begin to build open data success stories that will fuel future advocacy efforts.

  4. Viability of Controlling Prosthetic Hand Utilizing Electroencephalograph (EEG) Dataset Signal

    Science.gov (United States)

    Miskon, Azizi; A/L Thanakodi, Suresh; Raihan Mazlan, Mohd; Mohd Haziq Azhar, Satria; Nooraya Mohd Tawil, Siti

    2016-11-01

    This project presents the development of an artificial hand controlled by Electroencephalograph (EEG) signal datasets for the prosthetic application. The EEG signal datasets were used as to improvise the way to control the prosthetic hand compared to the Electromyograph (EMG). The EMG has disadvantages to a person, who has not used the muscle for a long time and also to person with degenerative issues due to age factor. Thus, the EEG datasets found to be an alternative for EMG. The datasets used in this work were taken from Brain Computer Interface (BCI) Project. The datasets were already classified for open, close and combined movement operations. It served the purpose as an input to control the prosthetic hand by using an Interface system between Microsoft Visual Studio and Arduino. The obtained results reveal the prosthetic hand to be more efficient and faster in response to the EEG datasets with an additional LiPo (Lithium Polymer) battery attached to the prosthetic. Some limitations were also identified in terms of the hand movements, weight of the prosthetic, and the suggestions to improve were concluded in this paper. Overall, the objective of this paper were achieved when the prosthetic hand found to be feasible in operation utilizing the EEG datasets.

  5. Sparse Group Penalized Integrative Analysis of Multiple Cancer Prognosis Datasets

    Science.gov (United States)

    Liu, Jin; Huang, Jian; Xie, Yang; Ma, Shuangge

    2014-01-01

    SUMMARY In cancer research, high-throughput profiling studies have been extensively conducted, searching for markers associated with prognosis. Because of the “large d, small n” characteristic, results generated from the analysis of a single dataset can be unsatisfactory. Recent studies have shown that integrative analysis, which simultaneously analyzes multiple datasets, can be more effective than single-dataset analysis and classic meta-analysis. In most of existing integrative analysis, the homogeneity model has been assumed, which postulates that different datasets share the same set of markers. Several approaches have been designed to reinforce this assumption. In practice, different datasets may differ in terms of patient selection criteria, profiling techniques, and many other aspects. Such differences may make the homogeneity model too restricted. In this study, we assume the heterogeneity model, under which different datasets are allowed to have different sets of markers. With multiple cancer prognosis datasets, we adopt the AFT (accelerated failure time) model to describe survival. This model may have the lowest computational cost among popular semiparametric survival models. For marker selection, we adopt a sparse group MCP (minimax concave penalty) approach. This approach has an intuitive formulation and can be computed using an effective group coordinate descent algorithm. Simulation study shows that it outperforms the existing approaches under both the homogeneity and heterogeneity models. Data analysis further demonstrates the merit of heterogeneity model and proposed approach. PMID:23938111

  6. PROVIDING GEOGRAPHIC DATASETS AS LINKED DATA IN SDI

    Directory of Open Access Journals (Sweden)

    E. Hietanen

    2016-06-01

    Full Text Available In this study, a prototype service to provide data from Web Feature Service (WFS as linked data is implemented. At first, persistent and unique Uniform Resource Identifiers (URI are created to all spatial objects in the dataset. The objects are available from those URIs in Resource Description Framework (RDF data format. Next, a Web Ontology Language (OWL ontology is created to describe the dataset information content using the Open Geospatial Consortium’s (OGC GeoSPARQL vocabulary. The existing data model is modified in order to take into account the linked data principles. The implemented service produces an HTTP response dynamically. The data for the response is first fetched from existing WFS. Then the Geographic Markup Language (GML format output of the WFS is transformed on-the-fly to the RDF format. Content Negotiation is used to serve the data in different RDF serialization formats. This solution facilitates the use of a dataset in different applications without replicating the whole dataset. In addition, individual spatial objects in the dataset can be referred with URIs. Furthermore, the needed information content of the objects can be easily extracted from the RDF serializations available from those URIs. A solution for linking data objects to the dataset URI is also introduced by using the Vocabulary of Interlinked Datasets (VoID. The dataset is divided to the subsets and each subset is given its persistent and unique URI. This enables the whole dataset to be explored with a web browser and all individual objects to be indexed by search engines.

  7. Homogenised Australian climate datasets used for climate change monitoring

    International Nuclear Information System (INIS)

    Trewin, Blair; Jones, David; Collins; Dean; Jovanovic, Branislava; Braganza, Karl

    2007-01-01

    Full text: The Australian Bureau of Meteorology has developed a number of datasets for use in climate change monitoring. These datasets typically cover 50-200 stations distributed as evenly as possible over the Australian continent, and have been subject to detailed quality control and homogenisation.The time period over which data are available for each element is largely determined by the availability of data in digital form. Whilst nearly all Australian monthly and daily precipitation data have been digitised, a significant quantity of pre-1957 data (for temperature and evaporation) or pre-1987 data (for some other elements) remains to be digitised, and is not currently available for use in the climate change monitoring datasets. In the case of temperature and evaporation, the start date of the datasets is also determined by major changes in instruments or observing practices for which no adjustment is feasible at the present time. The datasets currently available cover: Monthly and daily precipitation (most stations commence 1915 or earlier, with many extending back to the late 19th century, and a few to the mid-19th century); Annual temperature (commences 1910); Daily temperature (commences 1910, with limited station coverage pre-1957); Twice-daily dewpoint/relative humidity (commences 1957); Monthly pan evaporation (commences 1970); Cloud amount (commences 1957) (Jovanovic etal. 2007). As well as the station-based datasets listed above, an additional dataset being developed for use in climate change monitoring (and other applications) covers tropical cyclones in the Australian region. This is described in more detail in Trewin (2007). The datasets already developed are used in analyses of observed climate change, which are available through the Australian Bureau of Meteorology website (http://www.bom.gov.au/silo/products/cli_chg/). They are also used as a basis for routine climate monitoring, and in the datasets used for the development of seasonal

  8. Tension in the recent Type Ia supernovae datasets

    International Nuclear Information System (INIS)

    Wei, Hao

    2010-01-01

    In the present work, we investigate the tension in the recent Type Ia supernovae (SNIa) datasets Constitution and Union. We show that they are in tension not only with the observations of the cosmic microwave background (CMB) anisotropy and the baryon acoustic oscillations (BAO), but also with other SNIa datasets such as Davis and SNLS. Then, we find the main sources responsible for the tension. Further, we make this more robust by employing the method of random truncation. Based on the results of this work, we suggest two truncated versions of the Union and Constitution datasets, namely the UnionT and ConstitutionT SNIa samples, whose behaviors are more regular.

  9. Background qualitative analysis of the European reference life cycle database (ELCD) energy datasets - part II: electricity datasets.

    Science.gov (United States)

    Garraín, Daniel; Fazio, Simone; de la Rúa, Cristina; Recchioni, Marco; Lechón, Yolanda; Mathieux, Fabrice

    2015-01-01

    The aim of this paper is to identify areas of potential improvement of the European Reference Life Cycle Database (ELCD) electricity datasets. The revision is based on the data quality indicators described by the International Life Cycle Data system (ILCD) Handbook, applied on sectorial basis. These indicators evaluate the technological, geographical and time-related representativeness of the dataset and the appropriateness in terms of completeness, precision and methodology. Results show that ELCD electricity datasets have a very good quality in general terms, nevertheless some findings and recommendations in order to improve the quality of Life-Cycle Inventories have been derived. Moreover, these results ensure the quality of the electricity-related datasets to any LCA practitioner, and provide insights related to the limitations and assumptions underlying in the datasets modelling. Giving this information, the LCA practitioner will be able to decide whether the use of the ELCD electricity datasets is appropriate based on the goal and scope of the analysis to be conducted. The methodological approach would be also useful for dataset developers and reviewers, in order to improve the overall Data Quality Requirements of databases.

  10. Dataset definition for CMS operations and physics analyses

    Science.gov (United States)

    Franzoni, Giovanni; Compact Muon Solenoid Collaboration

    2016-04-01

    Data recorded at the CMS experiment are funnelled into streams, integrated in the HLT menu, and further organised in a hierarchical structure of primary datasets and secondary datasets/dedicated skims. Datasets are defined according to the final-state particles reconstructed by the high level trigger, the data format and the use case (physics analysis, alignment and calibration, performance studies). During the first LHC run, new workflows have been added to this canonical scheme, to exploit at best the flexibility of the CMS trigger and data acquisition systems. The concepts of data parking and data scouting have been introduced to extend the physics reach of CMS, offering the opportunity of defining physics triggers with extremely loose selections (e.g. dijet resonance trigger collecting data at a 1 kHz). In this presentation, we review the evolution of the dataset definition during the LHC run I, and we discuss the plans for the run II.

  11. U.S. Climate Divisional Dataset (Version Superseded)

    Data.gov (United States)

    National Oceanic and Atmospheric Administration, Department of Commerce — This data has been superseded by a newer version of the dataset. Please refer to NOAA's Climate Divisional Database for more information. The U.S. Climate Divisional...

  12. Karna Particle Size Dataset for Tables and Figures

    Data.gov (United States)

    U.S. Environmental Protection Agency — This dataset contains 1) table of bulk Pb-XAS LCF results, 2) table of bulk As-XAS LCF results, 3) figure data of particle size distribution, and 4) figure data for...

  13. NOAA Global Surface Temperature Dataset, Version 4.0

    Data.gov (United States)

    National Oceanic and Atmospheric Administration, Department of Commerce — The NOAA Global Surface Temperature Dataset (NOAAGlobalTemp) is derived from two independent analyses: the Extended Reconstructed Sea Surface Temperature (ERSST)...

  14. National Hydrography Dataset (NHD) - USGS National Map Downloadable Data Collection

    Data.gov (United States)

    U.S. Geological Survey, Department of the Interior — The USGS National Hydrography Dataset (NHD) Downloadable Data Collection from The National Map (TNM) is a comprehensive set of digital spatial data that encodes...

  15. Watershed Boundary Dataset (WBD) - USGS National Map Downloadable Data Collection

    Data.gov (United States)

    U.S. Geological Survey, Department of the Interior — The Watershed Boundary Dataset (WBD) from The National Map (TNM) defines the perimeter of drainage areas formed by the terrain and other landscape characteristics....

  16. BASE MAP DATASET, LE FLORE COUNTY, OKLAHOMA, USA

    Data.gov (United States)

    Federal Emergency Management Agency, Department of Homeland Security — Basemap datasets comprise six of the seven FGDC themes of geospatial data that are used by most GIS applications (Note: the seventh framework theme, orthographic...

  17. USGS National Hydrography Dataset from The National Map

    Data.gov (United States)

    U.S. Geological Survey, Department of the Interior — USGS The National Map - National Hydrography Dataset (NHD) is a comprehensive set of digital spatial data that encodes information about naturally occurring and...

  18. A robust dataset-agnostic heart disease classifier from Phonocardiogram.

    Science.gov (United States)

    Banerjee, Rohan; Dutta Choudhury, Anirban; Deshpande, Parijat; Bhattacharya, Sakyajit; Pal, Arpan; Mandana, K M

    2017-07-01

    Automatic classification of normal and abnormal heart sounds is a popular area of research. However, building a robust algorithm unaffected by signal quality and patient demography is a challenge. In this paper we have analysed a wide list of Phonocardiogram (PCG) features in time and frequency domain along with morphological and statistical features to construct a robust and discriminative feature set for dataset-agnostic classification of normal and cardiac patients. The large and open access database, made available in Physionet 2016 challenge was used for feature selection, internal validation and creation of training models. A second dataset of 41 PCG segments, collected using our in-house smart phone based digital stethoscope from an Indian hospital was used for performance evaluation. Our proposed methodology yielded sensitivity and specificity scores of 0.76 and 0.75 respectively on the test dataset in classifying cardiovascular diseases. The methodology also outperformed three popular prior art approaches, when applied on the same dataset.

  19. AFSC/REFM: Seabird Necropsy dataset of North Pacific

    Data.gov (United States)

    National Oceanic and Atmospheric Administration, Department of Commerce — The seabird necropsy dataset contains information on seabird specimens that were collected under salvage and scientific collection permits primarily by...

  20. Dataset definition for CMS operations and physics analyses

    CERN Document Server

    AUTHOR|(CDS)2051291

    2016-01-01

    Data recorded at the CMS experiment are funnelled into streams, integrated in the HLT menu, and further organised in a hierarchical structure of primary datasets, secondary datasets, and dedicated skims. Datasets are defined according to the final-state particles reconstructed by the high level trigger, the data format and the use case (physics analysis, alignment and calibration, performance studies). During the first LHC run, new workflows have been added to this canonical scheme, to exploit at best the flexibility of the CMS trigger and data acquisition systems. The concept of data parking and data scouting have been introduced to extend the physics reach of CMS, offering the opportunity of defining physics triggers with extremely loose selections (e.g. dijet resonance trigger collecting data at a 1 kHz). In this presentation, we review the evolution of the dataset definition during the first run, and we discuss the plans for the second LHC run.

  1. USGS National Boundary Dataset (NBD) Downloadable Data Collection

    Data.gov (United States)

    U.S. Geological Survey, Department of the Interior — The USGS Governmental Unit Boundaries dataset from The National Map (TNM) represents major civil areas for the Nation, including States or Territories, counties (or...

  2. Environmental Dataset Gateway (EDG) CS-W Interface

    Data.gov (United States)

    U.S. Environmental Protection Agency — Use the Environmental Dataset Gateway (EDG) to find and access EPA's environmental resources. Many options are available for easily reusing EDG content in other...

  3. Global Man-made Impervious Surface (GMIS) Dataset From Landsat

    Data.gov (United States)

    National Aeronautics and Space Administration — The Global Man-made Impervious Surface (GMIS) Dataset From Landsat consists of global estimates of fractional impervious cover derived from the Global Land Survey...

  4. A Comparative Analysis of Classification Algorithms on Diverse Datasets

    Directory of Open Access Journals (Sweden)

    M. Alghobiri

    2018-04-01

    Full Text Available Data mining involves the computational process to find patterns from large data sets. Classification, one of the main domains of data mining, involves known structure generalizing to apply to a new dataset and predict its class. There are various classification algorithms being used to classify various data sets. They are based on different methods such as probability, decision tree, neural network, nearest neighbor, boolean and fuzzy logic, kernel-based etc. In this paper, we apply three diverse classification algorithms on ten datasets. The datasets have been selected based on their size and/or number and nature of attributes. Results have been discussed using some performance evaluation measures like precision, accuracy, F-measure, Kappa statistics, mean absolute error, relative absolute error, ROC Area etc. Comparative analysis has been carried out using the performance evaluation measures of accuracy, precision, and F-measure. We specify features and limitations of the classification algorithms for the diverse nature datasets.

  5. Newton SSANTA Dr Water using POU filters dataset

    Data.gov (United States)

    U.S. Environmental Protection Agency — This dataset contains information about all the features extracted from the raw data files, the formulas that were assigned to some of these features, and the...

  6. Estimating parameters for probabilistic linkage of privacy-preserved datasets.

    Science.gov (United States)

    Brown, Adrian P; Randall, Sean M; Ferrante, Anna M; Semmens, James B; Boyd, James H

    2017-07-10

    Probabilistic record linkage is a process used to bring together person-based records from within the same dataset (de-duplication) or from disparate datasets using pairwise comparisons and matching probabilities. The linkage strategy and associated match probabilities are often estimated through investigations into data quality and manual inspection. However, as privacy-preserved datasets comprise encrypted data, such methods are not possible. In this paper, we present a method for estimating the probabilities and threshold values for probabilistic privacy-preserved record linkage using Bloom filters. Our method was tested through a simulation study using synthetic data, followed by an application using real-world administrative data. Synthetic datasets were generated with error rates from zero to 20% error. Our method was used to estimate parameters (probabilities and thresholds) for de-duplication linkages. Linkage quality was determined by F-measure. Each dataset was privacy-preserved using separate Bloom filters for each field. Match probabilities were estimated using the expectation-maximisation (EM) algorithm on the privacy-preserved data. Threshold cut-off values were determined by an extension to the EM algorithm allowing linkage quality to be estimated for each possible threshold. De-duplication linkages of each privacy-preserved dataset were performed using both estimated and calculated probabilities. Linkage quality using the F-measure at the estimated threshold values was also compared to the highest F-measure. Three large administrative datasets were used to demonstrate the applicability of the probability and threshold estimation technique on real-world data. Linkage of the synthetic datasets using the estimated probabilities produced an F-measure that was comparable to the F-measure using calculated probabilities, even with up to 20% error. Linkage of the administrative datasets using estimated probabilities produced an F-measure that was higher

  7. Toward computational cumulative biology by combining models of biological datasets.

    Science.gov (United States)

    Faisal, Ali; Peltonen, Jaakko; Georgii, Elisabeth; Rung, Johan; Kaski, Samuel

    2014-01-01

    A main challenge of data-driven sciences is how to make maximal use of the progressively expanding databases of experimental datasets in order to keep research cumulative. We introduce the idea of a modeling-based dataset retrieval engine designed for relating a researcher's experimental dataset to earlier work in the field. The search is (i) data-driven to enable new findings, going beyond the state of the art of keyword searches in annotations, (ii) modeling-driven, to include both biological knowledge and insights learned from data, and (iii) scalable, as it is accomplished without building one unified grand model of all data. Assuming each dataset has been modeled beforehand, by the researchers or automatically by database managers, we apply a rapidly computable and optimizable combination model to decompose a new dataset into contributions from earlier relevant models. By using the data-driven decomposition, we identify a network of interrelated datasets from a large annotated human gene expression atlas. While tissue type and disease were major driving forces for determining relevant datasets, the found relationships were richer, and the model-based search was more accurate than the keyword search; moreover, it recovered biologically meaningful relationships that are not straightforwardly visible from annotations-for instance, between cells in different developmental stages such as thymocytes and T-cells. Data-driven links and citations matched to a large extent; the data-driven links even uncovered corrections to the publication data, as two of the most linked datasets were not highly cited and turned out to have wrong publication entries in the database.

  8. Testing the Neutral Theory of Biodiversity with Human Microbiome Datasets

    OpenAIRE

    Li, Lianwei; Ma, Zhanshan (Sam)

    2016-01-01

    The human microbiome project (HMP) has made it possible to test important ecological theories for arguably the most important ecosystem to human health?the human microbiome. Existing limited number of studies have reported conflicting evidence in the case of the neutral theory; the present study aims to comprehensively test the neutral theory with extensive HMP datasets covering all five major body sites inhabited by the human microbiome. Utilizing 7437 datasets of bacterial community samples...

  9. General Purpose Multimedia Dataset - GarageBand 2008

    DEFF Research Database (Denmark)

    Meng, Anders

    This document describes a general purpose multimedia data-set to be used in cross-media machine learning problems. In more detail we describe the genre taxonomy applied at http://www.garageband.com, from where the data-set was collected, and how the taxonomy have been fused into a more human...... understandable taxonomy. Finally, a description of various features extracted from both the audio and text are presented....

  10. Artificial intelligence (AI) systems for interpreting complex medical datasets.

    Science.gov (United States)

    Altman, R B

    2017-05-01

    Advances in machine intelligence have created powerful capabilities in algorithms that find hidden patterns in data, classify objects based on their measured characteristics, and associate similar patients/diseases/drugs based on common features. However, artificial intelligence (AI) applications in medical data have several technical challenges: complex and heterogeneous datasets, noisy medical datasets, and explaining their output to users. There are also social challenges related to intellectual property, data provenance, regulatory issues, economics, and liability. © 2017 ASCPT.

  11. Heuristics for Relevancy Ranking of Earth Dataset Search Results

    Science.gov (United States)

    Lynnes, Christopher; Quinn, Patrick; Norton, James

    2016-01-01

    As the Variety of Earth science datasets increases, science researchers find it more challenging to discover and select the datasets that best fit their needs. The most common way of search providers to address this problem is to rank the datasets returned for a query by their likely relevance to the user. Large web page search engines typically use text matching supplemented with reverse link counts, semantic annotations and user intent modeling. However, this produces uneven results when applied to dataset metadata records simply externalized as a web page. Fortunately, data and search provides have decades of experience in serving data user communities, allowing them to form heuristics that leverage the structure in the metadata together with knowledge about the user community. Some of these heuristics include specific ways of matching the user input to the essential measurements in the dataset and determining overlaps of time range and spatial areas. Heuristics based on the novelty of the datasets can prioritize later, better versions of data over similar predecessors. And knowledge of how different user types and communities use data can be brought to bear in cases where characteristics of the user (discipline, expertise) or their intent (applications, research) can be divined. The Earth Observing System Data and Information System has begun implementing some of these heuristics in the relevancy algorithm of its Common Metadata Repository search engine.

  12. Mining and Utilizing Dataset Relevancy from Oceanographic Dataset Metadata, Usage Metrics, and User Feedback to Improve Data Discovery and Access

    Data.gov (United States)

    National Aeronautics and Space Administration — We propose to mine and utilize the combination of Earth Science dataset, metadata with usage metrics and user feedback to objectively extract relevance for improved...

  13. Organic synthesis

    Energy Technology Data Exchange (ETDEWEB)

    Thomas, S.E.

    1991-01-01

    This paper reports on reactions of organoboranes. Organoboron routes to unsaturated hydrocarbons. Boronic ester homologation. Properties of organosilicon compounds. Alkene synthesis (Peterson olefination). Allylsilanes and acylsilanes.

  14. A dataset for preparing pristine graphene-palladium nanocomposites using swollen liquid crystal templates

    Science.gov (United States)

    Vats, Tripti; Siril, Prem Felix

    2017-12-01

    Pristine graphene (G) has not received much attention as a catalyst support, presumably due to its relative inertness as compared to reduced graphene oxide (RGO). In the present work, we used swollen liquid crystals (SLCs) as nano-reactors for graphene-palladium nanocomposites synthesis. The 'soft' confinement of SLCs directs the growth of palladium (Pd) nanoparticles over the G sheets. In this dataset we include all the parameters and details of different techniques used for the characterization of G, SLCs and synthesized G-Pd nanocomposites. The synthesized G-palladium nanocomposites (Pd-G) exhibited improved catalytic activity compared with Pd-RGO and Pd nanoparticles, in the hydrogenation of nitrophenols and C-C coupling reactions.

  15. EEG datasets for motor imagery brain-computer interface.

    Science.gov (United States)

    Cho, Hohyun; Ahn, Minkyu; Ahn, Sangtae; Kwon, Moonyoung; Jun, Sung Chan

    2017-07-01

    Most investigators of brain-computer interface (BCI) research believe that BCI can be achieved through induced neuronal activity from the cortex, but not by evoked neuronal activity. Motor imagery (MI)-based BCI is one of the standard concepts of BCI, in that the user can generate induced activity by imagining motor movements. However, variations in performance over sessions and subjects are too severe to overcome easily; therefore, a basic understanding and investigation of BCI performance variation is necessary to find critical evidence of performance variation. Here we present not only EEG datasets for MI BCI from 52 subjects, but also the results of a psychological and physiological questionnaire, EMG datasets, the locations of 3D EEG electrodes, and EEGs for non-task-related states. We validated our EEG datasets by using the percentage of bad trials, event-related desynchronization/synchronization (ERD/ERS) analysis, and classification analysis. After conventional rejection of bad trials, we showed contralateral ERD and ipsilateral ERS in the somatosensory area, which are well-known patterns of MI. Finally, we showed that 73.08% of datasets (38 subjects) included reasonably discriminative information. Our EEG datasets included the information necessary to determine statistical significance; they consisted of well-discriminated datasets (38 subjects) and less-discriminative datasets. These may provide researchers with opportunities to investigate human factors related to MI BCI performance variation, and may also achieve subject-to-subject transfer by using metadata, including a questionnaire, EEG coordinates, and EEGs for non-task-related states. © The Authors 2017. Published by Oxford University Press.

  16. DEVELOPING VERIFICATION SYSTEMS FOR BUILDING INFORMATION MODELS OF HERITAGE BUILDINGS WITH HETEROGENEOUS DATASETS

    Directory of Open Access Journals (Sweden)

    L. Chow

    2017-08-01

    Full Text Available The digitization and abstraction of existing buildings into building information models requires the translation of heterogeneous datasets that may include CAD, technical reports, historic texts, archival drawings, terrestrial laser scanning, and photogrammetry into model elements. In this paper, we discuss a project undertaken by the Carleton Immersive Media Studio (CIMS that explored the synthesis of heterogeneous datasets for the development of a building information model (BIM for one of Canada’s most significant heritage assets – the Centre Block of the Parliament Hill National Historic Site. The scope of the project included the development of an as-found model of the century-old, six-story building in anticipation of specific model uses for an extensive rehabilitation program. The as-found Centre Block model was developed in Revit using primarily point cloud data from terrestrial laser scanning. The data was captured by CIMS in partnership with Heritage Conservation Services (HCS, Public Services and Procurement Canada (PSPC, using a Leica C10 and P40 (exterior and large interior spaces and a Faro Focus (small to mid-sized interior spaces. Secondary sources such as archival drawings, photographs, and technical reports were referenced in cases where point cloud data was not available. As a result of working with heterogeneous data sets, a verification system was introduced in order to communicate to model users/viewers the source of information for each building element within the model.

  17. Developing Verification Systems for Building Information Models of Heritage Buildings with Heterogeneous Datasets

    Science.gov (United States)

    Chow, L.; Fai, S.

    2017-08-01

    The digitization and abstraction of existing buildings into building information models requires the translation of heterogeneous datasets that may include CAD, technical reports, historic texts, archival drawings, terrestrial laser scanning, and photogrammetry into model elements. In this paper, we discuss a project undertaken by the Carleton Immersive Media Studio (CIMS) that explored the synthesis of heterogeneous datasets for the development of a building information model (BIM) for one of Canada's most significant heritage assets - the Centre Block of the Parliament Hill National Historic Site. The scope of the project included the development of an as-found model of the century-old, six-story building in anticipation of specific model uses for an extensive rehabilitation program. The as-found Centre Block model was developed in Revit using primarily point cloud data from terrestrial laser scanning. The data was captured by CIMS in partnership with Heritage Conservation Services (HCS), Public Services and Procurement Canada (PSPC), using a Leica C10 and P40 (exterior and large interior spaces) and a Faro Focus (small to mid-sized interior spaces). Secondary sources such as archival drawings, photographs, and technical reports were referenced in cases where point cloud data was not available. As a result of working with heterogeneous data sets, a verification system was introduced in order to communicate to model users/viewers the source of information for each building element within the model.

  18. Comparison of CORA and EN4 in-situ datasets validation methods, toward a better quality merged dataset.

    Science.gov (United States)

    Szekely, Tanguy; Killick, Rachel; Gourrion, Jerome; Reverdin, Gilles

    2017-04-01

    CORA and EN4 are both global delayed time mode validated in-situ ocean temperature and salinity datasets distributed by the Met Office (http://www.metoffice.gov.uk/) and Copernicus (www.marine.copernicus.eu). A large part of the profiles distributed by CORA and EN4 in recent years are Argo profiles from the ARGO DAC, but profiles are also extracted from the World Ocean Database and TESAC profiles from GTSPP. In the case of CORA, data coming from the EUROGOOS Regional operationnal oserving system( ROOS) operated by European institutes no managed by National Data Centres and other datasets of profiles povided by scientific sources can also be found (Sea mammals profiles from MEOP, XBT datasets from cruises ...). (EN4 also takes data from the ASBO dataset to supplement observations in the Arctic). First advantage of this new merge product is to enhance the space and time coverage at global and european scales for the period covering 1950 till a year before the current year. This product is updated once a year and T&S gridded fields are alos generated for the period 1990-year n-1. The enhancement compared to the revious CORA product will be presented Despite the fact that the profiles distributed by both datasets are mostly the same, the quality control procedures developed by the Met Office and Copernicus teams differ, sometimes leading to different quality control flags for the same profile. Started in 2016 a new study started that aims to compare both validation procedures to move towards a Copernicus Marine Service dataset with the best features of CORA and EN4 validation.A reference data set composed of the full set of in-situ temperature and salinity measurements collected by Coriolis during 2015 is used. These measurements have been made thanks to wide range of instruments (XBTs, CTDs, Argo floats, Instrumented sea mammals,...), covering the global ocean. The reference dataset has been validated simultaneously by both teams.An exhaustive comparison of the

  19. Wind and wave dataset for Matara, Sri Lanka

    Science.gov (United States)

    Luo, Yao; Wang, Dongxiao; Priyadarshana Gamage, Tilak; Zhou, Fenghua; Madusanka Widanage, Charith; Liu, Taiwei

    2018-01-01

    We present a continuous in situ hydro-meteorology observational dataset from a set of instruments first deployed in December 2012 in the south of Sri Lanka, facing toward the north Indian Ocean. In these waters, simultaneous records of wind and wave data are sparse due to difficulties in deploying measurement instruments, although the area hosts one of the busiest shipping lanes in the world. This study describes the survey, deployment, and measurements of wind and waves, with the aim of offering future users of the dataset the most comprehensive and as much information as possible. This dataset advances our understanding of the nearshore hydrodynamic processes and wave climate, including sea waves and swells, in the north Indian Ocean. Moreover, it is a valuable resource for ocean model parameterization and validation. The archived dataset (Table 1) is examined in detail, including wave data at two locations with water depths of 20 and 10 m comprising synchronous time series of wind, ocean astronomical tide, air pressure, etc. In addition, we use these wave observations to evaluate the ERA-Interim reanalysis product. Based on Buoy 2 data, the swells are the main component of waves year-round, although monsoons can markedly alter the proportion between swell and wind sea. The dataset (Luo et al., 2017) is publicly available from Science Data Bank (https://doi.org/10.11922/sciencedb.447).

  20. The LANDFIRE Refresh strategy: updating the national dataset

    Science.gov (United States)

    Nelson, Kurtis J.; Connot, Joel A.; Peterson, Birgit E.; Martin, Charley

    2013-01-01

    The LANDFIRE Program provides comprehensive vegetation and fuel datasets for the entire United States. As with many large-scale ecological datasets, vegetation and landscape conditions must be updated periodically to account for disturbances, growth, and natural succession. The LANDFIRE Refresh effort was the first attempt to consistently update these products nationwide. It incorporated a combination of specific systematic improvements to the original LANDFIRE National data, remote sensing based disturbance detection methods, field collected disturbance information, vegetation growth and succession modeling, and vegetation transition processes. This resulted in the creation of two complete datasets for all 50 states: LANDFIRE Refresh 2001, which includes the systematic improvements, and LANDFIRE Refresh 2008, which includes the disturbance and succession updates to the vegetation and fuel data. The new datasets are comparable for studying landscape changes in vegetation type and structure over a decadal period, and provide the most recent characterization of fuel conditions across the country. The applicability of the new layers is discussed and the effects of using the new fuel datasets are demonstrated through a fire behavior modeling exercise using the 2011 Wallow Fire in eastern Arizona as an example.

  1. Interactive visualization and analysis of multimodal datasets for surgical applications.

    Science.gov (United States)

    Kirmizibayrak, Can; Yim, Yeny; Wakid, Mike; Hahn, James

    2012-12-01

    Surgeons use information from multiple sources when making surgical decisions. These include volumetric datasets (such as CT, PET, MRI, and their variants), 2D datasets (such as endoscopic videos), and vector-valued datasets (such as computer simulations). Presenting all the information to the user in an effective manner is a challenging problem. In this paper, we present a visualization approach that displays the information from various sources in a single coherent view. The system allows the user to explore and manipulate volumetric datasets, display analysis of dataset values in local regions, combine 2D and 3D imaging modalities and display results of vector-based computer simulations. Several interaction methods are discussed: in addition to traditional interfaces including mouse and trackers, gesture-based natural interaction methods are shown to control these visualizations with real-time performance. An example of a medical application (medialization laryngoplasty) is presented to demonstrate how the combination of different modalities can be used in a surgical setting with our approach.

  2. Wind and wave dataset for Matara, Sri Lanka

    Directory of Open Access Journals (Sweden)

    Y. Luo

    2018-01-01

    Full Text Available We present a continuous in situ hydro-meteorology observational dataset from a set of instruments first deployed in December 2012 in the south of Sri Lanka, facing toward the north Indian Ocean. In these waters, simultaneous records of wind and wave data are sparse due to difficulties in deploying measurement instruments, although the area hosts one of the busiest shipping lanes in the world. This study describes the survey, deployment, and measurements of wind and waves, with the aim of offering future users of the dataset the most comprehensive and as much information as possible. This dataset advances our understanding of the nearshore hydrodynamic processes and wave climate, including sea waves and swells, in the north Indian Ocean. Moreover, it is a valuable resource for ocean model parameterization and validation. The archived dataset (Table 1 is examined in detail, including wave data at two locations with water depths of 20 and 10 m comprising synchronous time series of wind, ocean astronomical tide, air pressure, etc. In addition, we use these wave observations to evaluate the ERA-Interim reanalysis product. Based on Buoy 2 data, the swells are the main component of waves year-round, although monsoons can markedly alter the proportion between swell and wind sea. The dataset (Luo et al., 2017 is publicly available from Science Data Bank (https://doi.org/10.11922/sciencedb.447.

  3. Process mining in oncology using the MIMIC-III dataset

    Science.gov (United States)

    Prima Kurniati, Angelina; Hall, Geoff; Hogg, David; Johnson, Owen

    2018-03-01

    Process mining is a data analytics approach to discover and analyse process models based on the real activities captured in information systems. There is a growing body of literature on process mining in healthcare, including oncology, the study of cancer. In earlier work we found 37 peer-reviewed papers describing process mining research in oncology with a regular complaint being the limited availability and accessibility of datasets with suitable information for process mining. Publicly available datasets are one option and this paper describes the potential to use MIMIC-III, for process mining in oncology. MIMIC-III is a large open access dataset of de-identified patient records. There are 134 publications listed as using the MIMIC dataset, but none of them have used process mining. The MIMIC-III dataset has 16 event tables which are potentially useful for process mining and this paper demonstrates the opportunities to use MIMIC-III for process mining in oncology. Our research applied the L* lifecycle method to provide a worked example showing how process mining can be used to analyse cancer pathways. The results and data quality limitations are discussed along with opportunities for further work and reflection on the value of MIMIC-III for reproducible process mining research.

  4. Recent Development on the NOAA's Global Surface Temperature Dataset

    Science.gov (United States)

    Zhang, H. M.; Huang, B.; Boyer, T.; Lawrimore, J. H.; Menne, M. J.; Rennie, J.

    2016-12-01

    Global Surface Temperature (GST) is one of the most widely used indicators for climate trend and extreme analyses. A widely used GST dataset is the NOAA merged land-ocean surface temperature dataset known as NOAAGlobalTemp (formerly MLOST). The NOAAGlobalTemp had recently been updated from version 3.5.4 to version 4. The update includes a significant improvement in the ocean surface component (Extended Reconstructed Sea Surface Temperature or ERSST, from version 3b to version 4) which resulted in an increased temperature trends in recent decades. Since then, advancements in both the ocean component (ERSST) and land component (GHCN-Monthly) have been made, including the inclusion of Argo float SSTs and expanded EOT modes in ERSST, and the use of ISTI databank in GHCN-Monthly. In this presentation, we describe the impact of those improvements on the merged global temperature dataset, in terms of global trends and other aspects.

  5. Synthetic ALSPAC longitudinal datasets for the Big Data VR project.

    Science.gov (United States)

    Avraam, Demetris; Wilson, Rebecca C; Burton, Paul

    2017-01-01

    Three synthetic datasets - of observation size 15,000, 155,000 and 1,555,000 participants, respectively - were created by simulating eleven cardiac and anthropometric variables from nine collection ages of the ALSAPC birth cohort study. The synthetic datasets retain similar data properties to the ALSPAC study data they are simulated from (co-variance matrices, as well as the mean and variance values of the variables) without including the original data itself or disclosing participant information.  In this instance, the three synthetic datasets have been utilised in an academia-industry collaboration to build a prototype virtual reality data analysis software, but they could have a broader use in method and software development projects where sensitive data cannot be freely shared.

  6. The OXL format for the exchange of integrated datasets

    Directory of Open Access Journals (Sweden)

    Taubert Jan

    2007-12-01

    Full Text Available A prerequisite for systems biology is the integration and analysis of heterogeneous experimental data stored in hundreds of life-science databases and millions of scientific publications. Several standardised formats for the exchange of specific kinds of biological information exist. Such exchange languages facilitate the integration process; however they are not designed to transport integrated datasets. A format for exchanging integrated datasets needs to i cover data from a broad range of application domains, ii be flexible and extensible to combine many different complex data structures, iii include metadata and semantic definitions, iv include inferred information, v identify the original data source for integrated entities and vi transport large integrated datasets. Unfortunately, none of the exchange formats from the biological domain (e.g. BioPAX, MAGE-ML, PSI-MI, SBML or the generic approaches (RDF, OWL fulfil these requirements in a systematic way.

  7. Dataset of transcriptional landscape of B cell early activation

    Directory of Open Access Journals (Sweden)

    Alexander S. Garruss

    2015-09-01

    Full Text Available Signaling via B cell receptors (BCR and Toll-like receptors (TLRs result in activation of B cells with distinct physiological outcomes, but transcriptional regulatory mechanisms that drive activation and distinguish these pathways remain unknown. At early time points after BCR and TLR ligand exposure, 0.5 and 2 h, RNA-seq was performed allowing observations on rapid transcriptional changes. At 2 h, ChIP-seq was performed to allow observations on important regulatory mechanisms potentially driving transcriptional change. The dataset includes RNA-seq, ChIP-seq of control (Input, RNA Pol II, H3K4me3, H3K27me3, and a separate RNA-seq for miRNA expression, which can be found at Gene Expression Omnibus Dataset GSE61608. Here, we provide details on the experimental and analysis methods used to obtain and analyze this dataset and to examine the transcriptional landscape of B cell early activation.

  8. The Global Precipitation Climatology Project (GPCP) Combined Precipitation Dataset

    Science.gov (United States)

    Huffman, George J.; Adler, Robert F.; Arkin, Philip; Chang, Alfred; Ferraro, Ralph; Gruber, Arnold; Janowiak, John; McNab, Alan; Rudolf, Bruno; Schneider, Udo

    1997-01-01

    The Global Precipitation Climatology Project (GPCP) has released the GPCP Version 1 Combined Precipitation Data Set, a global, monthly precipitation dataset covering the period July 1987 through December 1995. The primary product in the dataset is a merged analysis incorporating precipitation estimates from low-orbit-satellite microwave data, geosynchronous-orbit -satellite infrared data, and rain gauge observations. The dataset also contains the individual input fields, a combination of the microwave and infrared satellite estimates, and error estimates for each field. The data are provided on 2.5 deg x 2.5 deg latitude-longitude global grids. Preliminary analyses show general agreement with prior studies of global precipitation and extends prior studies of El Nino-Southern Oscillation precipitation patterns. At the regional scale there are systematic differences with standard climatologies.

  9. A high-resolution European dataset for hydrologic modeling

    Science.gov (United States)

    Ntegeka, Victor; Salamon, Peter; Gomes, Goncalo; Sint, Hadewij; Lorini, Valerio; Thielen, Jutta

    2013-04-01

    There is an increasing demand for large scale hydrological models not only in the field of modeling the impact of climate change on water resources but also for disaster risk assessments and flood or drought early warning systems. These large scale models need to be calibrated and verified against large amounts of observations in order to judge their capabilities to predict the future. However, the creation of large scale datasets is challenging for it requires collection, harmonization, and quality checking of large amounts of observations. For this reason, only a limited number of such datasets exist. In this work, we present a pan European, high-resolution gridded dataset of meteorological observations (EFAS-Meteo) which was designed with the aim to drive a large scale hydrological model. Similar European and global gridded datasets already exist, such as the HadGHCND (Caesar et al., 2006), the JRC MARS-STAT database (van der Goot and Orlandi, 2003) and the E-OBS gridded dataset (Haylock et al., 2008). However, none of those provide similarly high spatial resolution and/or a complete set of variables to force a hydrologic model. EFAS-Meteo contains daily maps of precipitation, surface temperature (mean, minimum and maximum), wind speed and vapour pressure at a spatial grid resolution of 5 x 5 km for the time period 1 January 1990 - 31 December 2011. It furthermore contains calculated radiation, which is calculated by using a staggered approach depending on the availability of sunshine duration, cloud cover and minimum and maximum temperature, and evapotranspiration (potential evapotranspiration, bare soil and open water evapotranspiration). The potential evapotranspiration was calculated using the Penman-Monteith equation with the above-mentioned meteorological variables. The dataset was created as part of the development of the European Flood Awareness System (EFAS) and has been continuously updated throughout the last years. The dataset variables are used as

  10. Visualization of conserved structures by fusing highly variable datasets.

    Science.gov (United States)

    Silverstein, Jonathan C; Chhadia, Ankur; Dech, Fred

    2002-01-01

    Skill, effort, and time are required to identify and visualize anatomic structures in three-dimensions from radiological data. Fundamentally, automating these processes requires a technique that uses symbolic information not in the dynamic range of the voxel data. We were developing such a technique based on mutual information for automatic multi-modality image fusion (MIAMI Fuse, University of Michigan). This system previously demonstrated facility at fusing one voxel dataset with integrated symbolic structure information to a CT dataset (different scale and resolution) from the same person. The next step of development of our technique was aimed at accommodating the variability of anatomy from patient to patient by using warping to fuse our standard dataset to arbitrary patient CT datasets. A standard symbolic information dataset was created from the full color Visible Human Female by segmenting the liver parenchyma, portal veins, and hepatic veins and overwriting each set of voxels with a fixed color. Two arbitrarily selected patient CT scans of the abdomen were used for reference datasets. We used the warping functions in MIAMI Fuse to align the standard structure data to each patient scan. The key to successful fusion was the focused use of multiple warping control points that place themselves around the structure of interest automatically. The user assigns only a few initial control points to align the scans. Fusion 1 and 2 transformed the atlas with 27 points around the liver to CT1 and CT2 respectively. Fusion 3 transformed the atlas with 45 control points around the liver to CT1 and Fusion 4 transformed the atlas with 5 control points around the portal vein. The CT dataset is augmented with the transformed standard structure dataset, such that the warped structure masks are visualized in combination with the original patient dataset. This combined volume visualization is then rendered interactively in stereo on the ImmersaDesk in an immersive Virtual

  11. A cross-country Exchange Market Pressure (EMP) dataset.

    Science.gov (United States)

    Desai, Mohit; Patnaik, Ila; Felman, Joshua; Shah, Ajay

    2017-06-01

    The data presented in this article are related to the research article titled - "An exchange market pressure measure for cross country analysis" (Patnaik et al. [1]). In this article, we present the dataset for Exchange Market Pressure values (EMP) for 139 countries along with their conversion factors, ρ (rho). Exchange Market Pressure, expressed in percentage change in exchange rate, measures the change in exchange rate that would have taken place had the central bank not intervened. The conversion factor ρ can interpreted as the change in exchange rate associated with $1 billion of intervention. Estimates of conversion factor ρ allow us to calculate a monthly time series of EMP for 139 countries. Additionally, the dataset contains the 68% confidence interval (high and low values) for the point estimates of ρ 's. Using the standard errors of estimates of ρ 's, we obtain one sigma intervals around mean estimates of EMP values. These values are also reported in the dataset.

  12. Dataset of herbarium specimens of threatened vascular plants in Catalonia.

    Science.gov (United States)

    Nualart, Neus; Ibáñez, Neus; Luque, Pere; Pedrol, Joan; Vilar, Lluís; Guàrdia, Roser

    2017-01-01

    This data paper describes a specimens' dataset of the Catalonian threatened vascular plants conserved in five public Catalonian herbaria (BC, BCN, HGI, HBIL and MTTE). Catalonia is an administrative region of Spain that includes large autochthon plants diversity and 199 taxa with IUCN threatened categories (EX, EW, RE, CR, EN and VU). This dataset includes 1,618 records collected from 17 th century to nowadays. For each specimen, the species name, locality indication, collection date, collector, ecology and revision label are recorded. More than 94% of the taxa are represented in the herbaria, which evidence the paper of the botanical collections as an essential source of occurrence data.

  13. A Large-Scale 3D Object Recognition dataset

    DEFF Research Database (Denmark)

    Sølund, Thomas; Glent Buch, Anders; Krüger, Norbert

    2016-01-01

    geometric groups; concave, convex, cylindrical and flat 3D object models. The object models have varying amount of local geometric features to challenge existing local shape feature descriptors in terms of descriptiveness and robustness. The dataset is validated in a benchmark which evaluates the matching...... performance of 7 different state-of-the-art local shape descriptors. Further, we validate the dataset in a 3D object recognition pipeline. Our benchmark shows as expected that local shape feature descriptors without any global point relation across the surface have a poor matching performance with flat...

  14. Traffic sign classification with dataset augmentation and convolutional neural network

    Science.gov (United States)

    Tang, Qing; Kurnianggoro, Laksono; Jo, Kang-Hyun

    2018-04-01

    This paper presents a method for traffic sign classification using a convolutional neural network (CNN). In this method, firstly we transfer a color image into grayscale, and then normalize it in the range (-1,1) as the preprocessing step. To increase robustness of classification model, we apply a dataset augmentation algorithm and create new images to train the model. To avoid overfitting, we utilize a dropout module before the last fully connection layer. To assess the performance of the proposed method, the German traffic sign recognition benchmark (GTSRB) dataset is utilized. Experimental results show that the method is effective in classifying traffic signs.

  15. Towards interoperable and reproducible QSAR analyses: Exchange of datasets.

    Science.gov (United States)

    Spjuth, Ola; Willighagen, Egon L; Guha, Rajarshi; Eklund, Martin; Wikberg, Jarl Es

    2010-06-30

    QSAR is a widely used method to relate chemical structures to responses or properties based on experimental observations. Much effort has been made to evaluate and validate the statistical modeling in QSAR, but these analyses treat the dataset as fixed. An overlooked but highly important issue is the validation of the setup of the dataset, which comprises addition of chemical structures as well as selection of descriptors and software implementations prior to calculations. This process is hampered by the lack of standards and exchange formats in the field, making it virtually impossible to reproduce and validate analyses and drastically constrain collaborations and re-use of data. We present a step towards standardizing QSAR analyses by defining interoperable and reproducible QSAR datasets, consisting of an open XML format (QSAR-ML) which builds on an open and extensible descriptor ontology. The ontology provides an extensible way of uniquely defining descriptors for use in QSAR experiments, and the exchange format supports multiple versioned implementations of these descriptors. Hence, a dataset described by QSAR-ML makes its setup completely reproducible. We also provide a reference implementation as a set of plugins for Bioclipse which simplifies setup of QSAR datasets, and allows for exporting in QSAR-ML as well as old-fashioned CSV formats. The implementation facilitates addition of new descriptor implementations from locally installed software and remote Web services; the latter is demonstrated with REST and XMPP Web services. Standardized QSAR datasets open up new ways to store, query, and exchange data for subsequent analyses. QSAR-ML supports completely reproducible creation of datasets, solving the problems of defining which software components were used and their versions, and the descriptor ontology eliminates confusions regarding descriptors by defining them crisply. This makes is easy to join, extend, combine datasets and hence work collectively, but

  16. Towards interoperable and reproducible QSAR analyses: Exchange of datasets

    Directory of Open Access Journals (Sweden)

    Spjuth Ola

    2010-06-01

    Full Text Available Abstract Background QSAR is a widely used method to relate chemical structures to responses or properties based on experimental observations. Much effort has been made to evaluate and validate the statistical modeling in QSAR, but these analyses treat the dataset as fixed. An overlooked but highly important issue is the validation of the setup of the dataset, which comprises addition of chemical structures as well as selection of descriptors and software implementations prior to calculations. This process is hampered by the lack of standards and exchange formats in the field, making it virtually impossible to reproduce and validate analyses and drastically constrain collaborations and re-use of data. Results We present a step towards standardizing QSAR analyses by defining interoperable and reproducible QSAR datasets, consisting of an open XML format (QSAR-ML which builds on an open and extensible descriptor ontology. The ontology provides an extensible way of uniquely defining descriptors for use in QSAR experiments, and the exchange format supports multiple versioned implementations of these descriptors. Hence, a dataset described by QSAR-ML makes its setup completely reproducible. We also provide a reference implementation as a set of plugins for Bioclipse which simplifies setup of QSAR datasets, and allows for exporting in QSAR-ML as well as old-fashioned CSV formats. The implementation facilitates addition of new descriptor implementations from locally installed software and remote Web services; the latter is demonstrated with REST and XMPP Web services. Conclusions Standardized QSAR datasets open up new ways to store, query, and exchange data for subsequent analyses. QSAR-ML supports completely reproducible creation of datasets, solving the problems of defining which software components were used and their versions, and the descriptor ontology eliminates confusions regarding descriptors by defining them crisply. This makes is easy to join

  17. The Wind Integration National Dataset (WIND) toolkit (Presentation)

    Energy Technology Data Exchange (ETDEWEB)

    Caroline Draxl: NREL

    2014-01-01

    Regional wind integration studies require detailed wind power output data at many locations to perform simulations of how the power system will operate under high penetration scenarios. The wind datasets that serve as inputs into the study must realistically reflect the ramping characteristics, spatial and temporal correlations, and capacity factors of the simulated wind plants, as well as being time synchronized with available load profiles.As described in this presentation, the WIND Toolkit fulfills these requirements by providing a state-of-the-art national (US) wind resource, power production and forecast dataset.

  18. Using Multiple Big Datasets and Machine Learning to Produce a New Global Particulate Dataset: A Technology Challenge Case Study

    Science.gov (United States)

    Lary, D. J.

    2013-12-01

    A BigData case study is described where multiple datasets from several satellites, high-resolution global meteorological data, social media and in-situ observations are combined using machine learning on a distributed cluster using an automated workflow. The global particulate dataset is relevant to global public health studies and would not be possible to produce without the use of the multiple big datasets, in-situ data and machine learning.To greatly reduce the development time and enhance the functionality a high level language capable of parallel processing has been used (Matlab). A key consideration for the system is high speed access due to the large data volume, persistence of the large data volumes and a precise process time scheduling capability.

  19. Would the ‘real’ observed dataset stand up? A critical examination of eight observed gridded climate datasets for China

    International Nuclear Information System (INIS)

    Sun, Qiaohong; Miao, Chiyuan; Duan, Qingyun; Kong, Dongxian; Ye, Aizhong; Di, Zhenhua; Gong, Wei

    2014-01-01

    This research compared and evaluated the spatio-temporal similarities and differences of eight widely used gridded datasets. The datasets include daily precipitation over East Asia (EA), the Climate Research Unit (CRU) product, the Global Precipitation Climatology Centre (GPCC) product, the University of Delaware (UDEL) product, Precipitation Reconstruction over Land (PREC/L), the Asian Precipitation Highly Resolved Observational (APHRO) product, the Institute of Atmospheric Physics (IAP) dataset from the Chinese Academy of Sciences, and the National Meteorological Information Center dataset from the China Meteorological Administration (CN05). The meteorological variables focus on surface air temperature (SAT) or precipitation (PR) in China. All datasets presented general agreement on the whole spatio-temporal scale, but some differences appeared for specific periods and regions. On a temporal scale, EA shows the highest amount of PR, while APHRO shows the lowest. CRU and UDEL show higher SAT than IAP or CN05. On a spatial scale, the most significant differences occur in western China for PR and SAT. For PR, the difference between EA and CRU is the largest. When compared with CN05, CRU shows higher SAT in the central and southern Northwest river drainage basin, UDEL exhibits higher SAT over the Southwest river drainage system, and IAP has lower SAT in the Tibetan Plateau. The differences in annual mean PR and SAT primarily come from summer and winter, respectively. Finally, potential factors impacting agreement among gridded climate datasets are discussed, including raw data sources, quality control (QC) schemes, orographic correction, and interpolation techniques. The implications and challenges of these results for climate research are also briefly addressed. (paper)

  20. Organic synthesis

    International Nuclear Information System (INIS)

    Lallemand, J.Y.; Fetizon, M.

    1988-01-01

    The 1988 progress report of the Organic Synthesis Chemistry laboratory (Polytechnic School, France), is presented. The laboratory activities are centered on the chemistry of natural products, which have a biological activity and on the development of new reactions, useful in the organic synthesis. The research works involve the following domains: the natural products chemistry which are applied in pharmacology, the plants and insects chemistry, the organic synthesis, the radical chemistry new reactions and the bio-organic physicochemistry. The published papers, the congress communications and the thesis are listed [fr

  1. Using Real Datasets for Interdisciplinary Business/Economics Projects

    Science.gov (United States)

    Goel, Rajni; Straight, Ronald L.

    2005-01-01

    The workplace's global and dynamic nature allows and requires improved approaches for providing business and economics education. In this article, the authors explore ways of enhancing students' understanding of course material by using nontraditional, real-world datasets of particular interest to them. Teaching at a historically Black university,…

  2. Dataset-driven research for improving recommender systems for learning

    NARCIS (Netherlands)

    Verbert, Katrien; Drachsler, Hendrik; Manouselis, Nikos; Wolpers, Martin; Vuorikari, Riina; Duval, Erik

    2011-01-01

    Verbert, K., Drachsler, H., Manouselis, N., Wolpers, M., Vuorikari, R., & Duval, E. (2011). Dataset-driven research for improving recommender systems for learning. In Ph. Long, & G. Siemens (Eds.), Proceedings of 1st International Conference Learning Analytics & Knowledge (pp. 44-53). February,

  3. dataTEL - Datasets for Technology Enhanced Learning

    NARCIS (Netherlands)

    Drachsler, Hendrik; Verbert, Katrien; Sicilia, Miguel-Angel; Wolpers, Martin; Manouselis, Nikos; Vuorikari, Riina; Lindstaedt, Stefanie; Fischer, Frank

    2011-01-01

    Drachsler, H., Verbert, K., Sicilia, M. A., Wolpers, M., Manouselis, N., Vuorikari, R., Lindstaedt, S., & Fischer, F. (2011). dataTEL - Datasets for Technology Enhanced Learning. STELLAR Alpine Rendez-Vous White Paper. Alpine Rendez-Vous 2011 White paper collection, Nr. 13., France (2011)

  4. A dataset of forest biomass structure for Eurasia.

    Science.gov (United States)

    Schepaschenko, Dmitry; Shvidenko, Anatoly; Usoltsev, Vladimir; Lakyda, Petro; Luo, Yunjian; Vasylyshyn, Roman; Lakyda, Ivan; Myklush, Yuriy; See, Linda; McCallum, Ian; Fritz, Steffen; Kraxner, Florian; Obersteiner, Michael

    2017-05-16

    The most comprehensive dataset of in situ destructive sampling measurements of forest biomass in Eurasia have been compiled from a combination of experiments undertaken by the authors and from scientific publications. Biomass is reported as four components: live trees (stem, bark, branches, foliage, roots); understory (above- and below ground); green forest floor (above- and below ground); and coarse woody debris (snags, logs, dead branches of living trees and dead roots), consisting of 10,351 unique records of sample plots and 9,613 sample trees from ca 1,200 experiments for the period 1930-2014 where there is overlap between these two datasets. The dataset also contains other forest stand parameters such as tree species composition, average age, tree height, growing stock volume, etc., when available. Such a dataset can be used for the development of models of biomass structure, biomass extension factors, change detection in biomass structure, investigations into biodiversity and species distribution and the biodiversity-productivity relationship, as well as the assessment of the carbon pool and its dynamics, among many others.

  5. A reanalysis dataset of the South China Sea

    Science.gov (United States)

    Zeng, Xuezhi; Peng, Shiqiu; Li, Zhijin; Qi, Yiquan; Chen, Rongyu

    2014-01-01

    Ocean reanalysis provides a temporally continuous and spatially gridded four-dimensional estimate of the ocean state for a better understanding of the ocean dynamics and its spatial/temporal variability. Here we present a 19-year (1992–2010) high-resolution ocean reanalysis dataset of the upper ocean in the South China Sea (SCS) produced from an ocean data assimilation system. A wide variety of observations, including in-situ temperature/salinity profiles, ship-measured and satellite-derived sea surface temperatures, and sea surface height anomalies from satellite altimetry, are assimilated into the outputs of an ocean general circulation model using a multi-scale incremental three-dimensional variational data assimilation scheme, yielding a daily high-resolution reanalysis dataset of the SCS. Comparisons between the reanalysis and independent observations support the reliability of the dataset. The presented dataset provides the research community of the SCS an important data source for studying the thermodynamic processes of the ocean circulation and meso-scale features in the SCS, including their spatial and temporal variability. PMID:25977803

  6. Comparision of analysis of the QTLMAS XII common dataset

    DEFF Research Database (Denmark)

    Crooks, Lucy; Sahana, Goutam; de Koning, Dirk-Jan

    2009-01-01

    As part of the QTLMAS XII workshop, a simulated dataset was distributed and participants were invited to submit analyses of the data based on genome-wide association, fine mapping and genomic selection. We have evaluated the findings from the groups that reported fine mapping and genome-wide asso...

  7. The LAMBADA dataset: Word prediction requiring a broad discourse context

    NARCIS (Netherlands)

    Paperno, D.; Kruszewski, G.; Lazaridou, A.; Pham, Q.N.; Bernardi, R.; Pezzelle, S.; Baroni, M.; Boleda, G.; Fernández, R.; Erk, K.; Smith, N.A.

    2016-01-01

    We introduce LAMBADA, a dataset to evaluate the capabilities of computational models for text understanding by means of a word prediction task. LAMBADA is a collection of narrative passages sharing the characteristic that human subjects are able to guess their last word if they are exposed to the

  8. NEW WEB-BASED ACCESS TO NUCLEAR STRUCTURE DATASETS.

    Energy Technology Data Exchange (ETDEWEB)

    WINCHELL,D.F.

    2004-09-26

    As part of an effort to migrate the National Nuclear Data Center (NNDC) databases to a relational platform, a new web interface has been developed for the dissemination of the nuclear structure datasets stored in the Evaluated Nuclear Structure Data File and Experimental Unevaluated Nuclear Data List.

  9. Cross-Cultural Concept Mapping of Standardized Datasets

    DEFF Research Database (Denmark)

    Kano Glückstad, Fumiko

    2012-01-01

    This work compares four feature-based similarity measures derived from cognitive sciences. The purpose of the comparative analysis is to verify the potentially most effective model that can be applied for mapping independent ontologies in a culturally influenced domain [1]. Here, datasets based...

  10. Level-1 muon trigger performance with the full 2017 dataset

    CERN Document Server

    CMS Collaboration

    2018-01-01

    This document describes the performance of the CMS Level-1 Muon Trigger with the full dataset of 2017. Efficiency plots are included for each track finder (TF) individually and for the system as a whole. The efficiency is measured to be greater than 90% for all track finders.

  11. A Dataset for Visual Navigation with Neuromorphic Methods

    Directory of Open Access Journals (Sweden)

    Francisco eBarranco

    2016-02-01

    Full Text Available Standardized benchmarks in Computer Vision have greatly contributed to the advance of approaches to many problems in the field. If we want to enhance the visibility of event-driven vision and increase its impact, we will need benchmarks that allow comparison among different neuromorphic methods as well as comparison to Computer Vision conventional approaches. We present datasets to evaluate the accuracy of frame-free and frame-based approaches for tasks of visual navigation. Similar to conventional Computer Vision datasets, we provide synthetic and real scenes, with the synthetic data created with graphics packages, and the real data recorded using a mobile robotic platform carrying a dynamic and active pixel vision sensor (DAVIS and an RGB+Depth sensor. For both datasets the cameras move with a rigid motion in a static scene, and the data includes the images, events, optic flow, 3D camera motion, and the depth of the scene, along with calibration procedures. Finally, we also provide simulated event data generated synthetically from well-known frame-based optical flow datasets.

  12. Evaluation of Uncertainty in Precipitation Datasets for New Mexico, USA

    Science.gov (United States)

    Besha, A. A.; Steele, C. M.; Fernald, A.

    2014-12-01

    Climate change, population growth and other factors are endangering water availability and sustainability in semiarid/arid areas particularly in the southwestern United States. Wide coverage of spatial and temporal measurements of precipitation are key for regional water budget analysis and hydrological operations which themselves are valuable tool for water resource planning and management. Rain gauge measurements are usually reliable and accurate at a point. They measure rainfall continuously, but spatial sampling is limited. Ground based radar and satellite remotely sensed precipitation have wide spatial and temporal coverage. However, these measurements are indirect and subject to errors because of equipment, meteorological variability, the heterogeneity of the land surface itself and lack of regular recording. This study seeks to understand precipitation uncertainty and in doing so, lessen uncertainty propagation into hydrological applications and operations. We reviewed, compared and evaluated the TRMM (Tropical Rainfall Measuring Mission) precipitation products, NOAA's (National Oceanic and Atmospheric Administration) Global Precipitation Climatology Centre (GPCC) monthly precipitation dataset, PRISM (Parameter elevation Regression on Independent Slopes Model) data and data from individual climate stations including Cooperative Observer Program (COOP), Remote Automated Weather Stations (RAWS), Soil Climate Analysis Network (SCAN) and Snowpack Telemetry (SNOTEL) stations. Though not yet finalized, this study finds that the uncertainty within precipitation estimates datasets is influenced by regional topography, season, climate and precipitation rate. Ongoing work aims to further evaluate precipitation datasets based on the relative influence of these phenomena so that we can identify the optimum datasets for input to statewide water budget analysis.

  13. Dataset: Multi Sensor-Orientation Movement Data of Goats

    NARCIS (Netherlands)

    Kamminga, Jacob Wilhelm

    2018-01-01

    This is a labeled dataset. Motion data were collected from six sensor nodes that were fixed with different orientations to a collar around the neck of goats. These six sensor nodes simultaneously, with different orientations, recorded various activities performed by the goat. We recorded the

  14. A dataset of human decision-making in teamwork management

    Science.gov (United States)

    Yu, Han; Shen, Zhiqi; Miao, Chunyan; Leung, Cyril; Chen, Yiqiang; Fauvel, Simon; Lin, Jun; Cui, Lizhen; Pan, Zhengxiang; Yang, Qiang

    2017-01-01

    Today, most endeavours require teamwork by people with diverse skills and characteristics. In managing teamwork, decisions are often made under uncertainty and resource constraints. The strategies and the effectiveness of the strategies different people adopt to manage teamwork under different situations have not yet been fully explored, partially due to a lack of detailed large-scale data. In this paper, we describe a multi-faceted large-scale dataset to bridge this gap. It is derived from a game simulating complex project management processes. It presents the participants with different conditions in terms of team members' capabilities and task characteristics for them to exhibit their decision-making strategies. The dataset contains detailed data reflecting the decision situations, decision strategies, decision outcomes, and the emotional responses of 1,144 participants from diverse backgrounds. To our knowledge, this is the first dataset simultaneously covering these four facets of decision-making. With repeated measurements, the dataset may help establish baseline variability of decision-making in teamwork management, leading to more realistic decision theoretic models and more effective decision support approaches.

  15. UK surveillance: provision of quality assured information from combined datasets.

    Science.gov (United States)

    Paiba, G A; Roberts, S R; Houston, C W; Williams, E C; Smith, L H; Gibbens, J C; Holdship, S; Lysons, R

    2007-09-14

    Surveillance information is most useful when provided within a risk framework, which is achieved by presenting results against an appropriate denominator. Often the datasets are captured separately and for different purposes, and will have inherent errors and biases that can be further confounded by the act of merging. The United Kingdom Rapid Analysis and Detection of Animal-related Risks (RADAR) system contains data from several sources and provides both data extracts for research purposes and reports for wider stakeholders. Considerable efforts are made to optimise the data in RADAR during the Extraction, Transformation and Loading (ETL) process. Despite efforts to ensure data quality, the final dataset inevitably contains some data errors and biases, most of which cannot be rectified during subsequent analysis. So, in order for users to establish the 'fitness for purpose' of data merged from more than one data source, Quality Statements are produced as defined within the overarching surveillance Quality Framework. These documents detail identified data errors and biases following ETL and report construction as well as relevant aspects of the datasets from which the data originated. This paper illustrates these issues using RADAR datasets, and describes how they can be minimised.

  16. participatory development of a minimum dataset for the khayelitsha ...

    African Journals Online (AJOL)

    This dataset was integrated with data requirements at ... model for defining health information needs at district level. This participatory process has enabled health workers to appraise their .... of reproductive health, mental health, disability and community ... each chose a facilitator and met in between the forum meetings.

  17. Comparision of analysis of the QTLMAS XII common dataset

    DEFF Research Database (Denmark)

    Lund, Mogens Sandø; Sahana, Goutam; de Koning, Dirk-Jan

    2009-01-01

    A dataset was simulated and distributed to participants of the QTLMAS XII workshop who were invited to develop genomic selection models. Each contributing group was asked to describe the model development and validation as well as to submit genomic predictions for three generations of individuals...

  18. The NASA Subsonic Jet Particle Image Velocimetry (PIV) Dataset

    Science.gov (United States)

    Bridges, James; Wernet, Mark P.

    2011-01-01

    Many tasks in fluids engineering require prediction of turbulence of jet flows. The present document documents the single-point statistics of velocity, mean and variance, of cold and hot jet flows. The jet velocities ranged from 0.5 to 1.4 times the ambient speed of sound, and temperatures ranged from unheated to static temperature ratio 2.7. Further, the report assesses the accuracies of the data, e.g., establish uncertainties for the data. This paper covers the following five tasks: (1) Document acquisition and processing procedures used to create the particle image velocimetry (PIV) datasets. (2) Compare PIV data with hotwire and laser Doppler velocimetry (LDV) data published in the open literature. (3) Compare different datasets acquired at the same flow conditions in multiple tests to establish uncertainties. (4) Create a consensus dataset for a range of hot jet flows, including uncertainty bands. (5) Analyze this consensus dataset for self-consistency and compare jet characteristics to those of the open literature. The final objective was fulfilled by using the potential core length and the spread rate of the half-velocity radius to collapse of the mean and turbulent velocity fields over the first 20 jet diameters.

  19. A new dataset validation system for the Planetary Science Archive

    Science.gov (United States)

    Manaud, N.; Zender, J.; Heather, D.; Martinez, S.

    2007-08-01

    The Planetary Science Archive is the official archive for the Mars Express mission. It has received its first data by the end of 2004. These data are delivered by the PI teams to the PSA team as datasets, which are formatted conform to the Planetary Data System (PDS). The PI teams are responsible for analyzing and calibrating the instrument data as well as the production of reduced and calibrated data. They are also responsible of the scientific validation of these data. ESA is responsible of the long-term data archiving and distribution to the scientific community and must ensure, in this regard, that all archived products meet quality. To do so, an archive peer-review is used to control the quality of the Mars Express science data archiving process. However a full validation of its content is missing. An independent review board recently recommended that the completeness of the archive as well as the consistency of the delivered data should be validated following well-defined procedures. A new validation software tool is being developed to complete the overall data quality control system functionality. This new tool aims to improve the quality of data and services provided to the scientific community through the PSA, and shall allow to track anomalies in and to control the completeness of datasets. It shall ensure that the PSA end-users: (1) can rely on the result of their queries, (2) will get data products that are suitable for scientific analysis, (3) can find all science data acquired during a mission. We defined dataset validation as the verification and assessment process to check the dataset content against pre-defined top-level criteria, which represent the general characteristics of good quality datasets. The dataset content that is checked includes the data and all types of information that are essential in the process of deriving scientific results and those interfacing with the PSA database. The validation software tool is a multi-mission tool that

  20. Data Recommender: An Alternative Way to Discover Open Scientific Datasets

    Science.gov (United States)

    Klump, J. F.; Devaraju, A.; Williams, G.; Hogan, D.; Davy, R.; Page, J.; Singh, D.; Peterson, N.

    2017-12-01

    Over the past few years, institutions and government agencies have adopted policies to openly release their data, which has resulted in huge amounts of open data becoming available on the web. When trying to discover the data, users face two challenges: an overload of choice and the limitations of the existing data search tools. On the one hand, there are too many datasets to choose from, and therefore, users need to spend considerable effort to find the datasets most relevant to their research. On the other hand, data portals commonly offer keyword and faceted search, which depend fully on the user queries to search and rank relevant datasets. Consequently, keyword and faceted search may return loosely related or irrelevant results, although the results may contain the same query. They may also return highly specific results that depend more on how well metadata was authored. They do not account well for variance in metadata due to variance in author styles and preferences. The top-ranked results may also come from the same data collection, and users are unlikely to discover new and interesting datasets. These search modes mainly suits users who can express their information needs in terms of the structure and terminology of the data portals, but may pose a challenge otherwise. The above challenges reflect that we need a solution that delivers the most relevant (i.e., similar and serendipitous) datasets to users, beyond the existing search functionalities on the portals. A recommender system is an information filtering system that presents users with relevant and interesting contents based on users' context and preferences. Delivering data recommendations to users can make data discovery easier, and as a result may enhance user engagement with the portal. We developed a hybrid data recommendation approach for the CSIRO Data Access Portal. The approach leverages existing recommendation techniques (e.g., content-based filtering and item co-occurrence) to produce

  1. Comparison of global 3-D aviation emissions datasets

    Directory of Open Access Journals (Sweden)

    S. C. Olsen

    2013-01-01

    Full Text Available Aviation emissions are unique from other transportation emissions, e.g., from road transportation and shipping, in that they occur at higher altitudes as well as at the surface. Aviation emissions of carbon dioxide, soot, and water vapor have direct radiative impacts on the Earth's climate system while emissions of nitrogen oxides (NOx, sulfur oxides, carbon monoxide (CO, and hydrocarbons (HC impact air quality and climate through their effects on ozone, methane, and clouds. The most accurate estimates of the impact of aviation on air quality and climate utilize three-dimensional chemistry-climate models and gridded four dimensional (space and time aviation emissions datasets. We compare five available aviation emissions datasets currently and historically used to evaluate the impact of aviation on climate and air quality: NASA-Boeing 1992, NASA-Boeing 1999, QUANTIFY 2000, Aero2k 2002, and AEDT 2006 and aviation fuel usage estimates from the International Energy Agency. Roughly 90% of all aviation emissions are in the Northern Hemisphere and nearly 60% of all fuelburn and NOx emissions occur at cruise altitudes in the Northern Hemisphere. While these datasets were created by independent methods and are thus not strictly suitable for analyzing trends they suggest that commercial aviation fuelburn and NOx emissions increased over the last two decades while HC emissions likely decreased and CO emissions did not change significantly. The bottom-up estimates compared here are consistently lower than International Energy Agency fuelburn statistics although the gap is significantly smaller in the more recent datasets. Overall the emissions distributions are quite similar for fuelburn and NOx with regional peaks over the populated land masses of North America, Europe, and East Asia. For CO and HC there are relatively larger differences. There are however some distinct differences in the altitude distribution

  2. Geoseq: a tool for dissecting deep-sequencing datasets

    Directory of Open Access Journals (Sweden)

    Homann Robert

    2010-10-01

    Full Text Available Abstract Background Datasets generated on deep-sequencing platforms have been deposited in various public repositories such as the Gene Expression Omnibus (GEO, Sequence Read Archive (SRA hosted by the NCBI, or the DNA Data Bank of Japan (ddbj. Despite being rich data sources, they have not been used much due to the difficulty in locating and analyzing datasets of interest. Results Geoseq http://geoseq.mssm.edu provides a new method of analyzing short reads from deep sequencing experiments. Instead of mapping the reads to reference genomes or sequences, Geoseq maps a reference sequence against the sequencing data. It is web-based, and holds pre-computed data from public libraries. The analysis reduces the input sequence to tiles and measures the coverage of each tile in a sequence library through the use of suffix arrays. The user can upload custom target sequences or use gene/miRNA names for the search and get back results as plots and spreadsheet files. Geoseq organizes the public sequencing data using a controlled vocabulary, allowing identification of relevant libraries by organism, tissue and type of experiment. Conclusions Analysis of small sets of sequences against deep-sequencing datasets, as well as identification of public datasets of interest, is simplified by Geoseq. We applied Geoseq to, a identify differential isoform expression in mRNA-seq datasets, b identify miRNAs (microRNAs in libraries, and identify mature and star sequences in miRNAS and c to identify potentially mis-annotated miRNAs. The ease of using Geoseq for these analyses suggests its utility and uniqueness as an analysis tool.

  3. On sample size and different interpretations of snow stability datasets

    Science.gov (United States)

    Schirmer, M.; Mitterer, C.; Schweizer, J.

    2009-04-01

    Interpretations of snow stability variations need an assessment of the stability itself, independent of the scale investigated in the study. Studies on stability variations at a regional scale have often chosen stability tests such as the Rutschblock test or combinations of various tests in order to detect differences in aspect and elevation. The question arose: ‘how capable are such stability interpretations in drawing conclusions'. There are at least three possible errors sources: (i) the variance of the stability test itself; (ii) the stability variance at an underlying slope scale, and (iii) that the stability interpretation might not be directly related to the probability of skier triggering. Various stability interpretations have been proposed in the past that provide partly different results. We compared a subjective one based on expert knowledge with a more objective one based on a measure derived from comparing skier-triggered slopes vs. slopes that have been skied but not triggered. In this study, the uncertainties are discussed and their effects on regional scale stability variations will be quantified in a pragmatic way. An existing dataset with very large sample sizes was revisited. This dataset contained the variance of stability at a regional scale for several situations. The stability in this dataset was determined using the subjective interpretation scheme based on expert knowledge. The question to be answered was how many measurements were needed to obtain similar results (mainly stability differences in aspect or elevation) as with the complete dataset. The optimal sample size was obtained in several ways: (i) assuming a nominal data scale the sample size was determined with a given test, significance level and power, and by calculating the mean and standard deviation of the complete dataset. With this method it can also be determined if the complete dataset consists of an appropriate sample size. (ii) Smaller subsets were created with similar

  4. Data Discovery of Big and Diverse Climate Change Datasets - Options, Practices and Challenges

    Science.gov (United States)

    Palanisamy, G.; Boden, T.; McCord, R. A.; Frame, M. T.

    2013-12-01

    Developing data search tools is a very common, but often confusing, task for most of the data intensive scientific projects. These search interfaces need to be continually improved to handle the ever increasing diversity and volume of data collections. There are many aspects which determine the type of search tool a project needs to provide to their user community. These include: number of datasets, amount and consistency of discovery metadata, ancillary information such as availability of quality information and provenance, and availability of similar datasets from other distributed sources. Environmental Data Science and Systems (EDSS) group within the Environmental Science Division at the Oak Ridge National Laboratory has a long history of successfully managing diverse and big observational datasets for various scientific programs via various data centers such as DOE's Atmospheric Radiation Measurement Program (ARM), DOE's Carbon Dioxide Information and Analysis Center (CDIAC), USGS's Core Science Analytics and Synthesis (CSAS) metadata Clearinghouse and NASA's Distributed Active Archive Center (ORNL DAAC). This talk will showcase some of the recent developments for improving the data discovery within these centers The DOE ARM program recently developed a data discovery tool which allows users to search and discover over 4000 observational datasets. These datasets are key to the research efforts related to global climate change. The ARM discovery tool features many new functions such as filtered and faceted search logic, multi-pass data selection, filtering data based on data quality, graphical views of data quality and availability, direct access to data quality reports, and data plots. The ARM Archive also provides discovery metadata to other broader metadata clearinghouses such as ESGF, IASOA, and GOS. In addition to the new interface, ARM is also currently working on providing DOI metadata records to publishers such as Thomson Reuters and Elsevier. The ARM

  5. A multimodal MRI dataset of professional chess players.

    Science.gov (United States)

    Li, Kaiming; Jiang, Jing; Qiu, Lihua; Yang, Xun; Huang, Xiaoqi; Lui, Su; Gong, Qiyong

    2015-01-01

    Chess is a good model to study high-level human brain functions such as spatial cognition, memory, planning, learning and problem solving. Recent studies have demonstrated that non-invasive MRI techniques are valuable for researchers to investigate the underlying neural mechanism of playing chess. For professional chess players (e.g., chess grand masters and masters or GM/Ms), what are the structural and functional alterations due to long-term professional practice, and how these alterations relate to behavior, are largely veiled. Here, we report a multimodal MRI dataset from 29 professional Chinese chess players (most of whom are GM/Ms), and 29 age matched novices. We hope that this dataset will provide researchers with new materials to further explore high-level human brain functions.

  6. Knowledge discovery with classification rules in a cardiovascular dataset.

    Science.gov (United States)

    Podgorelec, Vili; Kokol, Peter; Stiglic, Milojka Molan; Hericko, Marjan; Rozman, Ivan

    2005-12-01

    In this paper we study an evolutionary machine learning approach to data mining and knowledge discovery based on the induction of classification rules. A method for automatic rules induction called AREX using evolutionary induction of decision trees and automatic programming is introduced. The proposed algorithm is applied to a cardiovascular dataset consisting of different groups of attributes which should possibly reveal the presence of some specific cardiovascular problems in young patients. A case study is presented that shows the use of AREX for the classification of patients and for discovering possible new medical knowledge from the dataset. The defined knowledge discovery loop comprises a medical expert's assessment of induced rules to drive the evolution of rule sets towards more appropriate solutions. The final result is the discovery of a possible new medical knowledge in the field of pediatric cardiology.

  7. Augmented Reality Prototype for Visualizing Large Sensors’ Datasets

    Directory of Open Access Journals (Sweden)

    Folorunso Olufemi A.

    2011-04-01

    Full Text Available This paper addressed the development of an augmented reality (AR based scientific visualization system prototype that supports identification, localisation, and 3D visualisation of oil leakages sensors datasets. Sensors generates significant amount of multivariate datasets during normal and leak situations which made data exploration and visualisation daunting tasks. Therefore a model to manage such data and enhance computational support needed for effective explorations are developed in this paper. A challenge of this approach is to reduce the data inefficiency. This paper presented a model for computing information gain for each data attributes and determine a lead attribute.The computed lead attribute is then used for the development of an AR-based scientific visualization interface which automatically identifies, localises and visualizes all necessary data relevant to a particularly selected region of interest (ROI on the network. Necessary architectural system supports and the interface requirements for such visualizations are also presented.

  8. An integrated dataset for in silico drug discovery

    Directory of Open Access Journals (Sweden)

    Cockell Simon J

    2010-12-01

    Full Text Available Drug development is expensive and prone to failure. It is potentially much less risky and expensive to reuse a drug developed for one condition for treating a second disease, than it is to develop an entirely new compound. Systematic approaches to drug repositioning are needed to increase throughput and find candidates more reliably. Here we address this need with an integrated systems biology dataset, developed using the Ondex data integration platform, for the in silico discovery of new drug repositioning candidates. We demonstrate that the information in this dataset allows known repositioning examples to be discovered. We also propose a means of automating the search for new treatment indications of existing compounds.

  9. Application of Density Estimation Methods to Datasets from a Glider

    Science.gov (United States)

    2014-09-30

    humpback and sperm whales as well as different dolphin species. OBJECTIVES The objective of this research is to extend existing methods for cetacean...collection of information is estimated to average 1 hour per response, including the time for reviewing instructions, searching existing data sources...estimation from single sensor datasets. Required steps for a cue counting approach, where a cue has been defined as a clicking event (Küsel et al., 2011), to

  10. A review of continent scale hydrological datasets available for Africa

    OpenAIRE

    Bonsor, H.C.

    2010-01-01

    As rainfall becomes less reliable with predicted climate change the ability to assess the spatial and seasonal variations in groundwater availability on a large-scale (catchment and continent) is becoming increasingly important (Bates, et al. 2007; MacDonald et al. 2009). The scarcity of observed hydrological data, or difficulty in obtaining such data, within Africa means remotely sensed (RS) datasets must often be used to drive large-scale hydrological models. The different ap...

  11. Dataset of mitochondrial genome variants in oncocytic tumors

    Directory of Open Access Journals (Sweden)

    Lihua Lyu

    2018-04-01

    Full Text Available This dataset presents the mitochondrial genome variants associated with oncocytic tumors. These data were obtained by Sanger sequencing of the whole mitochondrial genomes of oncocytic tumors and the adjacent normal tissues from 32 patients. The mtDNA variants are identified after compared with the revised Cambridge sequence, excluding those defining haplogroups of our patients. The pathogenic prediction for the novel missense variants found in this study was performed with the Mitimpact 2 program.

  12. GLEAM version 3: Global Land Evaporation Datasets and Model

    Science.gov (United States)

    Martens, B.; Miralles, D. G.; Lievens, H.; van der Schalie, R.; de Jeu, R.; Fernandez-Prieto, D.; Verhoest, N.

    2015-12-01

    Terrestrial evaporation links energy, water and carbon cycles over land and is therefore a key variable of the climate system. However, the global-scale magnitude and variability of the flux, and the sensitivity of the underlying physical process to changes in environmental factors, are still poorly understood due to limitations in in situ measurements. As a result, several methods have risen to estimate global patterns of land evaporation from satellite observations. However, these algorithms generally differ in their approach to model evaporation, resulting in large differences in their estimates. One of these methods is GLEAM, the Global Land Evaporation: the Amsterdam Methodology. GLEAM estimates terrestrial evaporation based on daily satellite observations of meteorological variables, vegetation characteristics and soil moisture. Since the publication of the first version of the algorithm (2011), the model has been widely applied to analyse trends in the water cycle and land-atmospheric feedbacks during extreme hydrometeorological events. A third version of the GLEAM global datasets is foreseen by the end of 2015. Given the relevance of having a continuous and reliable record of global-scale evaporation estimates for climate and hydrological research, the establishment of an online data portal to host these data to the public is also foreseen. In this new release of the GLEAM datasets, different components of the model have been updated, with the most significant change being the revision of the data assimilation algorithm. In this presentation, we will highlight the most important changes of the methodology and present three new GLEAM datasets and their validation against in situ observations and an alternative dataset of terrestrial evaporation (ERA-Land). Results of the validation exercise indicate that the magnitude and the spatiotemporal variability of the modelled evaporation agree reasonably well with the estimates of ERA-Land and the in situ

  13. Soil chemistry in lithologically diverse datasets: the quartz dilution effect

    Science.gov (United States)

    Bern, Carleton R.

    2009-01-01

    National- and continental-scale soil geochemical datasets are likely to move our understanding of broad soil geochemistry patterns forward significantly. Patterns of chemistry and mineralogy delineated from these datasets are strongly influenced by the composition of the soil parent material, which itself is largely a function of lithology and particle size sorting. Such controls present a challenge by obscuring subtler patterns arising from subsequent pedogenic processes. Here the effect of quartz concentration is examined in moist-climate soils from a pilot dataset of the North American Soil Geochemical Landscapes Project. Due to variable and high quartz contents (6.2–81.7 wt.%), and its residual and inert nature in soil, quartz is demonstrated to influence broad patterns in soil chemistry. A dilution effect is observed whereby concentrations of various elements are significantly and strongly negatively correlated with quartz. Quartz content drives artificial positive correlations between concentrations of some elements and obscures negative correlations between others. Unadjusted soil data show the highly mobile base cations Ca, Mg, and Na to be often strongly positively correlated with intermediately mobile Al or Fe, and generally uncorrelated with the relatively immobile high-field-strength elements (HFS) Ti and Nb. Both patterns are contrary to broad expectations for soils being weathered and leached. After transforming bulk soil chemistry to a quartz-free basis, the base cations are generally uncorrelated with Al and Fe, and negative correlations generally emerge with the HFS elements. Quartz-free element data may be a useful tool for elucidating patterns of weathering or parent-material chemistry in large soil datasets.

  14. Dataset on records of Hericium erinaceus in Slovakia

    OpenAIRE

    Vladimír Kunca; Marek Čiliak

    2017-01-01

    The data presented in this article are related to the research article entitled ?Habitat preferences of Hericium erinaceus in Slovakia? (Kunca and ?iliak, 2016) [FUNECO607] [2]. The dataset include all available and unpublished data from Slovakia, besides the records from the same tree or stem. We compiled a database of records of collections by processing data from herbaria, personal records and communication with mycological activists. Data on altitude, tree species, host tree vital status,...

  15. Diffeomorphic Iterative Centroid Methods for Template Estimation on Large Datasets

    OpenAIRE

    Cury , Claire; Glaunès , Joan Alexis; Colliot , Olivier

    2014-01-01

    International audience; A common approach for analysis of anatomical variability relies on the stimation of a template representative of the population. The Large Deformation Diffeomorphic Metric Mapping is an attractive framework for that purpose. However, template estimation using LDDMM is computationally expensive, which is a limitation for the study of large datasets. This paper presents an iterative method which quickly provides a centroid of the population in the shape space. This centr...

  16. A Dataset from TIMSS to Examine the Relationship between Computer Use and Mathematics Achievement

    Science.gov (United States)

    Kadijevich, Djordje M.

    2015-01-01

    Because the relationship between computer use and achievement is still puzzling, there is a need to prepare and analyze good quality datasets on computer use and achievement. Such a dataset can be derived from TIMSS data. This paper describes how this dataset can be prepared. It also gives an example of how the dataset may be analyzed. The…

  17. An Analysis on Better Testing than Training Performances on the Iris Dataset

    NARCIS (Netherlands)

    Schutten, Marten; Wiering, Marco

    2016-01-01

    The Iris dataset is a well known dataset containing information on three different types of Iris flowers. A typical and popular method for solving classification problems on datasets such as the Iris set is the support vector machine (SVM). In order to do so the dataset is separated in a set used

  18. Parton Distributions based on a Maximally Consistent Dataset

    Science.gov (United States)

    Rojo, Juan

    2016-04-01

    The choice of data that enters a global QCD analysis can have a substantial impact on the resulting parton distributions and their predictions for collider observables. One of the main reasons for this has to do with the possible presence of inconsistencies, either internal within an experiment or external between different experiments. In order to assess the robustness of the global fit, different definitions of a conservative PDF set, that is, a PDF set based on a maximally consistent dataset, have been introduced. However, these approaches are typically affected by theory biases in the selection of the dataset. In this contribution, after a brief overview of recent NNPDF developments, we propose a new, fully objective, definition of a conservative PDF set, based on the Bayesian reweighting approach. Using the new NNPDF3.0 framework, we produce various conservative sets, which turn out to be mutually in agreement within the respective PDF uncertainties, as well as with the global fit. We explore some of their implications for LHC phenomenology, finding also good consistency with the global fit result. These results provide a non-trivial validation test of the new NNPDF3.0 fitting methodology, and indicate that possible inconsistencies in the fitted dataset do not affect substantially the global fit PDFs.

  19. New public dataset for spotting patterns in medieval document images

    Science.gov (United States)

    En, Sovann; Nicolas, Stéphane; Petitjean, Caroline; Jurie, Frédéric; Heutte, Laurent

    2017-01-01

    With advances in technology, a large part of our cultural heritage is becoming digitally available. In particular, in the field of historical document image analysis, there is now a growing need for indexing and data mining tools, thus allowing us to spot and retrieve the occurrences of an object of interest, called a pattern, in a large database of document images. Patterns may present some variability in terms of color, shape, or context, making the spotting of patterns a challenging task. Pattern spotting is a relatively new field of research, still hampered by the lack of available annotated resources. We present a new publicly available dataset named DocExplore dedicated to spotting patterns in historical document images. The dataset contains 1500 images and 1464 queries, and allows the evaluation of two tasks: image retrieval and pattern localization. A standardized benchmark protocol along with ad hoc metrics is provided for a fair comparison of the submitted approaches. We also provide some first results obtained with our baseline system on this new dataset, which show that there is room for improvement and that should encourage researchers of the document image analysis community to design new systems and submit improved results.

  20. Kernel-based discriminant feature extraction using a representative dataset

    Science.gov (United States)

    Li, Honglin; Sancho Gomez, Jose-Luis; Ahalt, Stanley C.

    2002-07-01

    Discriminant Feature Extraction (DFE) is widely recognized as an important pre-processing step in classification applications. Most DFE algorithms are linear and thus can only explore the linear discriminant information among the different classes. Recently, there has been several promising attempts to develop nonlinear DFE algorithms, among which is Kernel-based Feature Extraction (KFE). The efficacy of KFE has been experimentally verified by both synthetic data and real problems. However, KFE has some known limitations. First, KFE does not work well for strongly overlapped data. Second, KFE employs all of the training set samples during the feature extraction phase, which can result in significant computation when applied to very large datasets. Finally, KFE can result in overfitting. In this paper, we propose a substantial improvement to KFE that overcomes the above limitations by using a representative dataset, which consists of critical points that are generated from data-editing techniques and centroid points that are determined by using the Frequency Sensitive Competitive Learning (FSCL) algorithm. Experiments show that this new KFE algorithm performs well on significantly overlapped datasets, and it also reduces computational complexity. Further, by controlling the number of centroids, the overfitting problem can be effectively alleviated.

  1. Decoys Selection in Benchmarking Datasets: Overview and Perspectives

    Science.gov (United States)

    Réau, Manon; Langenfeld, Florent; Zagury, Jean-François; Lagarde, Nathalie; Montes, Matthieu

    2018-01-01

    Virtual Screening (VS) is designed to prospectively help identifying potential hits, i.e., compounds capable of interacting with a given target and potentially modulate its activity, out of large compound collections. Among the variety of methodologies, it is crucial to select the protocol that is the most adapted to the query/target system under study and that yields the most reliable output. To this aim, the performance of VS methods is commonly evaluated and compared by computing their ability to retrieve active compounds in benchmarking datasets. The benchmarking datasets contain a subset of known active compounds together with a subset of decoys, i.e., assumed non-active molecules. The composition of both the active and the decoy compounds subsets is critical to limit the biases in the evaluation of the VS methods. In this review, we focus on the selection of decoy compounds that has considerably changed over the years, from randomly selected compounds to highly customized or experimentally validated negative compounds. We first outline the evolution of decoys selection in benchmarking databases as well as current benchmarking databases that tend to minimize the introduction of biases, and secondly, we propose recommendations for the selection and the design of benchmarking datasets. PMID:29416509

  2. ENHANCED DATA DISCOVERABILITY FOR IN SITU HYPERSPECTRAL DATASETS

    Directory of Open Access Journals (Sweden)

    B. Rasaiah

    2016-06-01

    Full Text Available Field spectroscopic metadata is a central component in the quality assurance, reliability, and discoverability of hyperspectral data and the products derived from it. Cataloguing, mining, and interoperability of these datasets rely upon the robustness of metadata protocols for field spectroscopy, and on the software architecture to support the exchange of these datasets. Currently no standard for in situ spectroscopy data or metadata protocols exist. This inhibits the effective sharing of growing volumes of in situ spectroscopy datasets, to exploit the benefits of integrating with the evolving range of data sharing platforms. A core metadataset for field spectroscopy was introduced by Rasaiah et al., (2011-2015 with extended support for specific applications. This paper presents a prototype model for an OGC and ISO compliant platform-independent metadata discovery service aligned to the specific requirements of field spectroscopy. In this study, a proof-of-concept metadata catalogue has been described and deployed in a cloud-based architecture as a demonstration of an operationalized field spectroscopy metadata standard and web-based discovery service.

  3. Multiresolution persistent homology for excessively large biomolecular datasets

    Energy Technology Data Exchange (ETDEWEB)

    Xia, Kelin; Zhao, Zhixiong [Department of Mathematics, Michigan State University, East Lansing, Michigan 48824 (United States); Wei, Guo-Wei, E-mail: wei@math.msu.edu [Department of Mathematics, Michigan State University, East Lansing, Michigan 48824 (United States); Department of Electrical and Computer Engineering, Michigan State University, East Lansing, Michigan 48824 (United States); Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, Michigan 48824 (United States)

    2015-10-07

    Although persistent homology has emerged as a promising tool for the topological simplification of complex data, it is computationally intractable for large datasets. We introduce multiresolution persistent homology to handle excessively large datasets. We match the resolution with the scale of interest so as to represent large scale datasets with appropriate resolution. We utilize flexibility-rigidity index to access the topological connectivity of the data set and define a rigidity density for the filtration analysis. By appropriately tuning the resolution of the rigidity density, we are able to focus the topological lens on the scale of interest. The proposed multiresolution topological analysis is validated by a hexagonal fractal image which has three distinct scales. We further demonstrate the proposed method for extracting topological fingerprints from DNA molecules. In particular, the topological persistence of a virus capsid with 273 780 atoms is successfully analyzed which would otherwise be inaccessible to the normal point cloud method and unreliable by using coarse-grained multiscale persistent homology. The proposed method has also been successfully applied to the protein domain classification, which is the first time that persistent homology is used for practical protein domain analysis, to our knowledge. The proposed multiresolution topological method has potential applications in arbitrary data sets, such as social networks, biological networks, and graphs.

  4. Tissue-Based MRI Intensity Standardization: Application to Multicentric Datasets

    Directory of Open Access Journals (Sweden)

    Nicolas Robitaille

    2012-01-01

    Full Text Available Intensity standardization in MRI aims at correcting scanner-dependent intensity variations. Existing simple and robust techniques aim at matching the input image histogram onto a standard, while we think that standardization should aim at matching spatially corresponding tissue intensities. In this study, we present a novel automatic technique, called STI for STandardization of Intensities, which not only shares the simplicity and robustness of histogram-matching techniques, but also incorporates tissue spatial intensity information. STI uses joint intensity histograms to determine intensity correspondence in each tissue between the input and standard images. We compared STI to an existing histogram-matching technique on two multicentric datasets, Pilot E-ADNI and ADNI, by measuring the intensity error with respect to the standard image after performing nonlinear registration. The Pilot E-ADNI dataset consisted in 3 subjects each scanned in 7 different sites. The ADNI dataset consisted in 795 subjects scanned in more than 50 different sites. STI was superior to the histogram-matching technique, showing significantly better intensity matching for the brain white matter with respect to the standard image.

  5. Exploring massive, genome scale datasets with the genometricorr package

    KAUST Repository

    Favorov, Alexander; Mularoni, Loris; Cope, Leslie M.; Medvedeva, Yulia; Mironov, Andrey A.; Makeev, Vsevolod J.; Wheelan, Sarah J.

    2012-01-01

    We have created a statistically grounded tool for determining the correlation of genomewide data with other datasets or known biological features, intended to guide biological exploration of high-dimensional datasets, rather than providing immediate answers. The software enables several biologically motivated approaches to these data and here we describe the rationale and implementation for each approach. Our models and statistics are implemented in an R package that efficiently calculates the spatial correlation between two sets of genomic intervals (data and/or annotated features), for use as a metric of functional interaction. The software handles any type of pointwise or interval data and instead of running analyses with predefined metrics, it computes the significance and direction of several types of spatial association; this is intended to suggest potentially relevant relationships between the datasets. Availability and implementation: The package, GenometriCorr, can be freely downloaded at http://genometricorr.sourceforge.net/. Installation guidelines and examples are available from the sourceforge repository. The package is pending submission to Bioconductor. © 2012 Favorov et al.

  6. Image segmentation evaluation for very-large datasets

    Science.gov (United States)

    Reeves, Anthony P.; Liu, Shuang; Xie, Yiting

    2016-03-01

    With the advent of modern machine learning methods and fully automated image analysis there is a need for very large image datasets having documented segmentations for both computer algorithm training and evaluation. Current approaches of visual inspection and manual markings do not scale well to big data. We present a new approach that depends on fully automated algorithm outcomes for segmentation documentation, requires no manual marking, and provides quantitative evaluation for computer algorithms. The documentation of new image segmentations and new algorithm outcomes are achieved by visual inspection. The burden of visual inspection on large datasets is minimized by (a) customized visualizations for rapid review and (b) reducing the number of cases to be reviewed through analysis of quantitative segmentation evaluation. This method has been applied to a dataset of 7,440 whole-lung CT images for 6 different segmentation algorithms designed to fully automatically facilitate the measurement of a number of very important quantitative image biomarkers. The results indicate that we could achieve 93% to 99% successful segmentation for these algorithms on this relatively large image database. The presented evaluation method may be scaled to much larger image databases.

  7. Exploring massive, genome scale datasets with the genometricorr package

    KAUST Repository

    Favorov, Alexander

    2012-05-31

    We have created a statistically grounded tool for determining the correlation of genomewide data with other datasets or known biological features, intended to guide biological exploration of high-dimensional datasets, rather than providing immediate answers. The software enables several biologically motivated approaches to these data and here we describe the rationale and implementation for each approach. Our models and statistics are implemented in an R package that efficiently calculates the spatial correlation between two sets of genomic intervals (data and/or annotated features), for use as a metric of functional interaction. The software handles any type of pointwise or interval data and instead of running analyses with predefined metrics, it computes the significance and direction of several types of spatial association; this is intended to suggest potentially relevant relationships between the datasets. Availability and implementation: The package, GenometriCorr, can be freely downloaded at http://genometricorr.sourceforge.net/. Installation guidelines and examples are available from the sourceforge repository. The package is pending submission to Bioconductor. © 2012 Favorov et al.

  8. Principal Component Analysis of Process Datasets with Missing Values

    Directory of Open Access Journals (Sweden)

    Kristen A. Severson

    2017-07-01

    Full Text Available Datasets with missing values arising from causes such as sensor failure, inconsistent sampling rates, and merging data from different systems are common in the process industry. Methods for handling missing data typically operate during data pre-processing, but can also occur during model building. This article considers missing data within the context of principal component analysis (PCA, which is a method originally developed for complete data that has widespread industrial application in multivariate statistical process control. Due to the prevalence of missing data and the success of PCA for handling complete data, several PCA algorithms that can act on incomplete data have been proposed. Here, algorithms for applying PCA to datasets with missing values are reviewed. A case study is presented to demonstrate the performance of the algorithms and suggestions are made with respect to choosing which algorithm is most appropriate for particular settings. An alternating algorithm based on the singular value decomposition achieved the best results in the majority of test cases involving process datasets.

  9. A cross-country Exchange Market Pressure (EMP dataset

    Directory of Open Access Journals (Sweden)

    Mohit Desai

    2017-06-01

    Full Text Available The data presented in this article are related to the research article titled - “An exchange market pressure measure for cross country analysis” (Patnaik et al. [1]. In this article, we present the dataset for Exchange Market Pressure values (EMP for 139 countries along with their conversion factors, ρ (rho. Exchange Market Pressure, expressed in percentage change in exchange rate, measures the change in exchange rate that would have taken place had the central bank not intervened. The conversion factor ρ can interpreted as the change in exchange rate associated with $1 billion of intervention. Estimates of conversion factor ρ allow us to calculate a monthly time series of EMP for 139 countries. Additionally, the dataset contains the 68% confidence interval (high and low values for the point estimates of ρ’s. Using the standard errors of estimates of ρ’s, we obtain one sigma intervals around mean estimates of EMP values. These values are also reported in the dataset.

  10. The Role of Datasets on Scientific Influence within Conflict Research

    Science.gov (United States)

    Van Holt, Tracy; Johnson, Jeffery C.; Moates, Shiloh; Carley, Kathleen M.

    2016-01-01

    We inductively tested if a coherent field of inquiry in human conflict research emerged in an analysis of published research involving “conflict” in the Web of Science (WoS) over a 66-year period (1945–2011). We created a citation network that linked the 62,504 WoS records and their cited literature. We performed a critical path analysis (CPA), a specialized social network analysis on this citation network (~1.5 million works), to highlight the main contributions in conflict research and to test if research on conflict has in fact evolved to represent a coherent field of inquiry. Out of this vast dataset, 49 academic works were highlighted by the CPA suggesting a coherent field of inquiry; which means that researchers in the field acknowledge seminal contributions and share a common knowledge base. Other conflict concepts that were also analyzed—such as interpersonal conflict or conflict among pharmaceuticals, for example, did not form their own CP. A single path formed, meaning that there was a cohesive set of ideas that built upon previous research. This is in contrast to a main path analysis of conflict from 1957–1971 where ideas didn’t persist in that multiple paths existed and died or emerged reflecting lack of scientific coherence (Carley, Hummon, and Harty, 1993). The critical path consisted of a number of key features: 1) Concepts that built throughout include the notion that resource availability drives conflict, which emerged in the 1960s-1990s and continued on until 2011. More recent intrastate studies that focused on inequalities emerged from interstate studies on the democracy of peace earlier on the path. 2) Recent research on the path focused on forecasting conflict, which depends on well-developed metrics and theories to model. 3) We used keyword analysis to independently show how the CP was topically linked (i.e., through democracy, modeling, resources, and geography). Publically available conflict datasets developed early on helped

  11. The Role of Datasets on Scientific Influence within Conflict Research.

    Directory of Open Access Journals (Sweden)

    Tracy Van Holt

    Full Text Available We inductively tested if a coherent field of inquiry in human conflict research emerged in an analysis of published research involving "conflict" in the Web of Science (WoS over a 66-year period (1945-2011. We created a citation network that linked the 62,504 WoS records and their cited literature. We performed a critical path analysis (CPA, a specialized social network analysis on this citation network (~1.5 million works, to highlight the main contributions in conflict research and to test if research on conflict has in fact evolved to represent a coherent field of inquiry. Out of this vast dataset, 49 academic works were highlighted by the CPA suggesting a coherent field of inquiry; which means that researchers in the field acknowledge seminal contributions and share a common knowledge base. Other conflict concepts that were also analyzed-such as interpersonal conflict or conflict among pharmaceuticals, for example, did not form their own CP. A single path formed, meaning that there was a cohesive set of ideas that built upon previous research. This is in contrast to a main path analysis of conflict from 1957-1971 where ideas didn't persist in that multiple paths existed and died or emerged reflecting lack of scientific coherence (Carley, Hummon, and Harty, 1993. The critical path consisted of a number of key features: 1 Concepts that built throughout include the notion that resource availability drives conflict, which emerged in the 1960s-1990s and continued on until 2011. More recent intrastate studies that focused on inequalities emerged from interstate studies on the democracy of peace earlier on the path. 2 Recent research on the path focused on forecasting conflict, which depends on well-developed metrics and theories to model. 3 We used keyword analysis to independently show how the CP was topically linked (i.e., through democracy, modeling, resources, and geography. Publically available conflict datasets developed early on helped

  12. The Role of Datasets on Scientific Influence within Conflict Research.

    Science.gov (United States)

    Van Holt, Tracy; Johnson, Jeffery C; Moates, Shiloh; Carley, Kathleen M

    2016-01-01

    We inductively tested if a coherent field of inquiry in human conflict research emerged in an analysis of published research involving "conflict" in the Web of Science (WoS) over a 66-year period (1945-2011). We created a citation network that linked the 62,504 WoS records and their cited literature. We performed a critical path analysis (CPA), a specialized social network analysis on this citation network (~1.5 million works), to highlight the main contributions in conflict research and to test if research on conflict has in fact evolved to represent a coherent field of inquiry. Out of this vast dataset, 49 academic works were highlighted by the CPA suggesting a coherent field of inquiry; which means that researchers in the field acknowledge seminal contributions and share a common knowledge base. Other conflict concepts that were also analyzed-such as interpersonal conflict or conflict among pharmaceuticals, for example, did not form their own CP. A single path formed, meaning that there was a cohesive set of ideas that built upon previous research. This is in contrast to a main path analysis of conflict from 1957-1971 where ideas didn't persist in that multiple paths existed and died or emerged reflecting lack of scientific coherence (Carley, Hummon, and Harty, 1993). The critical path consisted of a number of key features: 1) Concepts that built throughout include the notion that resource availability drives conflict, which emerged in the 1960s-1990s and continued on until 2011. More recent intrastate studies that focused on inequalities emerged from interstate studies on the democracy of peace earlier on the path. 2) Recent research on the path focused on forecasting conflict, which depends on well-developed metrics and theories to model. 3) We used keyword analysis to independently show how the CP was topically linked (i.e., through democracy, modeling, resources, and geography). Publically available conflict datasets developed early on helped shape the

  13. Animated analysis of geoscientific datasets: An interactive graphical application

    Science.gov (United States)

    Morse, Peter; Reading, Anya; Lueg, Christopher

    2017-12-01

    Geoscientists are required to analyze and draw conclusions from increasingly large volumes of data. There is a need to recognise and characterise features and changing patterns of Earth observables within such large datasets. It is also necessary to identify significant subsets of the data for more detailed analysis. We present an innovative, interactive software tool and workflow to visualise, characterise, sample and tag large geoscientific datasets from both local and cloud-based repositories. It uses an animated interface and human-computer interaction to utilise the capacity of human expert observers to identify features via enhanced visual analytics. 'Tagger' enables users to analyze datasets that are too large in volume to be drawn legibly on a reasonable number of single static plots. Users interact with the moving graphical display, tagging data ranges of interest for subsequent attention. The tool provides a rapid pre-pass process using fast GPU-based OpenGL graphics and data-handling and is coded in the Quartz Composer visual programing language (VPL) on Mac OSX. It makes use of interoperable data formats, and cloud-based (or local) data storage and compute. In a case study, Tagger was used to characterise a decade (2000-2009) of data recorded by the Cape Sorell Waverider Buoy, located approximately 10 km off the west coast of Tasmania, Australia. These data serve as a proxy for the understanding of Southern Ocean storminess, which has both local and global implications. This example shows use of the tool to identify and characterise 4 different types of storm and non-storm events during this time. Events characterised in this way are compared with conventional analysis, noting advantages and limitations of data analysis using animation and human interaction. Tagger provides a new ability to make use of humans as feature detectors in computer-based analysis of large-volume geosciences and other data.

  14. Designing the colorectal cancer core dataset in Iran

    Directory of Open Access Journals (Sweden)

    Sara Dorri

    2017-01-01

    Full Text Available Background: There is no need to explain the importance of collection, recording and analyzing the information of disease in any health organization. In this regard, systematic design of standard data sets can be helpful to record uniform and consistent information. It can create interoperability between health care systems. The main purpose of this study was design the core dataset to record colorectal cancer information in Iran. Methods: For the design of the colorectal cancer core data set, a combination of literature review and expert consensus were used. In the first phase, the draft of the data set was designed based on colorectal cancer literature review and comparative studies. Then, in the second phase, this data set was evaluated by experts from different discipline such as medical informatics, oncology and surgery. Their comments and opinion were taken. In the third phase refined data set, was evaluated again by experts and eventually data set was proposed. Results: In first phase, based on the literature review, a draft set of 85 data elements was designed. In the second phase this data set was evaluated by experts and supplementary information was offered by professionals in subgroups especially in treatment part. In this phase the number of elements totally were arrived to 93 numbers. In the third phase, evaluation was conducted by experts and finally this dataset was designed in five main parts including: demographic information, diagnostic information, treatment information, clinical status assessment information, and clinical trial information. Conclusion: In this study the comprehensive core data set of colorectal cancer was designed. This dataset in the field of collecting colorectal cancer information can be useful through facilitating exchange of health information. Designing such data set for similar disease can help providers to collect standard data from patients and can accelerate retrieval from storage systems.

  15. FTSPlot: fast time series visualization for large datasets.

    Directory of Open Access Journals (Sweden)

    Michael Riss

    Full Text Available The analysis of electrophysiological recordings often involves visual inspection of time series data to locate specific experiment epochs, mask artifacts, and verify the results of signal processing steps, such as filtering or spike detection. Long-term experiments with continuous data acquisition generate large amounts of data. Rapid browsing through these massive datasets poses a challenge to conventional data plotting software because the plotting time increases proportionately to the increase in the volume of data. This paper presents FTSPlot, which is a visualization concept for large-scale time series datasets using techniques from the field of high performance computer graphics, such as hierarchic level of detail and out-of-core data handling. In a preprocessing step, time series data, event, and interval annotations are converted into an optimized data format, which then permits fast, interactive visualization. The preprocessing step has a computational complexity of O(n x log(N; the visualization itself can be done with a complexity of O(1 and is therefore independent of the amount of data. A demonstration prototype has been implemented and benchmarks show that the technology is capable of displaying large amounts of time series data, event, and interval annotations lag-free with < 20 ms ms. The current 64-bit implementation theoretically supports datasets with up to 2(64 bytes, on the x86_64 architecture currently up to 2(48 bytes are supported, and benchmarks have been conducted with 2(40 bytes/1 TiB or 1.3 x 10(11 double precision samples. The presented software is freely available and can be included as a Qt GUI component in future software projects, providing a standard visualization method for long-term electrophysiological experiments.

  16. A synthetic dataset for evaluating soft and hard fusion algorithms

    Science.gov (United States)

    Graham, Jacob L.; Hall, David L.; Rimland, Jeffrey

    2011-06-01

    There is an emerging demand for the development of data fusion techniques and algorithms that are capable of combining conventional "hard" sensor inputs such as video, radar, and multispectral sensor data with "soft" data including textual situation reports, open-source web information, and "hard/soft" data such as image or video data that includes human-generated annotations. New techniques that assist in sense-making over a wide range of vastly heterogeneous sources are critical to improving tactical situational awareness in counterinsurgency (COIN) and other asymmetric warfare situations. A major challenge in this area is the lack of realistic datasets available for test and evaluation of such algorithms. While "soft" message sets exist, they tend to be of limited use for data fusion applications due to the lack of critical message pedigree and other metadata. They also lack corresponding hard sensor data that presents reasonable "fusion opportunities" to evaluate the ability to make connections and inferences that span the soft and hard data sets. This paper outlines the design methodologies, content, and some potential use cases of a COIN-based synthetic soft and hard dataset created under a United States Multi-disciplinary University Research Initiative (MURI) program funded by the U.S. Army Research Office (ARO). The dataset includes realistic synthetic reports from a variety of sources, corresponding synthetic hard data, and an extensive supporting database that maintains "ground truth" through logical grouping of related data into "vignettes." The supporting database also maintains the pedigree of messages and other critical metadata.

  17. Identifying frauds and anomalies in Medicare-B dataset.

    Science.gov (United States)

    Jiwon Seo; Mendelevitch, Ofer

    2017-07-01

    Healthcare industry is growing at a rapid rate to reach a market value of $7 trillion dollars world wide. At the same time, fraud in healthcare is becoming a serious problem, amounting to 5% of the total healthcare spending, or $100 billion dollars each year in US. Manually detecting healthcare fraud requires much effort. Recently, machine learning and data mining techniques are applied to automatically detect healthcare frauds. This paper proposes a novel PageRank-based algorithm to detect healthcare frauds and anomalies. We apply the algorithm to Medicare-B dataset, a real-life data with 10 million healthcare insurance claims. The algorithm successfully identifies tens of previously unreported anomalies.

  18. Power analysis dataset for QCA based multiplexer circuits

    Directory of Open Access Journals (Sweden)

    Md. Abdullah-Al-Shafi

    2017-04-01

    Full Text Available Power consumption in irreversible QCA logic circuits is a vital and a major issue; however in the practical cases, this focus is mostly omitted.The complete power depletion dataset of different QCA multiplexers have been worked out in this paper. At −271.15 °C temperature, the depletion is evaluated under three separate tunneling energy levels. All the circuits are designed with QCADesigner, a broadly used simulation engine and QCAPro tool has been applied for estimating the power dissipation.

  19. Equalizing imbalanced imprecise datasets for genetic fuzzy classifiers

    Directory of Open Access Journals (Sweden)

    AnaM. Palacios

    2012-04-01

    Full Text Available Determining whether an imprecise dataset is imbalanced is not immediate. The vagueness in the data causes that the prior probabilities of the classes are not precisely known, and therefore the degree of imbalance can also be uncertain. In this paper we propose suitable extensions of different resampling algorithms that can be applied to interval valued, multi-labelled data. By means of these extended preprocessing algorithms, certain classification systems designed for minimizing the fraction of misclassifications are able to produce knowledge bases that are also adequate under common metrics for imbalanced classification.

  20. Scientific Datasets: Discovery and Aggregation for Semantic Interpretation.

    Science.gov (United States)

    Lopez, L. A.; Scott, S.; Khalsa, S. J. S.; Duerr, R.

    2015-12-01

    One of the biggest challenges that interdisciplinary researchers face is finding suitable datasets in order to advance their science; this problem remains consistent across multiple disciplines. A surprising number of scientists, when asked what tool they use for data discovery, reply "Google", which is an acceptable solution in some cases but not even Google can find -or cares to compile- all the data that's relevant for science and particularly geo sciences. If a dataset is not discoverable through a well known search provider it will remain dark data to the scientific world.For the past year, BCube, an EarthCube Building Block project, has been developing, testing and deploying a technology stack capable of data discovery at web-scale using the ultimate dataset: The Internet. This stack has 2 principal components, a web-scale crawling infrastructure and a semantic aggregator. The web-crawler is a modified version of Apache Nutch (the originator of Hadoop and other big data technologies) that has been improved and tailored for data and data service discovery. The second component is semantic aggregation, carried out by a python-based workflow that extracts valuable metadata and stores it in the form of triples through the use semantic technologies.While implementing the BCube stack we have run into several challenges such as a) scaling the project to cover big portions of the Internet at a reasonable cost, b) making sense of very diverse and non-homogeneous data, and lastly, c) extracting facts about these datasets using semantic technologies in order to make them usable for the geosciences community. Despite all these challenges we have proven that we can discover and characterize data that otherwise would have remained in the dark corners of the Internet. Having all this data indexed and 'triplelized' will enable scientists to access a trove of information relevant to their work in a more natural way. An important characteristic of the BCube stack is that all

  1. Dataset concerning the analytical approximation of the Ae3 temperature

    Directory of Open Access Journals (Sweden)

    B.L. Ennis

    2017-02-01

    The dataset includes the terms of the function and the values for the polynomial coefficients for major alloying elements in steel. A short description of the approximation method used to derive and validate the coefficients has also been included. For discussion and application of this model, please refer to the full length article entitled “The role of aluminium in chemical and phase segregation in a TRIP-assisted dual phase steel” 10.1016/j.actamat.2016.05.046 (Ennis et al., 2016 [1].

  2. Gene set analysis of the EADGENE chicken data-set

    DEFF Research Database (Denmark)

    Skarman, Axel; Jiang, Li; Hornshøj, Henrik

    2009-01-01

     Abstract Background: Gene set analysis is considered to be a way of improving our biological interpretation of the observed expression patterns. This paper describes different methods applied to analyse expression data from a chicken DNA microarray dataset. Results: Applying different gene set...... analyses to the chicken expression data led to different ranking of the Gene Ontology terms tested. A method for prediction of possible annotations was applied. Conclusion: Biological interpretation based on gene set analyses dependent on the statistical method used. Methods for predicting the possible...

  3. A Validation Dataset for CryoSat Sea Ice Investigators

    DEFF Research Database (Denmark)

    Julia, Gaudelli,; Baker, Steve; Haas, Christian

    Since its launch in April 2010 Cryosat has been collecting valuable sea ice data over the Arctic region. Over the same period ESA’s CryoVEx and NASA IceBridge validation campaigns have been collecting a unique set of coincident airborne measurements in the Arctic. The CryoVal-SI project has...... community. In this talk we will describe the composition of the validation dataset, summarising how it was processed and how to understand the content and format of the data. We will also explain how to access the data and the supporting documentation....

  4. Dataset of statements on policy integration of selected intergovernmental organizations

    Directory of Open Access Journals (Sweden)

    Jale Tosun

    2018-04-01

    Full Text Available This article describes data for 78 intergovernmental organizations (IGOs working on topics related to energy governance, environmental protection, and the economy. The number of IGOs covered also includes organizations active in other sectors. The point of departure for data construction was the Correlates of War dataset, from which we selected this sample of IGOs. We updated and expanded the empirical information on the IGOs selected by manual coding. Most importantly, we collected the primary law texts of the individual IGOs in order to code whether they commit themselves to environmental policy integration (EPI, climate policy integration (CPI and/or energy policy integration (EnPI.

  5. Dataset on the energy performance of atrium type hotel buildings.

    Science.gov (United States)

    Vujosevic, Milica; Krstic-Furundzic, Aleksandra

    2018-04-01

    The data presented in this article are related to the research article entitled "The Influence of Atrium on Energy Performance of Hotel Building" (Vujosevic and Krstic-Furundzic, 2017) [1], which describes the annual energy performance of atrium type hotel building in Belgrade climate conditions, with the objective to present the impact of the atrium on the hotel building's energy demands for space heating and cooling. This dataset is made publicly available to show energy performance of selected hotel design alternatives, in order to enable extended analyzes of these data for other researchers.

  6. Dataset on records of Hericium erinaceus in Slovakia.

    Science.gov (United States)

    Kunca, Vladimír; Čiliak, Marek

    2017-06-01

    The data presented in this article are related to the research article entitled "Habitat preferences of Hericium erinaceus in Slovakia" (Kunca and Čiliak, 2016) [FUNECO607] [2]. The dataset include all available and unpublished data from Slovakia, besides the records from the same tree or stem. We compiled a database of records of collections by processing data from herbaria, personal records and communication with mycological activists. Data on altitude, tree species, host tree vital status, host tree position and intensity of management of forest stands were evaluated in this study. All surveys were based on basidioma occurrence and some result from targeted searches.

  7. Dataset on records of Hericium erinaceus in Slovakia

    Directory of Open Access Journals (Sweden)

    Vladimír Kunca

    2017-06-01

    Full Text Available The data presented in this article are related to the research article entitled “Habitat preferences of Hericium erinaceus in Slovakia” (Kunca and Čiliak, 2016 [FUNECO607] [2]. The dataset include all available and unpublished data from Slovakia, besides the records from the same tree or stem. We compiled a database of records of collections by processing data from herbaria, personal records and communication with mycological activists. Data on altitude, tree species, host tree vital status, host tree position and intensity of management of forest stands were evaluated in this study. All surveys were based on basidioma occurrence and some result from targeted searches.

  8. Machine-learned and codified synthesis parameters of oxide materials

    Science.gov (United States)

    Kim, Edward; Huang, Kevin; Tomala, Alex; Matthews, Sara; Strubell, Emma; Saunders, Adam; McCallum, Andrew; Olivetti, Elsa

    2017-09-01

    Predictive materials design has rapidly accelerated in recent years with the advent of large-scale resources, such as materials structure and property databases generated by ab initio computations. In the absence of analogous ab initio frameworks for materials synthesis, high-throughput and machine learning techniques have recently been harnessed to generate synthesis strategies for select materials of interest. Still, a community-accessible, autonomously-compiled synthesis planning resource which spans across materials systems has not yet been developed. In this work, we present a collection of aggregated synthesis parameters computed using the text contained within over 640,000 journal articles using state-of-the-art natural language processing and machine learning techniques. We provide a dataset of synthesis parameters, compiled autonomously across 30 different oxide systems, in a format optimized for planning novel syntheses of materials.

  9. Parallel Framework for Dimensionality Reduction of Large-Scale Datasets

    Directory of Open Access Journals (Sweden)

    Sai Kiranmayee Samudrala

    2015-01-01

    Full Text Available Dimensionality reduction refers to a set of mathematical techniques used to reduce complexity of the original high-dimensional data, while preserving its selected properties. Improvements in simulation strategies and experimental data collection methods are resulting in a deluge of heterogeneous and high-dimensional data, which often makes dimensionality reduction the only viable way to gain qualitative and quantitative understanding of the data. However, existing dimensionality reduction software often does not scale to datasets arising in real-life applications, which may consist of thousands of points with millions of dimensions. In this paper, we propose a parallel framework for dimensionality reduction of large-scale data. We identify key components underlying the spectral dimensionality reduction techniques, and propose their efficient parallel implementation. We show that the resulting framework can be used to process datasets consisting of millions of points when executed on a 16,000-core cluster, which is beyond the reach of currently available methods. To further demonstrate applicability of our framework we perform dimensionality reduction of 75,000 images representing morphology evolution during manufacturing of organic solar cells in order to identify how processing parameters affect morphology evolution.

  10. The Path from Large Earth Science Datasets to Information

    Science.gov (United States)

    Vicente, G. A.

    2013-12-01

    The NASA Goddard Earth Sciences Data (GES) and Information Services Center (DISC) is one of the major Science Mission Directorate (SMD) for archiving and distribution of Earth Science remote sensing data, products and services. This virtual portal provides convenient access to Atmospheric Composition and Dynamics, Hydrology, Precipitation, Ozone, and model derived datasets (generated by GSFC's Global Modeling and Assimilation Office), the North American Land Data Assimilation System (NLDAS) and the Global Land Data Assimilation System (GLDAS) data products (both generated by GSFC's Hydrological Sciences Branch). This presentation demonstrates various tools and computational technologies developed in the GES DISC to manage the huge volume of data and products acquired from various missions and programs over the years. It explores approaches to archive, document, distribute, access and analyze Earth Science data and information as well as addresses the technical and scientific issues, governance and user support problem faced by scientists in need of multi-disciplinary datasets. It also discusses data and product metrics, user distribution profiles and lessons learned through interactions with the science communities around the world. Finally it demonstrates some of the most used data and product visualization and analyses tools developed and maintained by the GES DISC.

  11. Benchmarking undedicated cloud computing providers for analysis of genomic datasets.

    Science.gov (United States)

    Yazar, Seyhan; Gooden, George E C; Mackey, David A; Hewitt, Alex W

    2014-01-01

    A major bottleneck in biological discovery is now emerging at the computational level. Cloud computing offers a dynamic means whereby small and medium-sized laboratories can rapidly adjust their computational capacity. We benchmarked two established cloud computing services, Amazon Web Services Elastic MapReduce (EMR) on Amazon EC2 instances and Google Compute Engine (GCE), using publicly available genomic datasets (E.coli CC102 strain and a Han Chinese male genome) and a standard bioinformatic pipeline on a Hadoop-based platform. Wall-clock time for complete assembly differed by 52.9% (95% CI: 27.5-78.2) for E.coli and 53.5% (95% CI: 34.4-72.6) for human genome, with GCE being more efficient than EMR. The cost of running this experiment on EMR and GCE differed significantly, with the costs on EMR being 257.3% (95% CI: 211.5-303.1) and 173.9% (95% CI: 134.6-213.1) more expensive for E.coli and human assemblies respectively. Thus, GCE was found to outperform EMR both in terms of cost and wall-clock time. Our findings confirm that cloud computing is an efficient and potentially cost-effective alternative for analysis of large genomic datasets. In addition to releasing our cost-effectiveness comparison, we present available ready-to-use scripts for establishing Hadoop instances with Ganglia monitoring on EC2 or GCE.

  12. Robust computational analysis of rRNA hypervariable tag datasets.

    Directory of Open Access Journals (Sweden)

    Maksim Sipos

    Full Text Available Next-generation DNA sequencing is increasingly being utilized to probe microbial communities, such as gastrointestinal microbiomes, where it is important to be able to quantify measures of abundance and diversity. The fragmented nature of the 16S rRNA datasets obtained, coupled with their unprecedented size, has led to the recognition that the results of such analyses are potentially contaminated by a variety of artifacts, both experimental and computational. Here we quantify how multiple alignment and clustering errors contribute to overestimates of abundance and diversity, reflected by incorrect OTU assignment, corrupted phylogenies, inaccurate species diversity estimators, and rank abundance distribution functions. We show that straightforward procedural optimizations, combining preexisting tools, are effective in handling large (10(5-10(6 16S rRNA datasets, and we describe metrics to measure the effectiveness and quality of the estimators obtained. We introduce two metrics to ascertain the quality of clustering of pyrosequenced rRNA data, and show that complete linkage clustering greatly outperforms other widely used methods.

  13. BLAST-EXPLORER helps you building datasets for phylogenetic analysis

    Directory of Open Access Journals (Sweden)

    Claverie Jean-Michel

    2010-01-01

    Full Text Available Abstract Background The right sampling of homologous sequences for phylogenetic or molecular evolution analyses is a crucial step, the quality of which can have a significant impact on the final interpretation of the study. There is no single way for constructing datasets suitable for phylogenetic analysis, because this task intimately depends on the scientific question we want to address, Moreover, database mining softwares such as BLAST which are routinely used for searching homologous sequences are not specifically optimized for this task. Results To fill this gap, we designed BLAST-Explorer, an original and friendly web-based application that combines a BLAST search with a suite of tools that allows interactive, phylogenetic-oriented exploration of the BLAST results and flexible selection of homologous sequences among the BLAST hits. Once the selection of the BLAST hits is done using BLAST-Explorer, the corresponding sequence can be imported locally for external analysis or passed to the phylogenetic tree reconstruction pipelines available on the Phylogeny.fr platform. Conclusions BLAST-Explorer provides a simple, intuitive and interactive graphical representation of the BLAST results and allows selection and retrieving of the BLAST hit sequences based a wide range of criterions. Although BLAST-Explorer primarily aims at helping the construction of sequence datasets for further phylogenetic study, it can also be used as a standard BLAST server with enriched output. BLAST-Explorer is available at http://www.phylogeny.fr

  14. Multiresolution comparison of precipitation datasets for large-scale models

    Science.gov (United States)

    Chun, K. P.; Sapriza Azuri, G.; Davison, B.; DeBeer, C. M.; Wheater, H. S.

    2014-12-01

    Gridded precipitation datasets are crucial for driving large-scale models which are related to weather forecast and climate research. However, the quality of precipitation products is usually validated individually. Comparisons between gridded precipitation products along with ground observations provide another avenue for investigating how the precipitation uncertainty would affect the performance of large-scale models. In this study, using data from a set of precipitation gauges over British Columbia and Alberta, we evaluate several widely used North America gridded products including the Canadian Gridded Precipitation Anomalies (CANGRD), the National Center for Environmental Prediction (NCEP) reanalysis, the Water and Global Change (WATCH) project, the thin plate spline smoothing algorithms (ANUSPLIN) and Canadian Precipitation Analysis (CaPA). Based on verification criteria for various temporal and spatial scales, results provide an assessment of possible applications for various precipitation datasets. For long-term climate variation studies (~100 years), CANGRD, NCEP, WATCH and ANUSPLIN have different comparative advantages in terms of their resolution and accuracy. For synoptic and mesoscale precipitation patterns, CaPA provides appealing performance of spatial coherence. In addition to the products comparison, various downscaling methods are also surveyed to explore new verification and bias-reduction methods for improving gridded precipitation outputs for large-scale models.

  15. Benchmarking Deep Learning Models on Large Healthcare Datasets.

    Science.gov (United States)

    Purushotham, Sanjay; Meng, Chuizheng; Che, Zhengping; Liu, Yan

    2018-06-04

    Deep learning models (aka Deep Neural Networks) have revolutionized many fields including computer vision, natural language processing, speech recognition, and is being increasingly used in clinical healthcare applications. However, few works exist which have benchmarked the performance of the deep learning models with respect to the state-of-the-art machine learning models and prognostic scoring systems on publicly available healthcare datasets. In this paper, we present the benchmarking results for several clinical prediction tasks such as mortality prediction, length of stay prediction, and ICD-9 code group prediction using Deep Learning models, ensemble of machine learning models (Super Learner algorithm), SAPS II and SOFA scores. We used the Medical Information Mart for Intensive Care III (MIMIC-III) (v1.4) publicly available dataset, which includes all patients admitted to an ICU at the Beth Israel Deaconess Medical Center from 2001 to 2012, for the benchmarking tasks. Our results show that deep learning models consistently outperform all the other approaches especially when the 'raw' clinical time series data is used as input features to the models. Copyright © 2018 Elsevier Inc. All rights reserved.

  16. Testing the Neutral Theory of Biodiversity with Human Microbiome Datasets.

    Science.gov (United States)

    Li, Lianwei; Ma, Zhanshan Sam

    2016-08-16

    The human microbiome project (HMP) has made it possible to test important ecological theories for arguably the most important ecosystem to human health-the human microbiome. Existing limited number of studies have reported conflicting evidence in the case of the neutral theory; the present study aims to comprehensively test the neutral theory with extensive HMP datasets covering all five major body sites inhabited by the human microbiome. Utilizing 7437 datasets of bacterial community samples, we discovered that only 49 communities (less than 1%) satisfied the neutral theory, and concluded that human microbial communities are not neutral in general. The 49 positive cases, although only a tiny minority, do demonstrate the existence of neutral processes. We realize that the traditional doctrine of microbial biogeography "Everything is everywhere, but the environment selects" first proposed by Baas-Becking resolves the apparent contradiction. The first part of Baas-Becking doctrine states that microbes are not dispersal-limited and therefore are neutral prone, and the second part reiterates that the freely dispersed microbes must endure selection by the environment. Therefore, in most cases, it is the host environment that ultimately shapes the community assembly and tip the human microbiome to niche regime.

  17. Overview of the CERES Edition-4 Multilayer Cloud Property Datasets

    Science.gov (United States)

    Chang, F. L.; Minnis, P.; Sun-Mack, S.; Chen, Y.; Smith, R. A.; Brown, R. R.

    2014-12-01

    Knowledge of the cloud vertical distribution is important for understanding the role of clouds on earth's radiation budget and climate change. Since high-level cirrus clouds with low emission temperatures and small optical depths can provide a positive feedback to a climate system and low-level stratus clouds with high emission temperatures and large optical depths can provide a negative feedback effect, the retrieval of multilayer cloud properties using satellite observations, like Terra and Aqua MODIS, is critically important for a variety of cloud and climate applications. For the objective of the Clouds and the Earth's Radiant Energy System (CERES), new algorithms have been developed using Terra and Aqua MODIS data to allow separate retrievals of cirrus and stratus cloud properties when the two dominant cloud types are simultaneously present in a multilayer system. In this paper, we will present an overview of the new CERES Edition-4 multilayer cloud property datasets derived from Terra as well as Aqua. Assessment of the new CERES multilayer cloud datasets will include high-level cirrus and low-level stratus cloud heights, pressures, and temperatures as well as their optical depths, emissivities, and microphysical properties.

  18. Predicting weather regime transitions in Northern Hemisphere datasets

    Energy Technology Data Exchange (ETDEWEB)

    Kondrashov, D. [University of California, Department of Atmospheric and Oceanic Sciences and Institute of Geophysics and Planetary Physics, Los Angeles, CA (United States); Shen, J. [UCLA, Department of Statistics, Los Angeles, CA (United States); Berk, R. [UCLA, Department of Statistics, Los Angeles, CA (United States); University of Pennsylvania, Department of Criminology, Philadelphia, PA (United States); D' Andrea, F.; Ghil, M. [Ecole Normale Superieure, Departement Terre-Atmosphere-Ocean and Laboratoire de Meteorologie Dynamique (CNRS and IPSL), Paris Cedex 05 (France)

    2007-10-15

    A statistical learning method called random forests is applied to the prediction of transitions between weather regimes of wintertime Northern Hemisphere (NH) atmospheric low-frequency variability. A dataset composed of 55 winters of NH 700-mb geopotential height anomalies is used in the present study. A mixture model finds that the three Gaussian components that were statistically significant in earlier work are robust; they are the Pacific-North American (PNA) regime, its approximate reverse (the reverse PNA, or RNA), and the blocked phase of the North Atlantic Oscillation (BNAO). The most significant and robust transitions in the Markov chain generated by these regimes are PNA {yields} BNAO, PNA {yields} RNA and BNAO {yields} PNA. The break of a regime and subsequent onset of another one is forecast for these three transitions. Taking the relative costs of false positives and false negatives into account, the random-forests method shows useful forecasting skill. The calculations are carried out in the phase space spanned by a few leading empirical orthogonal functions of dataset variability. Plots of estimated response functions to a given predictor confirm the crucial influence of the exit angle on a preferred transition path. This result points to the dynamic origin of the transitions. (orig.)

  19. Digital Astronaut Photography: A Discovery Dataset for Archaeology

    Science.gov (United States)

    Stefanov, William L.

    2010-01-01

    Astronaut photography acquired from the International Space Station (ISS) using commercial off-the-shelf cameras offers a freely-accessible source for high to very high resolution (4-20 m/pixel) visible-wavelength digital data of Earth. Since ISS Expedition 1 in 2000, over 373,000 images of the Earth-Moon system (including land surface, ocean, atmospheric, and lunar images) have been added to the Gateway to Astronaut Photography of Earth online database (http://eol.jsc.nasa.gov ). Handheld astronaut photographs vary in look angle, time of acquisition, solar illumination, and spatial resolution. These attributes of digital astronaut photography result from a unique combination of ISS orbital dynamics, mission operations, camera systems, and the individual skills of the astronaut. The variable nature of astronaut photography makes the dataset uniquely useful for archaeological applications in comparison with more traditional nadir-viewing multispectral datasets acquired from unmanned orbital platforms. For example, surface features such as trenches, walls, ruins, urban patterns, and vegetation clearing and regrowth patterns may be accentuated by low sun angles and oblique viewing conditions (Fig. 1). High spatial resolution digital astronaut photographs can also be used with sophisticated land cover classification and spatial analysis approaches like Object Based Image Analysis, increasing the potential for use in archaeological characterization of landscapes and specific sites.

  20. ISC-EHB: Reconstruction of a robust earthquake dataset

    Science.gov (United States)

    Weston, J.; Engdahl, E. R.; Harris, J.; Di Giacomo, D.; Storchak, D. A.

    2018-04-01

    The EHB Bulletin of hypocentres and associated travel-time residuals was originally developed with procedures described by Engdahl, Van der Hilst and Buland (1998) and currently ends in 2008. It is a widely used seismological dataset, which is now expanded and reconstructed, partly by exploiting updated procedures at the International Seismological Centre (ISC), to produce the ISC-EHB. The reconstruction begins in the modern period (2000-2013) to which new and more rigorous procedures for event selection, data preparation, processing, and relocation are applied. The selection criteria minimise the location bias produced by unmodelled 3D Earth structure, resulting in events that are relatively well located in any given region. Depths of the selected events are significantly improved by a more comprehensive review of near station and secondary phase travel-time residuals based on ISC data, especially for the depth phases pP, pwP and sP, as well as by a rigorous review of the event depths in subduction zone cross sections. The resulting cross sections and associated maps are shown to provide details of seismicity in subduction zones in much greater detail than previously achievable. The new ISC-EHB dataset will be especially useful for global seismicity studies and high-frequency regional and global tomographic inversions.

  1. Benchmarking undedicated cloud computing providers for analysis of genomic datasets.

    Directory of Open Access Journals (Sweden)

    Seyhan Yazar

    Full Text Available A major bottleneck in biological discovery is now emerging at the computational level. Cloud computing offers a dynamic means whereby small and medium-sized laboratories can rapidly adjust their computational capacity. We benchmarked two established cloud computing services, Amazon Web Services Elastic MapReduce (EMR on Amazon EC2 instances and Google Compute Engine (GCE, using publicly available genomic datasets (E.coli CC102 strain and a Han Chinese male genome and a standard bioinformatic pipeline on a Hadoop-based platform. Wall-clock time for complete assembly differed by 52.9% (95% CI: 27.5-78.2 for E.coli and 53.5% (95% CI: 34.4-72.6 for human genome, with GCE being more efficient than EMR. The cost of running this experiment on EMR and GCE differed significantly, with the costs on EMR being 257.3% (95% CI: 211.5-303.1 and 173.9% (95% CI: 134.6-213.1 more expensive for E.coli and human assemblies respectively. Thus, GCE was found to outperform EMR both in terms of cost and wall-clock time. Our findings confirm that cloud computing is an efficient and potentially cost-effective alternative for analysis of large genomic datasets. In addition to releasing our cost-effectiveness comparison, we present available ready-to-use scripts for establishing Hadoop instances with Ganglia monitoring on EC2 or GCE.

  2. Condensing Massive Satellite Datasets For Rapid Interactive Analysis

    Science.gov (United States)

    Grant, G.; Gallaher, D. W.; Lv, Q.; Campbell, G. G.; Fowler, C.; LIU, Q.; Chen, C.; Klucik, R.; McAllister, R. A.

    2015-12-01

    Our goal is to enable users to interactively analyze massive satellite datasets, identifying anomalous data or values that fall outside of thresholds. To achieve this, the project seeks to create a derived database containing only the most relevant information, accelerating the analysis process. The database is designed to be an ancillary tool for the researcher, not an archival database to replace the original data. This approach is aimed at improving performance by reducing the overall size by way of condensing the data. The primary challenges of the project include: - The nature of the research question(s) may not be known ahead of time. - The thresholds for determining anomalies may be uncertain. - Problems associated with processing cloudy, missing, or noisy satellite imagery. - The contents and method of creation of the condensed dataset must be easily explainable to users. The architecture of the database will reorganize spatially-oriented satellite imagery into temporally-oriented columns of data (a.k.a., "data rods") to facilitate time-series analysis. The database itself is an open-source parallel database, designed to make full use of clustered server technologies. A demonstration of the system capabilities will be shown. Applications for this technology include quick-look views of the data, as well as the potential for on-board satellite processing of essential information, with the goal of reducing data latency.

  3. Utilizing the Antarctic Master Directory to find orphan datasets

    Science.gov (United States)

    Bonczkowski, J.; Carbotte, S. M.; Arko, R. A.; Grebas, S. K.

    2011-12-01

    While most Antarctic data are housed at an established disciplinary-specific data repository, there are data types for which no suitable repository exists. In some cases, these "orphan" data, without an appropriate national archive, are served from local servers by the principal investigators who produced the data. There are many pitfalls with data served privately, including the frequent lack of adequate documentation to ensure the data can be understood by others for re-use and the impermanence of personal web sites. For example, if an investigator leaves an institution and the data moves, the link published is no longer accessible. To ensure continued availability of data, submission to long-term national data repositories is needed. As stated in the National Science Foundation Office of Polar Programs (NSF/OPP) Guidelines and Award Conditions for Scientific Data, investigators are obligated to submit their data for curation and long-term preservation; this includes the registration of a dataset description into the Antarctic Master Directory (AMD), http://gcmd.nasa.gov/Data/portals/amd/. The AMD is a Web-based, searchable directory of thousands of dataset descriptions, known as DIF records, submitted by scientists from over 20 countries. It serves as a node of the International Directory Network/Global Change Master Directory (IDN/GCMD). The US Antarctic Program Data Coordination Center (USAP-DCC), http://www.usap-data.org/, funded through NSF/OPP, was established in 2007 to help streamline the process of data submission and DIF record creation. When data does not quite fit within any existing disciplinary repository, it can be registered within the USAP-DCC as the fallback data repository. Within the scope of the USAP-DCC we undertook the challenge of discovering and "rescuing" orphan datasets currently registered within the AMD. In order to find which DIF records led to data served privately, all records relating to US data within the AMD were parsed. After

  4. NERIES: Seismic Data Gateways and User Composed Datasets Metadata Management

    Science.gov (United States)

    Spinuso, Alessandro; Trani, Luca; Kamb, Linus; Frobert, Laurent

    2010-05-01

    One of the NERIES EC project main objectives is to establish and improve the networking of seismic waveform data exchange and access among four main data centers in Europe: INGV, GFZ, ORFEUS and IPGP. Besides the implementation of the data backbone, several investigations and developments have been conducted in order to offer to the users the data available from this network, either programmatically or interactively. One of the challenges is to understand how to enable users` activities such as discovering, aggregating, describing and sharing datasets to obtain a decrease in the replication of similar data queries towards the network, exempting the data centers to guess and create useful pre-packed products. We`ve started to transfer this task more and more towards the users community, where the users` composed data products could be extensively re-used. The main link to the data is represented by a centralized webservice (SeismoLink) acting like a single access point to the whole data network. Users can download either waveform data or seismic station inventories directly from their own software routines by connecting to this webservice, which routes the request to the data centers. The provenance of the data is maintained and transferred to the users in the form of URIs, that identify the dataset and implicitly refer to the data provider. SeismoLink, combined with other webservices (eg EMSC-QuakeML earthquakes catalog service), is used from a community gateway such as the NERIES web portal (http://www.seismicportal.eu). Here the user interacts with a map based portlet which allows the dynamic composition of a data product, binding seismic event`s parameters with a set of seismic stations. The requested data is collected by the back-end processes of the portal, preserved and offered to the user in a personal data cart, where metadata can be generated interactively on-demand. The metadata, expressed in RDF, can also be remotely ingested. They offer rating

  5. Accuracy assessment of seven global land cover datasets over China

    Science.gov (United States)

    Yang, Yongke; Xiao, Pengfeng; Feng, Xuezhi; Li, Haixing

    2017-03-01

    Land cover (LC) is the vital foundation to Earth science. Up to now, several global LC datasets have arisen with efforts of many scientific communities. To provide guidelines for data usage over China, nine LC maps from seven global LC datasets (IGBP DISCover, UMD, GLC, MCD12Q1, GLCNMO, CCI-LC, and GlobeLand30) were evaluated in this study. First, we compared their similarities and discrepancies in both area and spatial patterns, and analysed their inherent relations to data sources and classification schemes and methods. Next, five sets of validation sample units (VSUs) were collected to calculate their accuracy quantitatively. Further, we built a spatial analysis model and depicted their spatial variation in accuracy based on the five sets of VSUs. The results show that, there are evident discrepancies among these LC maps in both area and spatial patterns. For LC maps produced by different institutes, GLC 2000 and CCI-LC 2000 have the highest overall spatial agreement (53.8%). For LC maps produced by same institutes, overall spatial agreement of CCI-LC 2000 and 2010, and MCD12Q1 2001 and 2010 reach up to 99.8% and 73.2%, respectively; while more efforts are still needed if we hope to use these LC maps as time series data for model inputting, since both CCI-LC and MCD12Q1 fail to represent the rapid changing trend of several key LC classes in the early 21st century, in particular urban and built-up, snow and ice, water bodies, and permanent wetlands. With the highest spatial resolution, the overall accuracy of GlobeLand30 2010 is 82.39%. For the other six LC datasets with coarse resolution, CCI-LC 2010/2000 has the highest overall accuracy, and following are MCD12Q1 2010/2001, GLC 2000, GLCNMO 2008, IGBP DISCover, and UMD in turn. Beside that all maps exhibit high accuracy in homogeneous regions; local accuracies in other regions are quite different, particularly in Farming-Pastoral Zone of North China, mountains in Northeast China, and Southeast Hills. Special

  6. Gridded 5km GHCN-Daily Temperature and Precipitation Dataset, Version 1

    Data.gov (United States)

    National Oceanic and Atmospheric Administration, Department of Commerce — The Gridded 5km GHCN-Daily Temperature and Precipitation Dataset (nClimGrid) consists of four climate variables derived from the GHCN-D dataset: maximum temperature,...

  7. Active Semisupervised Clustering Algorithm with Label Propagation for Imbalanced and Multidensity Datasets

    Directory of Open Access Journals (Sweden)

    Mingwei Leng

    2013-01-01

    Full Text Available The accuracy of most of the existing semisupervised clustering algorithms based on small size of labeled dataset is low when dealing with multidensity and imbalanced datasets, and labeling data is quite expensive and time consuming in many real-world applications. This paper focuses on active data selection and semisupervised clustering algorithm in multidensity and imbalanced datasets and proposes an active semisupervised clustering algorithm. The proposed algorithm uses an active mechanism for data selection to minimize the amount of labeled data, and it utilizes multithreshold to expand labeled datasets on multidensity and imbalanced datasets. Three standard datasets and one synthetic dataset are used to demonstrate the proposed algorithm, and the experimental results show that the proposed semisupervised clustering algorithm has a higher accuracy and a more stable performance in comparison to other clustering and semisupervised clustering algorithms, especially when the datasets are multidensity and imbalanced.

  8. Dataset for Probabilistic estimation of residential air exchange rates for population-based exposure modeling

    Data.gov (United States)

    U.S. Environmental Protection Agency — This dataset provides the city-specific air exchange rate measurements, modeled, literature-based as well as housing characteristics. This dataset is associated with...

  9. An Affinity Propagation Clustering Algorithm for Mixed Numeric and Categorical Datasets

    Directory of Open Access Journals (Sweden)

    Kang Zhang

    2014-01-01

    Full Text Available Clustering has been widely used in different fields of science, technology, social science, and so forth. In real world, numeric as well as categorical features are usually used to describe the data objects. Accordingly, many clustering methods can process datasets that are either numeric or categorical. Recently, algorithms that can handle the mixed data clustering problems have been developed. Affinity propagation (AP algorithm is an exemplar-based clustering method which has demonstrated good performance on a wide variety of datasets. However, it has limitations on processing mixed datasets. In this paper, we propose a novel similarity measure for mixed type datasets and an adaptive AP clustering algorithm is proposed to cluster the mixed datasets. Several real world datasets are studied to evaluate the performance of the proposed algorithm. Comparisons with other clustering algorithms demonstrate that the proposed method works well not only on mixed datasets but also on pure numeric and categorical datasets.

  10. Ecohydrological Index, Native Fish, and Climate Trends and Relationships in the Kansas River Basin_dataset

    Data.gov (United States)

    U.S. Environmental Protection Agency — The dataset is an excel file that contain data for the figures in the manuscript. This dataset is associated with the following publication: Sinnathamby, S., K....

  11. Global Human Built-up And Settlement Extent (HBASE) Dataset From Landsat

    Data.gov (United States)

    National Aeronautics and Space Administration — The Global Human Built-up And Settlement Extent (HBASE) Dataset from Landsat is a global map of HBASE derived from the Global Land Survey (GLS) Landsat dataset for...

  12. An Automatic Matcher and Linker for Transportation Datasets

    Directory of Open Access Journals (Sweden)

    Ali Masri

    2017-01-01

    Full Text Available Multimodality requires the integration of heterogeneous transportation data to construct a broad view of the transportation network. Many new transportation services are emerging while being isolated from previously-existing networks. This leads them to publish their data sources to the web, according to linked data principles, in order to gain visibility. Our interest is to use these data to construct an extended transportation network that links these new services to existing ones. The main problems we tackle in this article fall in the categories of automatic schema matching and data interlinking. We propose an approach that uses web services as mediators to help in automatically detecting geospatial properties and mapping them between two different schemas. On the other hand, we propose a new interlinking approach that enables the user to define rich semantic links between datasets in a flexible and customizable way.

  13. [Parallel virtual reality visualization of extreme large medical datasets].

    Science.gov (United States)

    Tang, Min

    2010-04-01

    On the basis of a brief description of grid computing, the essence and critical techniques of parallel visualization of extreme large medical datasets are discussed in connection with Intranet and common-configuration computers of hospitals. In this paper are introduced several kernel techniques, including the hardware structure, software framework, load balance and virtual reality visualization. The Maximum Intensity Projection algorithm is realized in parallel using common PC cluster. In virtual reality world, three-dimensional models can be rotated, zoomed, translated and cut interactively and conveniently through the control panel built on virtual reality modeling language (VRML). Experimental results demonstrate that this method provides promising and real-time results for playing the role in of a good assistant in making clinical diagnosis.

  14. The wildland-urban interface raster dataset of Catalonia.

    Science.gov (United States)

    Alcasena, Fermín J; Evers, Cody R; Vega-Garcia, Cristina

    2018-04-01

    We provide the wildland urban interface (WUI) map of the autonomous community of Catalonia (Northeastern Spain). The map encompasses an area of some 3.21 million ha and is presented as a 150-m resolution raster dataset. Individual housing location, structure density and vegetation cover data were used to spatially assess in detail the interface, intermix and dispersed rural WUI communities with a geographical information system. Most WUI areas concentrate in the coastal belt where suburban sprawl has occurred nearby or within unmanaged forests. This geospatial information data provides an approximation of residential housing potential for loss given a wildfire, and represents a valuable contribution to assist landscape and urban planning in the region.

  15. xarray: N-D labeled Arrays and Datasets in Python

    Directory of Open Access Journals (Sweden)

    Stephan Hoyer

    2017-04-01

    Full Text Available xarray is an open source project and Python package that provides a toolkit and data structures for N-dimensional labeled arrays. Our approach combines an application programing interface (API inspired by pandas with the Common Data Model for self-described scientific data. Key features of the xarray package include label-based indexing and arithmetic, interoperability with the core scientific Python packages (e.g., pandas, NumPy, Matplotlib, out-of-core computation on datasets that don’t fit into memory, a wide range of serialization and input/output (I/O options, and advanced multi-dimensional data manipulation tools such as group-by and resampling. xarray, as a data model and analytics toolkit, has been widely adopted in the geoscience community but is also used more broadly for multi-dimensional data analysis in physics, machine learning and finance.

  16. The wildland-urban interface raster dataset of Catalonia

    Directory of Open Access Journals (Sweden)

    Fermín J. Alcasena

    2018-04-01

    Full Text Available We provide the wildland urban interface (WUI map of the autonomous community of Catalonia (Northeastern Spain. The map encompasses an area of some 3.21 million ha and is presented as a 150-m resolution raster dataset. Individual housing location, structure density and vegetation cover data were used to spatially assess in detail the interface, intermix and dispersed rural WUI communities with a geographical information system. Most WUI areas concentrate in the coastal belt where suburban sprawl has occurred nearby or within unmanaged forests. This geospatial information data provides an approximation of residential housing potential for loss given a wildfire, and represents a valuable contribution to assist landscape and urban planning in the region. Keywords: Wildland-urban interface, Wildfire risk, Urban planning, Human communities, Catalonia

  17. Reconstructing flaw image using dataset of full matrix capture technique

    Energy Technology Data Exchange (ETDEWEB)

    Lee, Tae Hun; Kim, Yong Sik; Lee, Jeong Seok [KHNP Central Research Institute, Daejeon (Korea, Republic of)

    2017-02-15

    A conventional phased array ultrasonic system offers the ability to steer an ultrasonic beam by applying independent time delays of individual elements in the array and produce an ultrasonic image. In contrast, full matrix capture (FMC) is a data acquisition process that collects a complete matrix of A-scans from every possible independent transmit-receive combination in a phased array transducer and makes it possible to reconstruct various images that cannot be produced by conventional phased array with the post processing as well as images equivalent to a conventional phased array image. In this paper, a basic algorithm based on the LLL mode total focusing method (TFM) that can image crack type flaws is described. And this technique was applied to reconstruct flaw images from the FMC dataset obtained from the experiments and ultrasonic simulation.

  18. Survey dataset on occupational hazards on construction sites

    Directory of Open Access Journals (Sweden)

    Patience F. Tunji-Olayeni

    2018-06-01

    Full Text Available The construction site provides an unfriendly working conditions, exposing workers to one of the harshest environments at a workplace. In this dataset, a structured questionnaire was design directed to thirty-five (35 craftsmen selected through a purposive sampling technique on various construction sites in one of the most populous cities in sub-Saharan Africa. The set of descriptive statistics is presented with tables, stacked bar chats and pie charts. Common occupational health conditions affecting the cardiovascular, respiratory and musculoskeletal systems of craftsmen on construction sites were identified. The effects of occupational health hazards on craftsmen and on construction project performance can be determined when the data is analyzed. Moreover, contractors’ commitment to occupational health and safety (OHS can be obtained from the analysis of the survey data. Keywords: Accidents, Construction industry, Craftsmen, Health, Occupational hazards

  19. Feedback control in deep drawing based on experimental datasets

    Science.gov (United States)

    Fischer, P.; Heingärtner, J.; Aichholzer, W.; Hortig, D.; Hora, P.

    2017-09-01

    In large-scale production of deep drawing parts, like in automotive industry, the effects of scattering material properties as well as warming of the tools have a significant impact on the drawing result. In the scope of the work, an approach is presented to minimize the influence of these effects on part quality by optically measuring the draw-in of each part and adjusting the settings of the press to keep the strain distribution, which is represented by the draw-in, inside a certain limit. For the design of the control algorithm, a design of experiments for in-line tests is used to quantify the influence of the blank holder force as well as the force distribution on the draw-in. The results of this experimental dataset are used to model the process behavior. Based on this model, a feedback control loop is designed. Finally, the performance of the control algorithm is validated in the production line.

  20. Orthology detection combining clustering and synteny for very large datasets.

    Science.gov (United States)

    Lechner, Marcus; Hernandez-Rosales, Maribel; Doerr, Daniel; Wieseke, Nicolas; Thévenin, Annelyse; Stoye, Jens; Hartmann, Roland K; Prohaska, Sonja J; Stadler, Peter F

    2014-01-01

    The elucidation of orthology relationships is an important step both in gene function prediction as well as towards understanding patterns of sequence evolution. Orthology assignments are usually derived directly from sequence similarities for large data because more exact approaches exhibit too high computational costs. Here we present PoFF, an extension for the standalone tool Proteinortho, which enhances orthology detection by combining clustering, sequence similarity, and synteny. In the course of this work, FFAdj-MCS, a heuristic that assesses pairwise gene order using adjacencies (a similarity measure related to the breakpoint distance) was adapted to support multiple linear chromosomes and extended to detect duplicated regions. PoFF largely reduces the number of false positives and enables more fine-grained predictions than purely similarity-based approaches. The extension maintains the low memory requirements and the efficient concurrency options of its basis Proteinortho, making the software applicable to very large datasets.

  1. Orthology detection combining clustering and synteny for very large datasets.

    Directory of Open Access Journals (Sweden)

    Marcus Lechner

    Full Text Available The elucidation of orthology relationships is an important step both in gene function prediction as well as towards understanding patterns of sequence evolution. Orthology assignments are usually derived directly from sequence similarities for large data because more exact approaches exhibit too high computational costs. Here we present PoFF, an extension for the standalone tool Proteinortho, which enhances orthology detection by combining clustering, sequence similarity, and synteny. In the course of this work, FFAdj-MCS, a heuristic that assesses pairwise gene order using adjacencies (a similarity measure related to the breakpoint distance was adapted to support multiple linear chromosomes and extended to detect duplicated regions. PoFF largely reduces the number of false positives and enables more fine-grained predictions than purely similarity-based approaches. The extension maintains the low memory requirements and the efficient concurrency options of its basis Proteinortho, making the software applicable to very large datasets.

  2. Comprehensive comparison of large-scale tissue expression datasets

    DEFF Research Database (Denmark)

    Santos Delgado, Alberto; Tsafou, Kalliopi; Stolte, Christian

    2015-01-01

    a comprehensive evaluation of tissue expression data from a variety of experimental techniques and show that these agree surprisingly well with each other and with results from literature curation and text mining. We further found that most datasets support the assumed but not demonstrated distinction between......For tissues to carry out their functions, they rely on the right proteins to be present. Several high-throughput technologies have been used to map out which proteins are expressed in which tissues; however, the data have not previously been systematically compared and integrated. We present......://tissues.jensenlab.org), which makes all the scored and integrated data available through a single user-friendly web interface....

  3. The SAIL databank: linking multiple health and social care datasets

    Directory of Open Access Journals (Sweden)

    Ford David V

    2009-01-01

    Full Text Available Abstract Background Vast amounts of data are collected about patients and service users in the course of health and social care service delivery. Electronic data systems for patient records have the potential to revolutionise service delivery and research. But in order to achieve this, it is essential that the ability to link the data at the individual record level be retained whilst adhering to the principles of information governance. The SAIL (Secure Anonymised Information Linkage databank has been established using disparate datasets, and over 500 million records from multiple health and social care service providers have been loaded to date, with further growth in progress. Methods Having established the infrastructure of the databank, the aim of this work was to develop and implement an accurate matching process to enable the assignment of a unique Anonymous Linking Field (ALF to person-based records to make the databank ready for record-linkage research studies. An SQL-based matching algorithm (MACRAL, Matching Algorithm for Consistent Results in Anonymised Linkage was developed for this purpose. Firstly the suitability of using a valid NHS number as the basis of a unique identifier was assessed using MACRAL. Secondly, MACRAL was applied in turn to match primary care, secondary care and social services datasets to the NHS Administrative Register (NHSAR, to assess the efficacy of this process, and the optimum matching technique. Results The validation of using the NHS number yielded specificity values > 99.8% and sensitivity values > 94.6% using probabilistic record linkage (PRL at the 50% threshold, and error rates were Conclusion With the infrastructure that has been put in place, the reliable matching process that has been developed enables an ALF to be consistently allocated to records in the databank. The SAIL databank represents a research-ready platform for record-linkage studies.

  4. Analysis of Public Datasets for Wearable Fall Detection Systems.

    Science.gov (United States)

    Casilari, Eduardo; Santoyo-Ramón, José-Antonio; Cano-García, José-Manuel

    2017-06-27

    Due to the boom of wireless handheld devices such as smartwatches and smartphones, wearable Fall Detection Systems (FDSs) have become a major focus of attention among the research community during the last years. The effectiveness of a wearable FDS must be contrasted against a wide variety of measurements obtained from inertial sensors during the occurrence of falls and Activities of Daily Living (ADLs). In this regard, the access to public databases constitutes the basis for an open and systematic assessment of fall detection techniques. This paper reviews and appraises twelve existing available data repositories containing measurements of ADLs and emulated falls envisaged for the evaluation of fall detection algorithms in wearable FDSs. The analysis of the found datasets is performed in a comprehensive way, taking into account the multiple factors involved in the definition of the testbeds deployed for the generation of the mobility samples. The study of the traces brings to light the lack of a common experimental benchmarking procedure and, consequently, the large heterogeneity of the datasets from a number of perspectives (length and number of samples, typology of the emulated falls and ADLs, characteristics of the test subjects, features and positions of the sensors, etc.). Concerning this, the statistical analysis of the samples reveals the impact of the sensor range on the reliability of the traces. In addition, the study evidences the importance of the selection of the ADLs and the need of categorizing the ADLs depending on the intensity of the movements in order to evaluate the capability of a certain detection algorithm to discriminate falls from ADLs.

  5. Clusterflock: a flocking algorithm for isolating congruent phylogenomic datasets.

    Science.gov (United States)

    Narechania, Apurva; Baker, Richard; DeSalle, Rob; Mathema, Barun; Kolokotronis, Sergios-Orestis; Kreiswirth, Barry; Planet, Paul J

    2016-10-24

    Collective animal behavior, such as the flocking of birds or the shoaling of fish, has inspired a class of algorithms designed to optimize distance-based clusters in various applications, including document analysis and DNA microarrays. In a flocking model, individual agents respond only to their immediate environment and move according to a few simple rules. After several iterations the agents self-organize, and clusters emerge without the need for partitional seeds. In addition to its unsupervised nature, flocking offers several computational advantages, including the potential to reduce the number of required comparisons. In the tool presented here, Clusterflock, we have implemented a flocking algorithm designed to locate groups (flocks) of orthologous gene families (OGFs) that share an evolutionary history. Pairwise distances that measure phylogenetic incongruence between OGFs guide flock formation. We tested this approach on several simulated datasets by varying the number of underlying topologies, the proportion of missing data, and evolutionary rates, and show that in datasets containing high levels of missing data and rate heterogeneity, Clusterflock outperforms other well-established clustering techniques. We also verified its utility on a known, large-scale recombination event in Staphylococcus aureus. By isolating sets of OGFs with divergent phylogenetic signals, we were able to pinpoint the recombined region without forcing a pre-determined number of groupings or defining a pre-determined incongruence threshold. Clusterflock is an open-source tool that can be used to discover horizontally transferred genes, recombined areas of chromosomes, and the phylogenetic 'core' of a genome. Although we used it here in an evolutionary context, it is generalizable to any clustering problem. Users can write extensions to calculate any distance metric on the unit interval, and can use these distances to 'flock' any type of data.

  6. The SAIL databank: linking multiple health and social care datasets.

    Science.gov (United States)

    Lyons, Ronan A; Jones, Kerina H; John, Gareth; Brooks, Caroline J; Verplancke, Jean-Philippe; Ford, David V; Brown, Ginevra; Leake, Ken

    2009-01-16

    Vast amounts of data are collected about patients and service users in the course of health and social care service delivery. Electronic data systems for patient records have the potential to revolutionise service delivery and research. But in order to achieve this, it is essential that the ability to link the data at the individual record level be retained whilst adhering to the principles of information governance. The SAIL (Secure Anonymised Information Linkage) databank has been established using disparate datasets, and over 500 million records from multiple health and social care service providers have been loaded to date, with further growth in progress. Having established the infrastructure of the databank, the aim of this work was to develop and implement an accurate matching process to enable the assignment of a unique Anonymous Linking Field (ALF) to person-based records to make the databank ready for record-linkage research studies. An SQL-based matching algorithm (MACRAL, Matching Algorithm for Consistent Results in Anonymised Linkage) was developed for this purpose. Firstly the suitability of using a valid NHS number as the basis of a unique identifier was assessed using MACRAL. Secondly, MACRAL was applied in turn to match primary care, secondary care and social services datasets to the NHS Administrative Register (NHSAR), to assess the efficacy of this process, and the optimum matching technique. The validation of using the NHS number yielded specificity values > 99.8% and sensitivity values > 94.6% using probabilistic record linkage (PRL) at the 50% threshold, and error rates were SAIL databank represents a research-ready platform for record-linkage studies.

  7. Analysis of Public Datasets for Wearable Fall Detection Systems

    Directory of Open Access Journals (Sweden)

    Eduardo Casilari

    2017-06-01

    Full Text Available Due to the boom of wireless handheld devices such as smartwatches and smartphones, wearable Fall Detection Systems (FDSs have become a major focus of attention among the research community during the last years. The effectiveness of a wearable FDS must be contrasted against a wide variety of measurements obtained from inertial sensors during the occurrence of falls and Activities of Daily Living (ADLs. In this regard, the access to public databases constitutes the basis for an open and systematic assessment of fall detection techniques. This paper reviews and appraises twelve existing available data repositories containing measurements of ADLs and emulated falls envisaged for the evaluation of fall detection algorithms in wearable FDSs. The analysis of the found datasets is performed in a comprehensive way, taking into account the multiple factors involved in the definition of the testbeds deployed for the generation of the mobility samples. The study of the traces brings to light the lack of a common experimental benchmarking procedure and, consequently, the large heterogeneity of the datasets from a number of perspectives (length and number of samples, typology of the emulated falls and ADLs, characteristics of the test subjects, features and positions of the sensors, etc.. Concerning this, the statistical analysis of the samples reveals the impact of the sensor range on the reliability of the traces. In addition, the study evidences the importance of the selection of the ADLs and the need of categorizing the ADLs depending on the intensity of the movements in order to evaluate the capability of a certain detection algorithm to discriminate falls from ADLs.

  8. A high quality finger vascular pattern dataset collected using a custom designed capturing device

    NARCIS (Netherlands)

    Ton, B.T.; Veldhuis, Raymond N.J.

    2013-01-01

    The number of finger vascular pattern datasets available for the research community is scarce, therefore a new finger vascular pattern dataset containing 1440 images is prsented. This dataset is unique in its kind as the images are of high resolution and have a known pixel density. Furthermore this

  9. Something From Nothing (There): Collecting Global IPv6 Datasets from DNS

    NARCIS (Netherlands)

    Fiebig, T.; Borgolte, Kevin; Hao, Shuang; Kruegel, Christopher; Vigna, Giovanny; Spring, Neil; Riley, George F.

    2017-01-01

    Current large-scale IPv6 studies mostly rely on non-public datasets, asmost public datasets are domain specific. For instance, traceroute-based datasetsare biased toward network equipment. In this paper, we present a new methodologyto collect IPv6 address datasets that does not require access to

  10. Privacy preserving data anonymization of spontaneous ADE reporting system dataset.

    Science.gov (United States)

    Lin, Wen-Yang; Yang, Duen-Chuan; Wang, Jie-Teng

    2016-07-18

    To facilitate long-term safety surveillance of marketing drugs, many spontaneously reporting systems (SRSs) of ADR events have been established world-wide. Since the data collected by SRSs contain sensitive personal health information that should be protected to prevent the identification of individuals, it procures the issue of privacy preserving data publishing (PPDP), that is, how to sanitize (anonymize) raw data before publishing. Although much work has been done on PPDP, very few studies have focused on protecting privacy of SRS data and none of the anonymization methods is favorable for SRS datasets, due to which contain some characteristics such as rare events, multiple individual records, and multi-valued sensitive attributes. We propose a new privacy model called MS(k, θ (*) )-bounding for protecting published spontaneous ADE reporting data from privacy attacks. Our model has the flexibility of varying privacy thresholds, i.e., θ (*) , for different sensitive values and takes the characteristics of SRS data into consideration. We also propose an anonymization algorithm for sanitizing the raw data to meet the requirements specified through the proposed model. Our algorithm adopts a greedy-based clustering strategy to group the records into clusters, conforming to an innovative anonymization metric aiming to minimize the privacy risk as well as maintain the data utility for ADR detection. Empirical study was conducted using FAERS dataset from 2004Q1 to 2011Q4. We compared our model with four prevailing methods, including k-anonymity, (X, Y)-anonymity, Multi-sensitive l-diversity, and (α, k)-anonymity, evaluated via two measures, Danger Ratio (DR) and Information Loss (IL), and considered three different scenarios of threshold setting for θ (*) , including uniform setting, level-wise setting and frequency-based setting. We also conducted experiments to inspect the impact of anonymized data on the strengths of discovered ADR signals. With all three

  11. Standardization of GIS datasets for emergency preparedness of NPPs

    International Nuclear Information System (INIS)

    Saindane, Shashank S.; Suri, M.M.K.; Otari, Anil; Pradeepkumar, K.S.

    2012-01-01

    Probability of a major nuclear accident which can lead to large scale release of radioactivity into environment is extremely small by the incorporation of safety systems and defence-in-depth philosophy. Nevertheless emergency preparedness for implementation of counter measures to reduce the consequences are required for all major nuclear facilities. Iodine prophylaxis, Sheltering, evacuation etc. are protective measures to be implemented for members of public in the unlikely event of any significant releases from nuclear facilities. Bhabha Atomic Research Centre has developed a GIS supported Nuclear Emergency Preparedness Program. Preparedness for Response to Nuclear emergencies needs geographical details of the affected locations specially Nuclear Power Plant Sites and nearby public domain. Geographical information system data sets which the planners are looking for will have appropriate details in order to take decision and mobilize the resources in time and follow the Standard Operating Procedures. Maps are 2-dimensional representations of our real world and GIS makes it possible to manipulate large amounts of geo-spatially referenced data and convert it into information. This has become an integral part of the nuclear emergency preparedness and response planning. This GIS datasets consisting of layers such as village settlements, roads, hospitals, police stations, shelters etc. is standardized and effectively used during the emergency. The paper focuses on the need of standardization of GIS datasets which in turn can be used as a tool to display and evaluate the impact of standoff distances and selected zones in community planning. It will also highlight the database specifications which will help in fast processing of data and analysis to derive useful and helpful information. GIS has the capability to store, manipulate, analyze and display the large amount of required spatial and tabular data. This study intends to carry out a proper response and preparedness

  12. Forest restoration: a global dataset for biodiversity and vegetation structure.

    Science.gov (United States)

    Crouzeilles, Renato; Ferreira, Mariana S; Curran, Michael

    2016-08-01

    Restoration initiatives are becoming increasingly applied around the world. Billions of dollars have been spent on ecological restoration research and initiatives, but restoration outcomes differ widely among these initiatives in part due to variable socioeconomic and ecological contexts. Here, we present the most comprehensive dataset gathered to date on forest restoration. It encompasses 269 primary studies across 221 study landscapes in 53 countries and contains 4,645 quantitative comparisons between reference ecosystems (e.g., old-growth forest) and degraded or restored ecosystems for five taxonomic groups (mammals, birds, invertebrates, herpetofauna, and plants) and five measures of vegetation structure reflecting different ecological processes (cover, density, height, biomass, and litter). We selected studies that (1) were conducted in forest ecosystems; (2) had multiple replicate sampling sites to measure indicators of biodiversity and/or vegetation structure in reference and restored and/or degraded ecosystems; and (3) used less-disturbed forests as a reference to the ecosystem under study. We recorded (1) latitude and longitude; (2) study year; (3) country; (4) biogeographic realm; (5) past disturbance type; (6) current disturbance type; (7) forest conversion class; (8) restoration activity; (9) time that a system has been disturbed; (10) time elapsed since restoration started; (11) ecological metric used to assess biodiversity; and (12) quantitative value of the ecological metric of biodiversity and/or vegetation structure for reference and restored and/or degraded ecosystems. These were the most common data available in the selected studies. We also estimated forest cover and configuration in each study landscape using a recently developed 1 km consensus land cover dataset. We measured forest configuration as the (1) mean size of all forest patches; (2) size of the largest forest patch; and (3) edge:area ratio of forest patches. Global analyses of the

  13. BanglaLekha-Isolated: A multi-purpose comprehensive dataset of Handwritten Bangla Isolated characters

    Directory of Open Access Journals (Sweden)

    Mithun Biswas

    2017-06-01

    Full Text Available BanglaLekha-Isolated, a Bangla handwritten isolated character dataset is presented in this article. This dataset contains 84 different characters comprising of 50 Bangla basic characters, 10 Bangla numerals and 24 selected compound characters. 2000 handwriting samples for each of the 84 characters were collected, digitized and pre-processed. After discarding mistakes and scribbles, 1,66,105 handwritten character images were included in the final dataset. The dataset also includes labels indicating the age and the gender of the subjects from whom the samples were collected. This dataset could be used not only for optical handwriting recognition research but also to explore the influence of gender and age on handwriting. The dataset is publicly available at https://data.mendeley.com/datasets/hf6sf8zrkc/2.

  14. A Research Graph dataset for connecting research data repositories using RD-Switchboard.

    Science.gov (United States)

    Aryani, Amir; Poblet, Marta; Unsworth, Kathryn; Wang, Jingbo; Evans, Ben; Devaraju, Anusuriya; Hausstein, Brigitte; Klas, Claus-Peter; Zapilko, Benjamin; Kaplun, Samuele

    2018-05-29

    This paper describes the open access graph dataset that shows the connections between Dryad, CERN, ANDS and other international data repositories to publications and grants across multiple research data infrastructures. The graph dataset was created using the Research Graph data model and the Research Data Switchboard (RD-Switchboard), a collaborative project by the Research Data Alliance DDRI Working Group (DDRI WG) with the aim to discover and connect the related research datasets based on publication co-authorship or jointly funded grants. The graph dataset allows researchers to trace and follow the paths to understanding a body of work. By mapping the links between research datasets and related resources, the graph dataset improves both their discovery and visibility, while avoiding duplicate efforts in data creation. Ultimately, the linked datasets may spur novel ideas, facilitate reproducibility and re-use in new applications, stimulate combinatorial creativity, and foster collaborations across institutions.

  15. Provenance of Earth Science Datasets - How Deep Should One Go?

    Science.gov (United States)

    Ramapriyan, H.; Manipon, G. J. M.; Aulenbach, S.; Duggan, B.; Goldstein, J.; Hua, H.; Tan, D.; Tilmes, C.; Wilson, B. D.; Wolfe, R.; Zednik, S.

    2015-12-01

    For credibility of scientific research, transparency and reproducibility are essential. This fundamental tenet has been emphasized for centuries, and has been receiving increased attention in recent years. The Office of Management and Budget (2002) addressed reproducibility and other aspects of quality and utility of information from federal agencies. Specific guidelines from NASA (2002) are derived from the above. According to these guidelines, "NASA requires a higher standard of quality for information that is considered influential. Influential scientific, financial, or statistical information is defined as NASA information that, when disseminated, will have or does have clear and substantial impact on important public policies or important private sector decisions." For information to be compliant, "the information must be transparent and reproducible to the greatest possible extent." We present how the principles of transparency and reproducibility have been applied to NASA data supporting the Third National Climate Assessment (NCA3). The depth of trace needed of provenance of data used to derive conclusions in NCA3 depends on how the data were used (e.g., qualitatively or quantitatively). Given that the information is diligently maintained in the agency archives, it is possible to trace from a figure in the publication through the datasets, specific files, algorithm versions, instruments used for data collection, and satellites, as well as the individuals and organizations involved in each step. Such trace back permits transparency and reproducibility.

  16. A dataset from bottom trawl survey around Taiwan

    Directory of Open Access Journals (Sweden)

    Kwang-tsao Shao

    2012-05-01

    Full Text Available Bottom trawl fishery is one of the most important coastal fisheries in Taiwan both in production and economic values. However, its annual production started to decline due to overfishing since the 1980s. Its bycatch problem also damages the fishery resource seriously. Thus, the government banned the bottom fishery within 3 nautical miles along the shoreline in 1989. To evaluate the effectiveness of this policy, a four year survey was conducted from 2000–2003, in the waters around Taiwan and Penghu (Pescadore Islands, one region each year respectively. All fish specimens collected from trawling were brought back to lab for identification, individual number count and body weight measurement. These raw data have been integrated and established in Taiwan Fish Database (http://fishdb.sinica.edu.tw. They have also been published through TaiBIF (http://taibif.tw, FishBase and GBIF (website see below. This dataset contains 631 fish species and 3,529 records, making it the most complete demersal fish fauna and their temporal and spatial distributional data on the soft marine habitat in Taiwan.

  17. Integrated interpretation of overlapping AEM datasets achieved through standardisation

    Science.gov (United States)

    Sørensen, Camilla C.; Munday, Tim; Heinson, Graham

    2015-12-01

    Numerous airborne electromagnetic surveys have been acquired in Australia using a variety of systems. It is not uncommon to find two or more surveys covering the same ground, but acquired using different systems and at different times. Being able to combine overlapping datasets and get a spatially coherent resistivity-depth image of the ground can assist geological interpretation, particularly when more subtle geophysical responses are important. Combining resistivity-depth models obtained from the inversion of airborne electromagnetic (AEM) data can be challenging, given differences in system configuration, geometry, flying height and preservation or monitoring of system acquisition parameters such as waveform. In this study, we define and apply an approach to overlapping AEM surveys, acquired by fixed wing and helicopter time domain electromagnetic (EM) systems flown in the vicinity of the Goulds Dam uranium deposit in the Frome Embayment, South Australia, with the aim of mapping the basement geometry and the extent of the Billeroo palaeovalley. Ground EM soundings were used to standardise the AEM data, although results indicated that only data from the REPTEM system needed to be corrected to bring the two surveys into agreement and to achieve coherent spatial resistivity-depth intervals.

  18. A global dataset of sub-daily rainfall indices

    Science.gov (United States)

    Fowler, H. J.; Lewis, E.; Blenkinsop, S.; Guerreiro, S.; Li, X.; Barbero, R.; Chan, S.; Lenderink, G.; Westra, S.

    2017-12-01

    It is still uncertain how hydrological extremes will change with global warming as we do not fully understand the processes that cause extreme precipitation under current climate variability. The INTENSE project is using a novel and fully-integrated data-modelling approach to provide a step-change in our understanding of the nature and drivers of global precipitation extremes and change on societally relevant timescales, leading to improved high-resolution climate model representation of extreme rainfall processes. The INTENSE project is in conjunction with the World Climate Research Programme (WCRP)'s Grand Challenge on 'Understanding and Predicting Weather and Climate Extremes' and the Global Water and Energy Exchanges Project (GEWEX) Science questions. A new global sub-daily precipitation dataset has been constructed (data collection is ongoing). Metadata for each station has been calculated, detailing record lengths, missing data, station locations. A set of global hydroclimatic indices have been produced based upon stakeholder recommendations including indices that describe maximum rainfall totals and timing, the intensity, duration and frequency of storms, frequency of storms above specific thresholds and information about the diurnal cycle. This will provide a unique global data resource on sub-daily precipitation whose derived indices will be freely available to the wider scientific community.

  19. The Centennial Trends Greater Horn of Africa precipitation dataset

    Science.gov (United States)

    Funk, Chris; Nicholson, Sharon E.; Landsfeld, Martin F.; Klotter, Douglas; Peterson, Pete J.; Harrison, Laura

    2015-01-01

    East Africa is a drought prone, food and water insecure region with a highly variable climate. This complexity makes rainfall estimation challenging, and this challenge is compounded by low rain gauge densities and inhomogeneous monitoring networks. The dearth of observations is particularly problematic over the past decade, since the number of records in globally accessible archives has fallen precipitously. This lack of data coincides with an increasing scientific and humanitarian need to place recent seasonal and multi-annual East African precipitation extremes in a deep historic context. To serve this need, scientists from the UC Santa Barbara Climate Hazards Group and Florida State University have pooled their station archives and expertise to produce a high quality gridded ‘Centennial Trends’ precipitation dataset. Additional observations have been acquired from the national meteorological agencies and augmented with data provided by other universities. Extensive quality control of the data was carried out and seasonal anomalies interpolated using kriging. This paper documents the CenTrends methodology and data.

  20. Dataset on daytime outdoor thermal comfort for Belo Horizonte, Brazil.

    Science.gov (United States)

    Hirashima, Simone Queiroz da Silveira; Assis, Eleonora Sad de; Nikolopoulou, Marialena

    2016-12-01

    This dataset describe microclimatic parameters of two urban open public spaces in the city of Belo Horizonte, Brazil; physiological equivalent temperature (PET) index values and the related subjective responses of interviewees regarding thermal sensation perception and preference and thermal comfort evaluation. Individuals and behavioral characteristics of respondents were also presented. Data were collected at daytime, in summer and winter, 2013. Statistical treatment of this data was firstly presented in a PhD Thesis ("Percepção sonora e térmica e avaliação de conforto em espaços urbanos abertos do município de Belo Horizonte - MG, Brasil" (Hirashima, 2014) [1]), providing relevant information on thermal conditions in these locations and on thermal comfort assessment. Up to now, this data was also explored in the article "Daytime Thermal Comfort in Urban Spaces: A Field Study in Brazil" (Hirashima et al., in press) [2]. These references are recommended for further interpretation and discussion.

  1. Statistically and Computationally Efficient Estimating Equations for Large Spatial Datasets

    KAUST Repository

    Sun, Ying

    2014-11-07

    For Gaussian process models, likelihood based methods are often difficult to use with large irregularly spaced spatial datasets, because exact calculations of the likelihood for n observations require O(n3) operations and O(n2) memory. Various approximation methods have been developed to address the computational difficulties. In this paper, we propose new unbiased estimating equations based on score equation approximations that are both computationally and statistically efficient. We replace the inverse covariance matrix that appears in the score equations by a sparse matrix to approximate the quadratic forms, then set the resulting quadratic forms equal to their expected values to obtain unbiased estimating equations. The sparse matrix is constructed by a sparse inverse Cholesky approach to approximate the inverse covariance matrix. The statistical efficiency of the resulting unbiased estimating equations are evaluated both in theory and by numerical studies. Our methods are applied to nearly 90,000 satellite-based measurements of water vapor levels over a region in the Southeast Pacific Ocean.

  2. Statistically and Computationally Efficient Estimating Equations for Large Spatial Datasets

    KAUST Repository

    Sun, Ying; Stein, Michael L.

    2014-01-01

    For Gaussian process models, likelihood based methods are often difficult to use with large irregularly spaced spatial datasets, because exact calculations of the likelihood for n observations require O(n3) operations and O(n2) memory. Various approximation methods have been developed to address the computational difficulties. In this paper, we propose new unbiased estimating equations based on score equation approximations that are both computationally and statistically efficient. We replace the inverse covariance matrix that appears in the score equations by a sparse matrix to approximate the quadratic forms, then set the resulting quadratic forms equal to their expected values to obtain unbiased estimating equations. The sparse matrix is constructed by a sparse inverse Cholesky approach to approximate the inverse covariance matrix. The statistical efficiency of the resulting unbiased estimating equations are evaluated both in theory and by numerical studies. Our methods are applied to nearly 90,000 satellite-based measurements of water vapor levels over a region in the Southeast Pacific Ocean.

  3. Challenges and Experiences of Building Multidisciplinary Datasets across Cultures

    Science.gov (United States)

    Jamiyansharav, K.; Laituri, M.; Fernandez-Gimenez, M.; Fassnacht, S. R.; Venable, N. B. H.; Allegretti, A. M.; Reid, R.; Baival, B.; Jamsranjav, C.; Ulambayar, T.; Linn, S.; Angerer, J.

    2017-12-01

    Efficient data sharing and management are key challenges to multidisciplinary scientific research. These challenges are further complicated by adding a multicultural component. We address the construction of a complex database for social-ecological analysis in Mongolia. Funded by the National Science Foundation (NSF) Dynamics of Coupled Natural and Human (CNH) Systems, the Mongolian Rangelands and Resilience (MOR2) project focuses on the vulnerability of Mongolian pastoral systems to climate change and adaptive capacity. The MOR2 study spans over three years of fieldwork in 36 paired districts (Soum) from 18 provinces (Aimag) of Mongolia that covers steppe, mountain forest steppe, desert steppe and eastern steppe ecological zones. Our project team is composed of hydrologists, social scientists, geographers, and ecologists. The MOR2 database includes multiple ecological, social, meteorological, geospatial and hydrological datasets, as well as archives of original data and survey in multiple formats. Managing this complex database requires significant organizational skills, attention to detail and ability to communicate within collective team members from diverse disciplines and across multiple institutions in the US and Mongolia. We describe the database's rich content, organization, structure and complexity. We discuss lessons learned, best practices and recommendations for complex database management, sharing, and archiving in creating a cross-cultural and multi-disciplinary database.

  4. Automated Fault Interpretation and Extraction using Improved Supplementary Seismic Datasets

    Science.gov (United States)

    Bollmann, T. A.; Shank, R.

    2017-12-01

    During the interpretation of seismic volumes, it is necessary to interpret faults along with horizons of interest. With the improvement of technology, the interpretation of faults can be expedited with the aid of different algorithms that create supplementary seismic attributes, such as semblance and coherency. These products highlight discontinuities, but still need a large amount of human interaction to interpret faults and are plagued by noise and stratigraphic discontinuities. Hale (2013) presents a method to improve on these datasets by creating what is referred to as a Fault Likelihood volume. In general, these volumes contain less noise and do not emphasize stratigraphic features. Instead, planar features within a specified strike and dip range are highlighted. Once a satisfactory Fault Likelihood Volume is created, extraction of fault surfaces is much easier. The extracted fault surfaces are then exported to interpretation software for QC. Numerous software packages have implemented this methodology with varying results. After investigating these platforms, we developed a preferred Automated Fault Interpretation workflow.

  5. Privacy-preserving record linkage on large real world datasets.

    Science.gov (United States)

    Randall, Sean M; Ferrante, Anna M; Boyd, James H; Bauer, Jacqueline K; Semmens, James B

    2014-08-01

    Record linkage typically involves the use of dedicated linkage units who are supplied with personally identifying information to determine individuals from within and across datasets. The personally identifying information supplied to linkage units is separated from clinical information prior to release by data custodians. While this substantially reduces the risk of disclosure of sensitive information, some residual risks still exist and remain a concern for some custodians. In this paper we trial a method of record linkage which reduces privacy risk still further on large real world administrative data. The method uses encrypted personal identifying information (bloom filters) in a probability-based linkage framework. The privacy preserving linkage method was tested on ten years of New South Wales (NSW) and Western Australian (WA) hospital admissions data, comprising in total over 26 million records. No difference in linkage quality was found when the results were compared to traditional probabilistic methods using full unencrypted personal identifiers. This presents as a possible means of reducing privacy risks related to record linkage in population level research studies. It is hoped that through adaptations of this method or similar privacy preserving methods, risks related to information disclosure can be reduced so that the benefits of linked research taking place can be fully realised. Copyright © 2013 Elsevier Inc. All rights reserved.

  6. Genomics dataset on unclassified published organism (patent US 7547531

    Directory of Open Access Journals (Sweden)

    Mohammad Mahfuz Ali Khan Shawan

    2016-12-01

    Full Text Available Nucleotide (DNA sequence analysis provides important clues regarding the characteristics and taxonomic position of an organism. With the intention that, DNA sequence analysis is very crucial to learn about hierarchical classification of that particular organism. This dataset (patent US 7547531 is chosen to simplify all the complex raw data buried in undisclosed DNA sequences which help to open doors for new collaborations. In this data, a total of 48 unidentified DNA sequences from patent US 7547531 were selected and their complete sequences were retrieved from NCBI BioSample database. Quick response (QR code of those DNA sequences was constructed by DNA BarID tool. QR code is useful for the identification and comparison of isolates with other organisms. AT/GC content of the DNA sequences was determined using ENDMEMO GC Content Calculator, which indicates their stability at different temperature. The highest GC content was observed in GP445188 (62.5% which was followed by GP445198 (61.8% and GP445189 (59.44%, while lowest was in GP445178 (24.39%. In addition, New England BioLabs (NEB database was used to identify cleavage code indicating the 5, 3 and blunt end and enzyme code indicating the methylation site of the DNA sequences was also shown. These data will be helpful for the construction of the organisms’ hierarchical classification, determination of their phylogenetic and taxonomic position and revelation of their molecular characteristics.

  7. Structural dataset for the PPARγ V290M mutant

    Directory of Open Access Journals (Sweden)

    Ana C. Puhl

    2016-06-01

    Full Text Available Loss-of-function mutation V290M in the ligand-binding domain of the peroxisome proliferator activated receptor γ (PPARγ is associated with a ligand resistance syndrome (PLRS, characterized by partial lipodystrophy and severe insulin resistance. In this data article we discuss an X-ray diffraction dataset that yielded the structure of PPARγ LBD V290M mutant refined at 2.3 Å resolution, that allowed building of 3D model of the receptor mutant with high confidence and revealed continuous well-defined electron density for the partial agonist diclofenac bound to hydrophobic pocket of the PPARγ. These structural data provide significant insights into molecular basis of PLRS caused by V290M mutation and are correlated with the receptor disability of rosiglitazone binding and increased affinity for corepressors. Furthermore, our structural evidence helps to explain clinical observations which point out to a failure to restore receptor function by the treatment with a full agonist of PPARγ, rosiglitazone.

  8. Oil palm mapping for Malaysia using PALSAR-2 dataset

    Science.gov (United States)

    Gong, P.; Qi, C. Y.; Yu, L.; Cracknell, A.

    2016-12-01

    Oil palm is one of the most productive vegetable oil crops in the world. The main oil palm producing areas are distributed in humid tropical areas such as Malaysia, Indonesia, Thailand, western and central Africa, northern South America, and central America. Increasing market demands, high yields and low production costs of palm oil are the primary factors driving large-scale commercial cultivation of oil palm, especially in Malaysia and Indonesia. Global demand for palm oil has grown exponentially during the last 50 years, and the expansion of oil palm plantations is linked directly to the deforestation of natural forests. Satellite remote sensing plays an important role in monitoring expansion of oil palm. However, optical remote sensing images are difficult to acquire in the Tropics because of the frequent occurrence of thick cloud cover. This problem has led to the use of data obtained by synthetic aperture radar (SAR), which is a sensor capable of all-day/all-weather observation for studies in the Tropics. In this study, the ALOS-2 (Advanced Land Observing Satellite) PALSAR-2 (Phased Array type L-band SAR) datasets for year 2015 were used as an input to a support vector machine (SVM) based machine learning algorithm. Oil palm/non-oil palm samples were collected using a hexagonal equal-area sampling design. High-resolution images in Google Earth and PALSAR-2 imagery were used in human photo-interpretation to separate oil palm from others (i.e. cropland, forest, grassland, shrubland, water, hard surface and bareland). The characteristics of oil palms from various aspects, including PALSAR-2 backscattering coefficients (HH, HV), terrain and climate by using this sample set were further explored to post-process the SVM output. The average accuracy of oil palm type is better than 80% in the final oil palm map for Malaysia.

  9. Automatic aortic root segmentation in CTA whole-body dataset

    Science.gov (United States)

    Gao, Xinpei; Kitslaar, Pieter H.; Scholte, Arthur J. H. A.; Lelieveldt, Boudewijn P. F.; Dijkstra, Jouke; Reiber, Johan H. C.

    2016-03-01

    Trans-catheter aortic valve replacement (TAVR) is an evolving technique for patients with serious aortic stenosis disease. Typically, in this application a CTA data set is obtained of the patient's arterial system from the subclavian artery to the femoral arteries, to evaluate the quality of the vascular access route and analyze the aortic root to determine if and which prosthesis should be used. In this paper, we concentrate on the automated segmentation of the aortic root. The purpose of this study was to automatically segment the aortic root in computed tomography angiography (CTA) datasets to support TAVR procedures. The method in this study includes 4 major steps. First, the patient's cardiac CTA image was resampled to reduce the computation time. Next, the cardiac CTA image was segmented using an atlas-based approach. The most similar atlas was selected from a total of 8 atlases based on its image similarity to the input CTA image. Third, the aortic root segmentation from the previous step was transferred to the patient's whole-body CTA image by affine registration and refined in the fourth step using a deformable subdivision surface model fitting procedure based on image intensity. The pipeline was applied to 20 patients. The ground truth was created by an analyst who semi-automatically corrected the contours of the automatic method, where necessary. The average Dice similarity index between the segmentations of the automatic method and the ground truth was found to be 0.965±0.024. In conclusion, the current results are very promising.

  10. Introduction of a simple-model-based land surface dataset for Europe

    Science.gov (United States)

    Orth, Rene; Seneviratne, Sonia I.

    2015-04-01

    Land surface hydrology can play a crucial role during extreme events such as droughts, floods and even heat waves. We introduce in this study a new hydrological dataset for Europe that consists of soil moisture, runoff and evapotranspiration (ET). It is derived with a simple water balance model (SWBM) forced with precipitation, temperature and net radiation. The SWBM dataset extends over the period 1984-2013 with a daily time step and 0.5° × 0.5° resolution. We employ a novel calibration approach, in which we consider 300 random parameter sets chosen from an observation-based range. Using several independent validation datasets representing soil moisture (or terrestrial water content), ET and streamflow, we identify the best performing parameter set and hence the new dataset. To illustrate its usefulness, the SWBM dataset is compared against several state-of-the-art datasets (ERA-Interim/Land, MERRA-Land, GLDAS-2-Noah, simulations of the Community Land Model Version 4), using all validation datasets as reference. For soil moisture dynamics it outperforms the benchmarks. Therefore the SWBM soil moisture dataset constitutes a reasonable alternative to sparse measurements, little validated model results, or proxy data such as precipitation indices. Also in terms of runoff the SWBM dataset performs well, whereas the evaluation of the SWBM ET dataset is overall satisfactory, but the dynamics are less well captured for this variable. This highlights the limitations of the dataset, as it is based on a simple model that uses uniform parameter values. Hence some processes impacting ET dynamics may not be captured, and quality issues may occur in regions with complex terrain. Even though the SWBM is well calibrated, it cannot replace more sophisticated models; but as their calibration is a complex task the present dataset may serve as a benchmark in future. In addition we investigate the sources of skill of the SWBM dataset and find that the parameter set has a similar

  11. VideoWeb Dataset for Multi-camera Activities and Non-verbal Communication

    Science.gov (United States)

    Denina, Giovanni; Bhanu, Bir; Nguyen, Hoang Thanh; Ding, Chong; Kamal, Ahmed; Ravishankar, Chinya; Roy-Chowdhury, Amit; Ivers, Allen; Varda, Brenda

    Human-activity recognition is one of the most challenging problems in computer vision. Researchers from around the world have tried to solve this problem and have come a long way in recognizing simple motions and atomic activities. As the computer vision community heads toward fully recognizing human activities, a challenging and labeled dataset is needed. To respond to that need, we collected a dataset of realistic scenarios in a multi-camera network environment (VideoWeb) involving multiple persons performing dozens of different repetitive and non-repetitive activities. This chapter describes the details of the dataset. We believe that this VideoWeb Activities dataset is unique and it is one of the most challenging datasets available today. The dataset is publicly available online at http://vwdata.ee.ucr.edu/ along with the data annotation.

  12. USGS Watershed Boundary Dataset (WBD) Overlay Map Service from The National Map - National Geospatial Data Asset (NGDA) Watershed Boundary Dataset (WBD)

    Data.gov (United States)

    U.S. Geological Survey, Department of the Interior — The Watershed Boundary Dataset (WBD) from The National Map (TNM) defines the perimeter of drainage areas formed by the terrain and other landscape characteristics....

  13. The StreamCat Dataset: Accumulated Attributes for NHDPlusV2 Catchments (Version 2.1) for the Conterminous United States: National Coal Resource Dataset System

    Data.gov (United States)

    U.S. Environmental Protection Agency — This dataset represents the coal mine density and storage volumes within individual, local NHDPlusV2 catchments and upstream, contributing watersheds based on the...

  14. The StreamCat Dataset: Accumulated Attributes for NHDPlusV2 (Version 2.1) Catchments for the Conterminous United States: National Anthropogenic Barrier Dataset

    Data.gov (United States)

    U.S. Environmental Protection Agency — This dataset represents the dam density and storage volumes within individual, local NHDPlusV2 catchments and upstream, contributing watersheds based on the National...

  15. The StreamCat Dataset: Accumulated Attributes for NHDPlusV2 (Version 2.1) Catchments for the Conterminous United States: National Elevation Dataset

    Data.gov (United States)

    U.S. Environmental Protection Agency — This dataset represents the elevation values within individual local NHDPlusV2 catchments and upstream, contributing watersheds based on the National Elevation...

  16. Topic modeling for cluster analysis of large biological and medical datasets.

    Science.gov (United States)

    Zhao, Weizhong; Zou, Wen; Chen, James J

    2014-01-01

    The big data moniker is nowhere better deserved than to describe the ever-increasing prodigiousness and complexity of biological and medical datasets. New methods are needed to generate and test hypotheses, foster biological interpretation, and build validated predictors. Although multivariate techniques such as cluster analysis may allow researchers to identify groups, or clusters, of related variables, the accuracies and effectiveness of traditional clustering methods diminish for large and hyper dimensional datasets. Topic modeling is an active research field in machine learning and has been mainly used as an analytical tool to structure large textual corpora for data mining. Its ability to reduce high dimensionality to a small number of latent variables makes it suitable as a means for clustering or overcoming clustering difficulties in large biological and medical datasets. In this study, three topic model-derived clustering methods, highest probable topic assignment, feature selection and feature extraction, are proposed and tested on the cluster analysis of three large datasets: Salmonella pulsed-field gel electrophoresis (PFGE) dataset, lung cancer dataset, and breast cancer dataset, which represent various types of large biological or medical datasets. All three various methods are shown to improve the efficacy/effectiveness of clustering results on the three datasets in comparison to traditional methods. A preferable cluster analysis method emerged for each of the three datasets on the basis of replicating known biological truths. Topic modeling could be advantageously applied to the large datasets of biological or medical research. The three proposed topic model-derived clustering methods, highest probable topic assignment, feature selection and feature extraction, yield clustering improvements for the three different data types. Clusters more efficaciously represent truthful groupings and subgroupings in the data than traditional methods, suggesting

  17. A new dataset and algorithm evaluation for mood estimation in music

    OpenAIRE

    Godec, Primož

    2014-01-01

    This thesis presents a new dataset of perceived and induced emotions for 200 audio clips. The gathered dataset provides users' perceived and induced emotions for each clip, the association of color, along with demographic and personal data, such as user's emotion state and emotion ratings, genre preference, music experience, among others. With an online survey we collected more than 7000 responses for a dataset of 200 audio excerpts, thus providing about 37 user responses per clip. The foc...

  18. Hydrological simulation of the Brahmaputra basin using global datasets

    Science.gov (United States)

    Bhattacharya, Biswa; Conway, Crystal; Craven, Joanne; Masih, Ilyas; Mazzolini, Maurizio; Shrestha, Shreedeepy; Ugay, Reyne; van Andel, Schalk Jan

    2017-04-01

    Brahmaputra River flows through China, India and Bangladesh to the Bay of Bengal and is one of the largest rivers of the world with a catchment size of 580K km2. The catchment is largely hilly and/or forested with sparse population and with limited urbanisation and economic activities. The catchment experiences heavy monsoon rainfall leading to very high flood discharges. Large inter-annual variation of discharge leading to flooding, erosion and morphological changes are among the major challenges. The catchment is largely ungauged; moreover, limited availability of hydro-meteorological data limits the possibility of carrying out evidence based research, which could provide trustworthy information for managing and when needed, controlling, the basin processes by the riparian countries for overall basin development. The paper presents initial results of a current research project on Brahmaputra basin. A set of hydrological and hydraulic models (SWAT, HMS, RAS) are developed by employing publicly available datasets of DEM, land use and soil and simulated using satellite based rainfall products, evapotranspiration and temperature estimates. Remotely sensed data are compared with sporadically available ground data. The set of models are able to produce catchment wide hydrological information that potentially can be used in the future in managing the basin's water resources. The model predications should be used with caution due to high level of uncertainty because the semi-calibrated models are developed with uncertain physical representation (e.g. cross-section) and simulated with global meteorological forcing (e.g. TRMM) with limited validation. Major scientific challenges are seen in producing robust information that can be reliably used in managing the basin. The information generated by the models are uncertain and as a result, instead of using them per se, they are used in improving the understanding of the catchment, and by running several scenarios with varying

  19. Investigating automated depth modelling of archaeo-magnetic datasets

    Science.gov (United States)

    Cheyney, Samuel; Hill, Ian; Linford, Neil; Leech, Christopher

    2010-05-01

    Magnetic surveying is a commonly used tool for first-pass non-invasive archaeological surveying, and is often used to target areas for more detailed geophysical investigation, or excavation. Quick and routine processing of magnetic datasets mean survey results are typically viewed as 2D greyscale maps and the shapes of anomalies are interpreted in terms of likely archaeological structures. This technique is simple, but ignores some of the information content of the data. The data collected using dense spatial sampling with modern precise instrumentation are capable of yielding numerical estimates of the depths to buried structures, and their physical properties. The magnetic field measured at the surface is a superposition of the responses to all anomalous magnetic susceptibilities in the subsurface, and is therefore capable of revealing a 3D model of the magnetic properties. The application of mathematical modelling techniques to very-near-surface surveys such as for archaeology is quite rare, however similar methods are routinely used in regional scale mineral exploration surveys. Inverse modelling techniques have inherent ambiguity due to the nature of the mathematical "inverse problem". Often, although a good fit to the recorded values can be obtained, the final model will be non-unique and may be heavily biased by the starting model provided. Also the run time and computer resources required can be restrictive. Our approach is to derive as much information as possible from the data directly, and use this to define a starting model for inversion. This addresses both the ambiguity of the inverse problem and reduces the task for the inversion computation. A number of alternative methods exist that can be used to obtain parameters for source bodies in potential field data. Here, methods involving the derivatives of the total magnetic field are used in association with advanced image processing techniques to outline the edges of anomalous bodies more accurately

  20. Reliability of Source Mechanisms for a Hydraulic Fracturing Dataset

    Science.gov (United States)

    Eyre, T.; Van der Baan, M.

    2016-12-01

    Non-double-couple components have been inferred for induced seismicity due to fluid injection, yet these components are often poorly constrained due to the acquisition geometry. Likewise non-double-couple components in microseismic recordings are not uncommon. Microseismic source mechanisms provide an insight into the fracturing behaviour of a hydraulically stimulated reservoir. However, source inversion in a hydraulic fracturing environment is complicated by the likelihood of volumetric contributions to the source due to the presence of high pressure fluids, which greatly increases the possible solution space and therefore the non-uniqueness of the solutions. Microseismic data is usually recorded on either 2D surface or borehole arrays of sensors. In many cases, surface arrays appear to constrain source mechanisms with high shear components, whereas borehole arrays tend to constrain more variable mechanisms including those with high tensile components. The abilities of each geometry to constrain the true source mechanisms are therefore called into question.The ability to distinguish between shear and tensile source mechanisms with different acquisition geometries is investigated using synthetic data. For both inversions, both P- and S- wave amplitudes recorded on three component sensors need to be included to obtain reliable solutions. Surface arrays appear to give more reliable solutions due to a greater sampling of the focal sphere, but in reality tend to record signals with a low signal to noise ratio. Borehole arrays can produce acceptable results, however the reliability is much more affected by relative source-receiver locations and source orientation, with biases produced in many of the solutions. Therefore more care must be taken when interpreting results.These findings are taken into account when interpreting a microseismic dataset of 470 events recorded by two vertical borehole arrays monitoring a horizontal treatment well. Source locations and

  1. Estimated Perennial Streams of Idaho and Related Geospatial Datasets

    Science.gov (United States)

    Rea, Alan; Skinner, Kenneth D.

    2009-01-01

    record, generally would be considered to represent flow conditions better at a given site than flow estimates based on regionalized regression models. The geospatial datasets of modeled perennial streams are considered a first-cut estimate, and should not be construed to override site-specific flow data.

  2. Scalable and portable visualization of large atomistic datasets

    Science.gov (United States)

    Sharma, Ashish; Kalia, Rajiv K.; Nakano, Aiichiro; Vashishta, Priya

    2004-10-01

    A scalable and portable code named Atomsviewer has been developed to interactively visualize a large atomistic dataset consisting of up to a billion atoms. The code uses a hierarchical view frustum-culling algorithm based on the octree data structure to efficiently remove atoms outside of the user's field-of-view. Probabilistic and depth-based occlusion-culling algorithms then select atoms, which have a high probability of being visible. Finally a multiresolution algorithm is used to render the selected subset of visible atoms at varying levels of detail. Atomsviewer is written in C++ and OpenGL, and it has been tested on a number of architectures including Windows, Macintosh, and SGI. Atomsviewer has been used to visualize tens of millions of atoms on a standard desktop computer and, in its parallel version, up to a billion atoms. Program summaryTitle of program: Atomsviewer Catalogue identifier: ADUM Program summary URL:http://cpc.cs.qub.ac.uk/summaries/ADUM Program obtainable from: CPC Program Library, Queen's University of Belfast, N. Ireland Computer for which the program is designed and others on which it has been tested: 2.4 GHz Pentium 4/Xeon processor, professional graphics card; Apple G4 (867 MHz)/G5, professional graphics card Operating systems under which the program has been tested: Windows 2000/XP, Mac OS 10.2/10.3, SGI IRIX 6.5 Programming languages used: C++, C and OpenGL Memory required to execute with typical data: 1 gigabyte of RAM High speed storage required: 60 gigabytes No. of lines in the distributed program including test data, etc.: 550 241 No. of bytes in the distributed program including test data, etc.: 6 258 245 Number of bits in a word: Arbitrary Number of processors used: 1 Has the code been vectorized or parallelized: No Distribution format: tar gzip file Nature of physical problem: Scientific visualization of atomic systems Method of solution: Rendering of atoms using computer graphic techniques, culling algorithms for data

  3. GUDM: Automatic Generation of Unified Datasets for Learning and Reasoning in Healthcare.

    Science.gov (United States)

    Ali, Rahman; Siddiqi, Muhammad Hameed; Idris, Muhammad; Ali, Taqdir; Hussain, Shujaat; Huh, Eui-Nam; Kang, Byeong Ho; Lee, Sungyoung

    2015-07-02

    A wide array of biomedical data are generated and made available to healthcare experts. However, due to the diverse nature of data, it is difficult to predict outcomes from it. It is therefore necessary to combine these diverse data sources into a single unified dataset. This paper proposes a global unified data model (GUDM) to provide a global unified data structure for all data sources and generate a unified dataset by a "data modeler" tool. The proposed tool implements user-centric priority based approach which can easily resolve the problems of unified data modeling and overlapping attributes across multiple datasets. The tool is illustrated using sample diabetes mellitus data. The diverse data sources to generate the unified dataset for diabetes mellitus include clinical trial information, a social media interaction dataset and physical activity data collected using different sensors. To realize the significance of the unified dataset, we adopted a well-known rough set theory based rules creation process to create rules from the unified dataset. The evaluation of the tool on six different sets of locally created diverse datasets shows that the tool, on average, reduces 94.1% time efforts of the experts and knowledge engineer while creating unified datasets.

  4. Wehmas et al. 94-04 Toxicol Sci: Datasets for manuscript

    Data.gov (United States)

    U.S. Environmental Protection Agency — Dataset includes overview text document (accepted version of manuscript) and tables, figures, and supplementary materials. Supplementary tables provide summary data...

  5. A New Dataset Size Reduction Approach for PCA-Based Classification in OCR Application

    Directory of Open Access Journals (Sweden)

    Mohammad Amin Shayegan

    2014-01-01

    Full Text Available A major problem of pattern recognition systems is due to the large volume of training datasets including duplicate and similar training samples. In order to overcome this problem, some dataset size reduction and also dimensionality reduction techniques have been introduced. The algorithms presently used for dataset size reduction usually remove samples near to the centers of classes or support vector samples between different classes. However, the samples near to a class center include valuable information about the class characteristics and the support vector is important for evaluating system efficiency. This paper reports on the use of Modified Frequency Diagram technique for dataset size reduction. In this new proposed technique, a training dataset is rearranged and then sieved. The sieved training dataset along with automatic feature extraction/selection operation using Principal Component Analysis is used in an OCR application. The experimental results obtained when using the proposed system on one of the biggest handwritten Farsi/Arabic numeral standard OCR datasets, Hoda, show about 97% accuracy in the recognition rate. The recognition speed increased by 2.28 times, while the accuracy decreased only by 0.7%, when a sieved version of the dataset, which is only as half as the size of the initial training dataset, was used.

  6. Topographical effects of climate dataset and their impacts on the estimation of regional net primary productivity

    Science.gov (United States)

    Sun, L. Qing; Feng, Feng X.

    2014-11-01

    In this study, we first built and compared two different climate datasets for Wuling mountainous area in 2010, one of which considered topographical effects during the ANUSPLIN interpolation was referred as terrain-based climate dataset, while the other one did not was called ordinary climate dataset. Then, we quantified the topographical effects of climatic inputs on NPP estimation by inputting two different climate datasets to the same ecosystem model, the Boreal Ecosystem Productivity Simulator (BEPS), to evaluate the importance of considering relief when estimating NPP. Finally, we found the primary contributing variables to the topographical effects through a series of experiments given an overall accuracy of the model output for NPP. The results showed that: (1) The terrain-based climate dataset presented more reliable topographic information and had closer agreements with the station dataset than the ordinary climate dataset at successive time series of 365 days in terms of the daily mean values. (2) On average, ordinary climate dataset underestimated NPP by 12.5% compared with terrain-based climate dataset over the whole study area. (3) The primary climate variables contributing to the topographical effects of climatic inputs for Wuling mountainous area were temperatures, which suggest that it is necessary to correct temperature differences for estimating NPP accurately in such a complex terrain.

  7. A Data-Centered Collaboration Portal to Support Global Carbon-Flux Analysis

    Energy Technology Data Exchange (ETDEWEB)

    Agarwal, Deborah A. [Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States); Univ. of California, Berkeley, CA (United States); Humphrey, Marty [Univ. of Virginia, Charlottesville, VA (United States); Beekwilder, Norm [Univ. of Virginia, Charlottesville, VA (United States); Jackson, Keith [Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States); Univ. of California, Berkeley, CA (United States); Goode, Monte [Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States); Univ. of California, Berkeley, CA (United States); van Ingen, Catharine [Microsoft. San Francisco, CA (United States)

    2009-04-07

    Carbon-climate, like other environmental sciences, has been changing. Large-scalesynthesis studies are becoming more common. These synthesis studies are often conducted by science teams that are geographically distributed and on datasets that are global in scale. A broad array of collaboration and data analytics tools are now available that could support these science teams. However, building tools that scientists actually use is hard. Also, moving scientists from an informal collaboration structure to one mediated by technology often exposes inconsistencies in the understanding of the rules of engagement between collaborators. We have developed a scientific collaboration portal, called fluxdata.org, which serves the community of scientists providing and analyzing the global FLUXNET carbon-flux synthesis dataset. Key things we learned or re-learned during our portal development include: minimize the barrier to entry, provide features on a just-in-time basis, development of requirements is an on-going process, provide incentives to change leaders and leverage the opportunity they represent, automate as much as possible, and you can only learn how to make it better if people depend on it enough to give you feedback. In addition, we also learned that splitting the portal roles between scientists and computer scientists improved user adoption and trust. The fluxdata.org portal has now been in operation for ~;;1.5 years and has become central to the FLUXNET synthesis efforts.

  8. Total synthesis of ciguatoxin.

    Science.gov (United States)

    Hamajima, Akinari; Isobe, Minoru

    2009-01-01

    Something fishy: Ciguatoxin (see structure) is one of the principal toxins involved in ciguatera poisoning and the target of a total synthesis involving the coupling of three segments. The key transformations in this synthesis feature acetylene-dicobalthexacarbonyl complexation.

  9. A novel dataset for real-life evaluation of facial expression recognition methodologies

    NARCIS (Netherlands)

    Siddiqi, Muhammad Hameed; Ali, Maqbool; Idris, Muhammad; Banos Legran, Oresti; Lee, Sungyoung; Choo, Hyunseung

    2016-01-01

    One limitation seen among most of the previous methods is that they were evaluated under settings that are far from real-life scenarios. The reason is that the existing facial expression recognition (FER) datasets are mostly pose-based and assume a predefined setup. The expressions in these datasets

  10. Full-Scale Approximations of Spatio-Temporal Covariance Models for Large Datasets

    KAUST Repository

    Zhang, Bohai; Sang, Huiyan; Huang, Jianhua Z.

    2014-01-01

    of dataset and application of such models is not feasible for large datasets. This article extends the full-scale approximation (FSA) approach by Sang and Huang (2012) to the spatio-temporal context to reduce computational complexity. A reversible jump Markov

  11. Gridded precipitation dataset for the Rhine basin made with the genRE interpolation method

    NARCIS (Netherlands)

    Osnabrugge, van B.; Uijlenhoet, R.

    2017-01-01

    A high resolution (1.2x1.2km) gridded precipitation dataset with hourly time step that covers the whole Rhine basin for the period 1997-2015. Made from gauge data with the genRE interpolation scheme. See "genRE: A method to extend gridded precipitation climatology datasets in near real-time for

  12. TrackingNet: A Large-Scale Dataset and Benchmark for Object Tracking in the Wild

    KAUST Repository

    Mü ller, Matthias; Bibi, Adel Aamer; Giancola, Silvio; Al-Subaihi, Salman; Ghanem, Bernard

    2018-01-01

    Despite the numerous developments in object tracking, further development of current tracking algorithms is limited by small and mostly saturated datasets. As a matter of fact, data-hungry trackers based on deep-learning currently rely on object detection datasets due to the scarcity of dedicated large-scale tracking datasets. In this work, we present TrackingNet, the first large-scale dataset and benchmark for object tracking in the wild. We provide more than 30K videos with more than 14 million dense bounding box annotations. Our dataset covers a wide selection of object classes in broad and diverse context. By releasing such a large-scale dataset, we expect deep trackers to further improve and generalize. In addition, we introduce a new benchmark composed of 500 novel videos, modeled with a distribution similar to our training dataset. By sequestering the annotation of the test set and providing an online evaluation server, we provide a fair benchmark for future development of object trackers. Deep trackers fine-tuned on a fraction of our dataset improve their performance by up to 1.6% on OTB100 and up to 1.7% on TrackingNet Test. We provide an extensive benchmark on TrackingNet by evaluating more than 20 trackers. Our results suggest that object tracking in the wild is far from being solved.

  13. Omicseq: a web-based search engine for exploring omics datasets

    Science.gov (United States)

    Sun, Xiaobo; Pittard, William S.; Xu, Tianlei; Chen, Li; Zwick, Michael E.; Jiang, Xiaoqian; Wang, Fusheng

    2017-01-01

    Abstract The development and application of high-throughput genomics technologies has resulted in massive quantities of diverse omics data that continue to accumulate rapidly. These rich datasets offer unprecedented and exciting opportunities to address long standing questions in biomedical research. However, our ability to explore and query the content of diverse omics data is very limited. Existing dataset search tools rely almost exclusively on the metadata. A text-based query for gene name(s) does not work well on datasets wherein the vast majority of their content is numeric. To overcome this barrier, we have developed Omicseq, a novel web-based platform that facilitates the easy interrogation of omics datasets holistically to improve ‘findability’ of relevant data. The core component of Omicseq is trackRank, a novel algorithm for ranking omics datasets that fully uses the numerical content of the dataset to determine relevance to the query entity. The Omicseq system is supported by a scalable and elastic, NoSQL database that hosts a large collection of processed omics datasets. In the front end, a simple, web-based interface allows users to enter queries and instantly receive search results as a list of ranked datasets deemed to be the most relevant. Omicseq is freely available at http://www.omicseq.org. PMID:28402462

  14. Document Questionnaires and Datasets with DDI: A Hands-On Introduction with Colectica

    OpenAIRE

    Iverson, Jeremy; Smith, Dan

    2018-01-01

    This workshop offers a hands-on, practical approach to creating and documenting both surveys and datasets with DDI and Colectica. Participants will build and field a DDI-driven survey using their own questions or samples provided in the workshop. They will then ingest, annotate, and publish DDI dataset descriptions using the collected survey data.

  15. An integrated pan-tropical biomass map using multiple reference datasets

    NARCIS (Netherlands)

    Avitabile, V.; Herold, M.; Heuvelink, G.B.M.; Lewis, S.L.; Phillips, O.L.; Asner, G.P.; Armston, J.; Asthon, P.; Banin, L.F.; Bayol, N.; Berry, N.; Boeckx, P.; Jong, De B.; Devries, B.; Girardin, C.; Kearsley, E.; Lindsell, J.A.; Lopez-gonzalez, G.; Lucas, R.; Malhi, Y.; Morel, A.; Mitchard, E.; Nagy, L.; Qie, L.; Quinones, M.; Ryan, C.M.; Slik, F.; Sunderland, T.; Vaglio Laurin, G.; Valentini, R.; Verbeeck, H.; Wijaya, A.; Willcock, S.

    2016-01-01

    We combined two existing datasets of vegetation aboveground biomass (AGB) (Proceedings of the National Academy of Sciences of the United States of America, 108, 2011, 9899; Nature Climate Change, 2, 2012, 182) into a pan-tropical AGB map at 1-km resolution using an independent reference dataset of

  16. Large-scale Labeled Datasets to Fuel Earth Science Deep Learning Applications

    Science.gov (United States)

    Maskey, M.; Ramachandran, R.; Miller, J.

    2017-12-01

    Deep learning has revolutionized computer vision and natural language processing with various algorithms scaled using high-performance computing. However, generic large-scale labeled datasets such as the ImageNet are the fuel that drives the impressive accuracy of deep learning results. Large-scale labeled datasets already exist in domains such as medical science, but creating them in the Earth science domain is a challenge. While there are ways to apply deep learning using limited labeled datasets, there is a need in the Earth sciences for creating large-scale labeled datasets for benchmarking and scaling deep learning applications. At the NASA Marshall Space Flight Center, we are using deep learning for a variety of Earth science applications where we have encountered the need for large-scale labeled datasets. We will discuss our approaches for creating such datasets and why these datasets are just as valuable as deep learning algorithms. We will also describe successful usage of these large-scale labeled datasets with our deep learning based applications.

  17. SAR image classification based on CNN in real and simulation datasets

    Science.gov (United States)

    Peng, Lijiang; Liu, Ming; Liu, Xiaohua; Dong, Liquan; Hui, Mei; Zhao, Yuejin

    2018-04-01

    Convolution neural network (CNN) has made great success in image classification tasks. Even in the field of synthetic aperture radar automatic target recognition (SAR-ATR), state-of-art results has been obtained by learning deep representation of features on the MSTAR benchmark. However, the raw data of MSTAR have shortcomings in training a SAR-ATR model because of high similarity in background among the SAR images of each kind. This indicates that the CNN would learn the hierarchies of features of backgrounds as well as the targets. To validate the influence of the background, some other SAR images datasets have been made which contains the simulation SAR images of 10 manufactured targets such as tank and fighter aircraft, and the backgrounds of simulation SAR images are sampled from the whole original MSTAR data. The simulation datasets contain the dataset that the backgrounds of each kind images correspond to the one kind of backgrounds of MSTAR targets or clutters and the dataset that each image shares the random background of whole MSTAR targets or clutters. In addition, mixed datasets of MSTAR and simulation datasets had been made to use in the experiments. The CNN architecture proposed in this paper are trained on all datasets mentioned above. The experimental results shows that the architecture can get high performances on all datasets even the backgrounds of the images are miscellaneous, which indicates the architecture can learn a good representation of the targets even though the drastic changes on background.

  18. Resolution testing and limitations of geodetic and tsunami datasets for finite fault inversions along subduction zones

    Science.gov (United States)

    Williamson, A.; Newman, A. V.

    2017-12-01

    Finite fault inversions utilizing multiple datasets have become commonplace for large earthquakes pending data availability. The mixture of geodetic datasets such as Global Navigational Satellite Systems (GNSS) and InSAR, seismic waveforms, and when applicable, tsunami waveforms from Deep-Ocean Assessment and Reporting of Tsunami (DART) gauges, provide slightly different observations that when incorporated together lead to a more robust model of fault slip distribution. The merging of different datasets is of particular importance along subduction zones where direct observations of seafloor deformation over the rupture area are extremely limited. Instead, instrumentation measures related ground motion from tens to hundreds of kilometers away. The distance from the event and dataset type can lead to a variable degree of resolution, affecting the ability to accurately model the spatial distribution of slip. This study analyzes the spatial resolution attained individually from geodetic and tsunami datasets as well as in a combined dataset. We constrain the importance of distance between estimated parameters and observed data and how that varies between land-based and open ocean datasets. Analysis focuses on accurately scaled subduction zone synthetic models as well as analysis of the relationship between slip and data in recent large subduction zone earthquakes. This study shows that seafloor deformation sensitive datasets, like open-ocean tsunami waveforms or seafloor geodetic instrumentation, can provide unique offshore resolution for understanding most large and particularly tsunamigenic megathrust earthquake activity. In most environments, we simply lack the capability to resolve static displacements using land-based geodetic observations.

  19. TrackingNet: A Large-Scale Dataset and Benchmark for Object Tracking in the Wild

    KAUST Repository

    Müller, Matthias

    2018-03-28

    Despite the numerous developments in object tracking, further development of current tracking algorithms is limited by small and mostly saturated datasets. As a matter of fact, data-hungry trackers based on deep-learning currently rely on object detection datasets due to the scarcity of dedicated large-scale tracking datasets. In this work, we present TrackingNet, the first large-scale dataset and benchmark for object tracking in the wild. We provide more than 30K videos with more than 14 million dense bounding box annotations. Our dataset covers a wide selection of object classes in broad and diverse context. By releasing such a large-scale dataset, we expect deep trackers to further improve and generalize. In addition, we introduce a new benchmark composed of 500 novel videos, modeled with a distribution similar to our training dataset. By sequestering the annotation of the test set and providing an online evaluation server, we provide a fair benchmark for future development of object trackers. Deep trackers fine-tuned on a fraction of our dataset improve their performance by up to 1.6% on OTB100 and up to 1.7% on TrackingNet Test. We provide an extensive benchmark on TrackingNet by evaluating more than 20 trackers. Our results suggest that object tracking in the wild is far from being solved.

  20. A Dataset of Three Educational Technology Experiments on Differentiation, Formative Testing and Feedback

    Science.gov (United States)

    Haelermans, Carla; Ghysels, Joris; Prince, Fernao

    2015-01-01

    This paper describes a dataset with data from three individually randomized educational technology experiments on differentiation, formative testing and feedback during one school year for a group of 8th grade students in the Netherlands, using administrative data and the online motivation questionnaire of Boekaerts. The dataset consists of pre-…

  1. Developing predictive imaging biomarkers using whole-brain classifiers: Application to the ABIDE I dataset

    Directory of Open Access Journals (Sweden)

    Swati Rane

    2017-03-01

    Full Text Available We designed a modular machine learning program that uses functional magnetic resonance imaging (fMRI data in order to distinguish individuals with autism spectrum disorders from neurodevelopmentally normal individuals. Data was selected from the Autism Brain Imaging Dataset Exchange (ABIDE I Preprocessed Dataset.

  2. Omicseq: a web-based search engine for exploring omics datasets.

    Science.gov (United States)

    Sun, Xiaobo; Pittard, William S; Xu, Tianlei; Chen, Li; Zwick, Michael E; Jiang, Xiaoqian; Wang, Fusheng; Qin, Zhaohui S

    2017-07-03

    The development and application of high-throughput genomics technologies has resulted in massive quantities of diverse omics data that continue to accumulate rapidly. These rich datasets offer unprecedented and exciting opportunities to address long standing questions in biomedical research. However, our ability to explore and query the content of diverse omics data is very limited. Existing dataset search tools rely almost exclusively on the metadata. A text-based query for gene name(s) does not work well on datasets wherein the vast majority of their content is numeric. To overcome this barrier, we have developed Omicseq, a novel web-based platform that facilitates the easy interrogation of omics datasets holistically to improve 'findability' of relevant data. The core component of Omicseq is trackRank, a novel algorithm for ranking omics datasets that fully uses the numerical content of the dataset to determine relevance to the query entity. The Omicseq system is supported by a scalable and elastic, NoSQL database that hosts a large collection of processed omics datasets. In the front end, a simple, web-based interface allows users to enter queries and instantly receive search results as a list of ranked datasets deemed to be the most relevant. Omicseq is freely available at http://www.omicseq.org. © The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research.

  3. Spatially continuous dataset at local scale of Taita Hills in Kenya and Mount Kilimanjaro in Tanzania

    Directory of Open Access Journals (Sweden)

    Sizah Mwalusepo

    2016-09-01

    Full Text Available Climate change is a global concern, requiring local scale spatially continuous dataset and modeling of meteorological variables. This dataset article provided the interpolated temperature, rainfall and relative humidity dataset at local scale along Taita Hills and Mount Kilimanjaro altitudinal gradients in Kenya and Tanzania, respectively. The temperature and relative humidity were recorded hourly using automatic onset THHOBO data loggers and rainfall was recorded daily using GENERALR wireless rain gauges. Thin plate spline (TPS was used to interpolate, with the degree of data smoothing determined by minimizing the generalized cross validation. The dataset provide information on the status of the current climatic conditions along the two mountainous altitudinal gradients in Kenya and Tanzania. The dataset will, thus, enhance future research. Keywords: Spatial climate data, Climate change, Modeling, Local scale

  4. PERFORMANCE COMPARISON FOR INTRUSION DETECTION SYSTEM USING NEURAL NETWORK WITH KDD DATASET

    Directory of Open Access Journals (Sweden)

    S. Devaraju

    2014-04-01

    Full Text Available Intrusion Detection Systems are challenging task for finding the user as normal user or attack user in any organizational information systems or IT Industry. The Intrusion Detection System is an effective method to deal with the kinds of problem in networks. Different classifiers are used to detect the different kinds of attacks in networks. In this paper, the performance of intrusion detection is compared with various neural network classifiers. In the proposed research the four types of classifiers used are Feed Forward Neural Network (FFNN, Generalized Regression Neural Network (GRNN, Probabilistic Neural Network (PNN and Radial Basis Neural Network (RBNN. The performance of the full featured KDD Cup 1999 dataset is compared with that of the reduced featured KDD Cup 1999 dataset. The MATLAB software is used to train and test the dataset and the efficiency and False Alarm Rate is measured. It is proved that the reduced dataset is performing better than the full featured dataset.

  5. A global gridded dataset of daily precipitation going back to 1950, ideal for analysing precipitation extremes

    Science.gov (United States)

    Contractor, S.; Donat, M.; Alexander, L. V.

    2017-12-01

    Reliable observations of precipitation are necessary to determine past changes in precipitation and validate models, allowing for reliable future projections. Existing gauge based gridded datasets of daily precipitation and satellite based observations contain artefacts and have a short length of record, making them unsuitable to analyse precipitation extremes. The largest limiting factor for the gauge based datasets is a dense and reliable station network. Currently, there are two major data archives of global in situ daily rainfall data, first is Global Historical Station Network (GHCN-Daily) hosted by National Oceanic and Atmospheric Administration (NOAA) and the other by Global Precipitation Climatology Centre (GPCC) part of the Deutsche Wetterdienst (DWD). We combine the two data archives and use automated quality control techniques to create a reliable long term network of raw station data, which we then interpolate using block kriging to create a global gridded dataset of daily precipitation going back to 1950. We compare our interpolated dataset with existing global gridded data of daily precipitation: NOAA Climate Prediction Centre (CPC) Global V1.0 and GPCC Full Data Daily Version 1.0, as well as various regional datasets. We find that our raw station density is much higher than other datasets. To avoid artefacts due to station network variability, we provide multiple versions of our dataset based on various completeness criteria, as well as provide the standard deviation, kriging error and number of stations for each grid cell and timestep to encourage responsible use of our dataset. Despite our efforts to increase the raw data density, the in situ station network remains sparse in India after the 1960s and in Africa throughout the timespan of the dataset. Our dataset would allow for more reliable global analyses of rainfall including its extremes and pave the way for better global precipitation observations with lower and more transparent uncertainties.

  6. Resampling Methods Improve the Predictive Power of Modeling in Class-Imbalanced Datasets

    Directory of Open Access Journals (Sweden)

    Paul H. Lee

    2014-09-01

    Full Text Available In the medical field, many outcome variables are dichotomized, and the two possible values of a dichotomized variable are referred to as classes. A dichotomized dataset is class-imbalanced if it consists mostly of one class, and performance of common classification models on this type of dataset tends to be suboptimal. To tackle such a problem, resampling methods, including oversampling and undersampling can be used. This paper aims at illustrating the effect of resampling methods using the National Health and Nutrition Examination Survey (NHANES wave 2009–2010 dataset. A total of 4677 participants aged ≥20 without self-reported diabetes and with valid blood test results were analyzed. The Classification and Regression Tree (CART procedure was used to build a classification model on undiagnosed diabetes. A participant demonstrated evidence of diabetes according to WHO diabetes criteria. Exposure variables included demographics and socio-economic status. CART models were fitted using a randomly selected 70% of the data (training dataset, and area under the receiver operating characteristic curve (AUC was computed using the remaining 30% of the sample for evaluation (testing dataset. CART models were fitted using the training dataset, the oversampled training dataset, the weighted training dataset, and the undersampled training dataset. In addition, resampling case-to-control ratio of 1:1, 1:2, and 1:4 were examined. Resampling methods on the performance of other extensions of CART (random forests and generalized boosted trees were also examined. CARTs fitted on the oversampled (AUC = 0.70 and undersampled training data (AUC = 0.74 yielded a better classification power than that on the training data (AUC = 0.65. Resampling could also improve the classification power of random forests and generalized boosted trees. To conclude, applying resampling methods in a class-imbalanced dataset improved the classification power of CART, random forests

  7. Facing the Challenges of Accessing, Managing, and Integrating Large Observational Datasets in Ecology: Enabling and Enriching the Use of NEON's Observational Data

    Science.gov (United States)

    Thibault, K. M.

    2013-12-01

    As the construction of NEON and its transition to operations progresses, more and more data will become available to the scientific community, both from NEON directly and from the concomitant growth of existing data repositories. Many of these datasets include ecological observations of a diversity of taxa in both aquatic and terrestrial environments. Although observational data have been collected and used throughout the history of organismal biology, the field has not yet fully developed a culture of data management, documentation, standardization, sharing and discoverability to facilitate the integration and synthesis of datasets. Moreover, the tools required to accomplish these goals, namely database design, implementation, and management, and automation and parallelization of analytical tasks through computational techniques, have not historically been included in biology curricula, at either the undergraduate or graduate levels. To ensure the success of data-generating projects like NEON in advancing organismal ecology and to increase transparency and reproducibility of scientific analyses, an acceleration of the cultural shift to open science practices, the development and adoption of data standards, such as the DarwinCore standard for taxonomic data, and increased training in computational approaches for biologists need to be realized. Here I highlight several initiatives that are intended to increase access to and discoverability of publicly available datasets and equip biologists and other scientists with the skills that are need to manage, integrate, and analyze data from multiple large-scale projects. The EcoData Retriever (ecodataretriever.org) is a tool that downloads publicly available datasets, re-formats the data into an efficient relational database structure, and then automatically imports the data tables onto a user's local drive into the database tool of the user's choice. The automation of these tasks results in nearly instantaneous execution

  8. Sensitivity of a numerical wave model on wind re-analysis datasets

    Science.gov (United States)

    Lavidas, George; Venugopal, Vengatesan; Friedrich, Daniel

    2017-03-01

    Wind is the dominant process for wave generation. Detailed evaluation of metocean conditions strengthens our understanding of issues concerning potential offshore applications. However, the scarcity of buoys and high cost of monitoring systems pose a barrier to properly defining offshore conditions. Through use of numerical wave models, metocean conditions can be hindcasted and forecasted providing reliable characterisations. This study reports the sensitivity of wind inputs on a numerical wave model for the Scottish region. Two re-analysis wind datasets with different spatio-temporal characteristics are used, the ERA-Interim Re-Analysis and the CFSR-NCEP Re-Analysis dataset. Different wind products alter results, affecting the accuracy obtained. The scope of this study is to assess different available wind databases and provide information concerning the most appropriate wind dataset for the specific region, based on temporal, spatial and geographic terms for wave modelling and offshore applications. Both wind input datasets delivered results from the numerical wave model with good correlation. Wave results by the 1-h dataset have higher peaks and lower biases, in expense of a high scatter index. On the other hand, the 6-h dataset has lower scatter but higher biases. The study shows how wind dataset affects the numerical wave modelling performance, and that depending on location and study needs, different wind inputs should be considered.

  9. Knowledge Mining from Clinical Datasets Using Rough Sets and Backpropagation Neural Network

    Directory of Open Access Journals (Sweden)

    Kindie Biredagn Nahato

    2015-01-01

    Full Text Available The availability of clinical datasets and knowledge mining methodologies encourages the researchers to pursue research in extracting knowledge from clinical datasets. Different data mining techniques have been used for mining rules, and mathematical models have been developed to assist the clinician in decision making. The objective of this research is to build a classifier that will predict the presence or absence of a disease by learning from the minimal set of attributes that has been extracted from the clinical dataset. In this work rough set indiscernibility relation method with backpropagation neural network (RS-BPNN is used. This work has two stages. The first stage is handling of missing values to obtain a smooth data set and selection of appropriate attributes from the clinical dataset by indiscernibility relation method. The second stage is classification using backpropagation neural network on the selected reducts of the dataset. The classifier has been tested with hepatitis, Wisconsin breast cancer, and Statlog heart disease datasets obtained from the University of California at Irvine (UCI machine learning repository. The accuracy obtained from the proposed method is 97.3%, 98.6%, and 90.4% for hepatitis, breast cancer, and heart disease, respectively. The proposed system provides an effective classification model for clinical datasets.

  10. An assessment of differences in gridded precipitation datasets in complex terrain

    Science.gov (United States)

    Henn, Brian; Newman, Andrew J.; Livneh, Ben; Daly, Christopher; Lundquist, Jessica D.

    2018-01-01

    Hydrologic modeling and other geophysical applications are sensitive to precipitation forcing data quality, and there are known challenges in spatially distributing gauge-based precipitation over complex terrain. We conduct a comparison of six high-resolution, daily and monthly gridded precipitation datasets over the Western United States. We compare the long-term average spatial patterns, and interannual variability of water-year total precipitation, as well as multi-year trends in precipitation across the datasets. We find that the greatest absolute differences among datasets occur in high-elevation areas and in the maritime mountain ranges of the Western United States, while the greatest percent differences among datasets relative to annual total precipitation occur in arid and rain-shadowed areas. Differences between datasets in some high-elevation areas exceed 200 mm yr-1 on average, and relative differences range from 5 to 60% across the Western United States. In areas of high topographic relief, true uncertainties and biases are likely higher than the differences among the datasets; we present evidence of this based on streamflow observations. Precipitation trends in the datasets differ in magnitude and sign at smaller scales, and are sensitive to how temporal inhomogeneities in the underlying precipitation gauge data are handled.

  11. Assessment of NASA's Physiographic and Meteorological Datasets as Input to HSPF and SWAT Hydrological Models

    Science.gov (United States)

    Alacron, Vladimir J.; Nigro, Joseph D.; McAnally, William H.; OHara, Charles G.; Engman, Edwin Ted; Toll, David

    2011-01-01

    This paper documents the use of simulated Moderate Resolution Imaging Spectroradiometer land use/land cover (MODIS-LULC), NASA-LIS generated precipitation and evapo-transpiration (ET), and Shuttle Radar Topography Mission (SRTM) datasets (in conjunction with standard land use, topographical and meteorological datasets) as input to hydrological models routinely used by the watershed hydrology modeling community. The study is focused in coastal watersheds in the Mississippi Gulf Coast although one of the test cases focuses in an inland watershed located in northeastern State of Mississippi, USA. The decision support tools (DSTs) into which the NASA datasets were assimilated were the Soil Water & Assessment Tool (SWAT) and the Hydrological Simulation Program FORTRAN (HSPF). These DSTs are endorsed by several US government agencies (EPA, FEMA, USGS) for water resources management strategies. These models use physiographic and meteorological data extensively. Precipitation gages and USGS gage stations in the region were used to calibrate several HSPF and SWAT model applications. Land use and topographical datasets were swapped to assess model output sensitivities. NASA-LIS meteorological data were introduced in the calibrated model applications for simulation of watershed hydrology for a time period in which no weather data were available (1997-2006). The performance of the NASA datasets in the context of hydrological modeling was assessed through comparison of measured and model-simulated hydrographs. Overall, NASA datasets were as useful as standard land use, topographical , and meteorological datasets. Moreover, NASA datasets were used for performing analyses that the standard datasets could not made possible, e.g., introduction of land use dynamics into hydrological simulations

  12. FASTQSim: platform-independent data characterization and in silico read generation for NGS datasets.

    Science.gov (United States)

    Shcherbina, Anna

    2014-08-15

    High-throughput next generation sequencing technologies have enabled rapid characterization of clinical and environmental samples. Consequently, the largest bottleneck to actionable data has become sample processing and bioinformatics analysis, creating a need for accurate and rapid algorithms to process genetic data. Perfectly characterized in silico datasets are a useful tool for evaluating the performance of such algorithms. Background contaminating organisms are observed in sequenced mixtures of organisms. In silico samples provide exact truth. To create the best value for evaluating algorithms, in silico data should mimic actual sequencer data as closely as possible. FASTQSim is a tool that provides the dual functionality of NGS dataset characterization and metagenomic data generation. FASTQSim is sequencing platform-independent, and computes distributions of read length, quality scores, indel rates, single point mutation rates, indel size, and similar statistics for any sequencing platform. To create training or testing datasets, FASTQSim has the ability to convert target sequences into in silico reads with specific error profiles obtained in the characterization step. FASTQSim enables users to assess the quality of NGS datasets. The tool provides information about read length, read quality, repetitive and non-repetitive indel profiles, and single base pair substitutions. FASTQSim allows the user to simulate individual read datasets that can be used as standardized test scenarios for planning sequencing projects or for benchmarking metagenomic software. In this regard, in silico datasets generated with the FASTQsim tool hold several advantages over natural datasets: they are sequencing platform independent, extremely well characterized, and less expensive to generate. Such datasets are valuable in a number of applications, including the training of assemblers for multiple platforms, benchmarking bioinformatics algorithm performance, and creating challenge

  13. Spatially-explicit estimation of geographical representation in large-scale species distribution datasets.

    Science.gov (United States)

    Kalwij, Jesse M; Robertson, Mark P; Ronk, Argo; Zobel, Martin; Pärtel, Meelis

    2014-01-01

    Much ecological research relies on existing multispecies distribution datasets. Such datasets, however, can vary considerably in quality, extent, resolution or taxonomic coverage. We provide a framework for a spatially-explicit evaluation of geographical representation within large-scale species distribution datasets, using the comparison of an occurrence atlas with a range atlas dataset as a working example. Specifically, we compared occurrence maps for 3773 taxa from the widely-used Atlas Florae Europaeae (AFE) with digitised range maps for 2049 taxa of the lesser-known Atlas of North European Vascular Plants. We calculated the level of agreement at a 50-km spatial resolution using average latitudinal and longitudinal species range, and area of occupancy. Agreement in species distribution was calculated and mapped using Jaccard similarity index and a reduced major axis (RMA) regression analysis of species richness between the entire atlases (5221 taxa in total) and between co-occurring species (601 taxa). We found no difference in distribution ranges or in the area of occupancy frequency distribution, indicating that atlases were sufficiently overlapping for a valid comparison. The similarity index map showed high levels of agreement for central, western, and northern Europe. The RMA regression confirmed that geographical representation of AFE was low in areas with a sparse data recording history (e.g., Russia, Belarus and the Ukraine). For co-occurring species in south-eastern Europe, however, the Atlas of North European Vascular Plants showed remarkably higher richness estimations. Geographical representation of atlas data can be much more heterogeneous than often assumed. Level of agreement between datasets can be used to evaluate geographical representation within datasets. Merging atlases into a single dataset is worthwhile in spite of methodological differences, and helps to fill gaps in our knowledge of species distribution ranges. Species distribution

  14. Medical Image Data and Datasets in the Era of Machine Learning-Whitepaper from the 2016 C-MIMI Meeting Dataset Session.

    Science.gov (United States)

    Kohli, Marc D; Summers, Ronald M; Geis, J Raymond

    2017-08-01

    At the first annual Conference on Machine Intelligence in Medical Imaging (C-MIMI), held in September 2016, a conference session on medical image data and datasets for machine learning identified multiple issues. The common theme from attendees was that everyone participating in medical image evaluation with machine learning is data starved. There is an urgent need to find better ways to collect, annotate, and reuse medical imaging data. Unique domain issues with medical image datasets require further study, development, and dissemination of best practices and standards, and a coordinated effort among medical imaging domain experts, medical imaging informaticists, government and industry data scientists, and interested commercial, academic, and government entities. High-level attributes of reusable medical image datasets suitable to train, test, validate, verify, and regulate ML products should be better described. NIH and other government agencies should promote and, where applicable, enforce, access to medical image datasets. We should improve communication among medical imaging domain experts, medical imaging informaticists, academic clinical and basic science researchers, government and industry data scientists, and interested commercial entities.

  15. Map Coordinate Referencing and the use of GPS Datasets in Ghana ...

    African Journals Online (AJOL)

    Map Coordinate Referencing and the use of GPS Datasets in Ghana. ... Journal of Science and Technology (Ghana) ... systems used in Ghana (the Ghana war office system and also the Clarke1880 system) using the Bursa-Wolf model.

  16. Climate Prediction Center(CPC)Infra-Red (IR) 0.5 degree Dataset

    Data.gov (United States)

    National Oceanic and Atmospheric Administration, Department of Commerce — Climate Prediction Center 0.5 degree IR dataset was created from all available individual geostationary satellite data which have been merged to form nearly seamless...

  17. USGS HYDRoacoustic dataset in support of the Surface Water Oceanographic Topography satellite mission (HYDRoSWOT)

    Data.gov (United States)

    Department of the Interior — HYDRoSWOT – HYDRoacoustic dataset in support of Surface Water Oceanographic Topography – is a data set that aggregates channel and flow data collected from the USGS...

  18. Vector Nonlinear Time-Series Analysis of Gamma-Ray Burst Datasets on Heterogeneous Clusters

    Directory of Open Access Journals (Sweden)

    Ioana Banicescu

    2005-01-01

    Full Text Available The simultaneous analysis of a number of related datasets using a single statistical model is an important problem in statistical computing. A parameterized statistical model is to be fitted on multiple datasets and tested for goodness of fit within a fixed analytical framework. Definitive conclusions are hopefully achieved by analyzing the datasets together. This paper proposes a strategy for the efficient execution of this type of analysis on heterogeneous clusters. Based on partitioning processors into groups for efficient communications and a dynamic loop scheduling approach for load balancing, the strategy addresses the variability of the computational loads of the datasets, as well as the unpredictable irregularities of the cluster environment. Results from preliminary tests of using this strategy to fit gamma-ray burst time profiles with vector functional coefficient autoregressive models on 64 processors of a general purpose Linux cluster demonstrate the effectiveness of the strategy.

  19. Dataset of Passerine bird communities in a Mediterranean high mountain (Sierra Nevada, Spain)

    Science.gov (United States)

    Pérez-Luque, Antonio Jesús; Barea-Azcón, José Miguel; Álvarez-Ruiz, Lola; Bonet-García, Francisco Javier; Zamora, Regino

    2016-01-01

    Abstract In this data paper, a dataset of passerine bird communities is described in Sierra Nevada, a Mediterranean high mountain located in southern Spain. The dataset includes occurrence data from bird surveys conducted in four representative ecosystem types of Sierra Nevada from 2008 to 2015. For each visit, bird species numbers as well as distance to the transect line were recorded. A total of 27847 occurrence records were compiled with accompanying measurements on distance to the transect and animal counts. All records are of species in the order Passeriformes. Records of 16 different families and 44 genera were collected. Some of the taxa in the dataset are included in the European Red List. This dataset belongs to the Sierra Nevada Global-Change Observatory (OBSNEV), a long-term research project designed to compile socio-ecological information on the major ecosystem types in order to identify the impacts of global change in this area. PMID:26865820

  20. Telephone Interpreter Services (TIS)-Asian and Pacific Islander (API) Language Yearly Dataset

    Data.gov (United States)

    Social Security Administration — This dataset displays our national TIS call volume for over 45 API languages for fiscal year 2011 onward. A fiscal year runs from October through September. We will...

  1. Telephone Interpreter Services (TIS) - Asian and Pacific Islander (API) Language Fiscal Year Quarterly Dataset

    Data.gov (United States)

    Social Security Administration — This dataset displays our quarterly national TIS call volume for over 45 API languages for fiscal year 2013 onward. A fiscal year runs from October through September...

  2. Dataset of Passerine bird communities in a Mediterranean high mountain (Sierra Nevada, Spain).

    Science.gov (United States)

    Pérez-Luque, Antonio Jesús; Barea-Azcón, José Miguel; Álvarez-Ruiz, Lola; Bonet-García, Francisco Javier; Zamora, Regino

    2016-01-01

    In this data paper, a dataset of passerine bird communities is described in Sierra Nevada, a Mediterranean high mountain located in southern Spain. The dataset includes occurrence data from bird surveys conducted in four representative ecosystem types of Sierra Nevada from 2008 to 2015. For each visit, bird species numbers as well as distance to the transect line were recorded. A total of 27847 occurrence records were compiled with accompanying measurements on distance to the transect and animal counts. All records are of species in the order Passeriformes. Records of 16 different families and 44 genera were collected. Some of the taxa in the dataset are included in the European Red List. This dataset belongs to the Sierra Nevada Global-Change Observatory (OBSNEV), a long-term research project designed to compile socio-ecological information on the major ecosystem types in order to identify the impacts of global change in this area.

  3. A multimodal dataset for authoring and editing multimedia content: The MAMEM project

    Directory of Open Access Journals (Sweden)

    Spiros Nikolopoulos

    2017-12-01

    Full Text Available We present a dataset that combines multimodal biosignals and eye tracking information gathered under a human-computer interaction framework. The dataset was developed in the vein of the MAMEM project that aims to endow people with motor disabilities with the ability to edit and author multimedia content through mental commands and gaze activity. The dataset includes EEG, eye-tracking, and physiological (GSR and Heart rate signals collected from 34 individuals (18 able-bodied and 16 motor-impaired. Data were collected during the interaction with specifically designed interface for web browsing and multimedia content manipulation and during imaginary movement tasks. The presented dataset will contribute towards the development and evaluation of modern human-computer interaction systems that would foster the integration of people with severe motor impairments back into society.

  4. Toxics Release Inventory Chemical Hazard Information Profiles (TRI-CHIP) Dataset

    Data.gov (United States)

    U.S. Environmental Protection Agency — The Toxics Release Inventory (TRI) Chemical Hazard Information Profiles (TRI-CHIP) dataset contains hazard information about the chemicals reported in TRI. Users can...

  5. Integration of geophysical datasets by a conjoint probability tomography approach: application to Italian active volcanic areas

    Directory of Open Access Journals (Sweden)

    D. Patella

    2008-06-01

    Full Text Available We expand the theory of probability tomography to the integration of different geophysical datasets. The aim of the new method is to improve the information quality using a conjoint occurrence probability function addressed to highlight the existence of common sources of anomalies. The new method is tested on gravity, magnetic and self-potential datasets collected in the volcanic area of Mt. Vesuvius (Naples, and on gravity and dipole geoelectrical datasets collected in the volcanic area of Mt. Etna (Sicily. The application demonstrates that, from a probabilistic point of view, the integrated analysis can delineate the signature of some important volcanic targets better than the analysis of the tomographic image of each dataset considered separately.

  6. Potential Impacts of Climate Change on World Food Supply: Datasets from a Major Crop Modeling Study

    Data.gov (United States)

    National Aeronautics and Space Administration — Datasets from a Major Crop Modeling Study contain projected country and regional changes in grain crop yields due to global climate change. Equilibrium and transient...

  7. CoVennTree: A new method for the comparative analysis of large datasets

    Directory of Open Access Journals (Sweden)

    Steffen C. Lott

    2015-02-01

    Full Text Available The visualization of massive datasets, such as those resulting from comparative metatranscriptome analyses or the analysis of microbial population structures using ribosomal RNA sequences, is a challenging task. We developed a new method called CoVennTree (Comparative weighted Venn Tree that simultaneously compares up to three multifarious datasets by aggregating and propagating information from the bottom to the top level and produces a graphical output in Cytoscape. With the introduction of weighted Venn structures, the contents and relationships of various datasets can be correlated and simultaneously aggregated without losing information. We demonstrate the suitability of this approach using a dataset of 16S rDNA sequences obtained from microbial populations at three different depths of the Gulf of Aqaba in the Red Sea. CoVennTree has been integrated into the Galaxy ToolShed and can be directly downloaded and integrated into the user instance.

  8. Statistical exploration of dataset examining key indicators influencing housing and urban infrastructure investments in megacities

    Directory of Open Access Journals (Sweden)

    Adedeji O. Afolabi

    2018-06-01

    Full Text Available Lagos, by the UN standards, has attained the megacity status, with the attendant challenges of living up to that titanic position; regrettably it struggles with its present stock of housing and infrastructural facilities to match its new status. Based on a survey of construction professionals’ perception residing within the state, a questionnaire instrument was used to gather the dataset. The statistical exploration contains dataset on the state of housing and urban infrastructural deficit, key indicators spurring the investment by government to upturn the deficit and improvement mechanisms to tackle the infrastructural dearth. Descriptive statistics and inferential statistics were used to present the dataset. The dataset when analyzed can be useful for policy makers, local and international governments, world funding bodies, researchers and infrastructural investors. Keywords: Construction, Housing, Megacities, Population, Urban infrastructures

  9. EPA Office of Water (OW): Fish Consumption Advisories and Fish Tissue Sampling Stations NHDPlus Indexed Datasets

    Data.gov (United States)

    U.S. Environmental Protection Agency — The Fish Consumption Advisories dataset contains information on Fish Advisory events that have been indexed to the EPA Office of Water NHDPlus v2.1 hydrology and...

  10. LenoxKaplan_Role of natural gas in meeting electric sector emissions reduction strategy_dataset

    Data.gov (United States)

    U.S. Environmental Protection Agency — This dataset is for an analysis that used the MARKAL linear optimization model to compare the carbon emissions profiles and system-wide global warming potential of...

  11. National Hydrography Dataset Plus High Resolution (NHDPlus HR) - USGS National Map Downloadable Data Collection

    Data.gov (United States)

    U.S. Geological Survey, Department of the Interior — The High Resolution National Hydrography Dataset Plus (NHDPlus HR) is an integrated set of geospatial data layers, including the best available National Hydrography...

  12. Computational Methods for Large Spatio-temporal Datasets and Functional Data Ranking

    KAUST Repository

    Huang, Huang

    2017-01-01

    that are both computationally and statistically efficient. We explore the improvement of the approximation theoretically and investigate the performance by simulations. For real applications, we analyze a soil moisture dataset with 2 million measurements

  13. Accounting for inertia in modal choices: some new evidence using a RP/SP dataset

    DEFF Research Database (Denmark)

    Cherchi, Elisabetta; Manca, Francesco

    2011-01-01

    effect is stable along the SP experiments. Inertia has been studied more extensively with panel datasets, but few investigations have used RP/SP datasets. In this paper we extend previous work in several ways. We test and compare several ways of measuring inertia, including measures that have been...... proposed for both short and long RP panel datasets. We also explore new measures of inertia to test for the effect of “learning” (in the sense of acquiring experience or getting more familiar with) along the SP experiment and we disentangle this effect from the pure inertia effect. A mixed logit model...... is used that allows us to account for both systematic and random taste variations in the inertia effect and for correlations among RP and SP observations. Finally we explore the relation between the utility specification (especially in the SP dataset) and the role of inertia in explaining current choices....

  14. MULTI-LABEL ASRS DATASET CLASSIFICATION USING SEMI-SUPERVISED SUBSPACE CLUSTERING

    Data.gov (United States)

    National Aeronautics and Space Administration — MULTI-LABEL ASRS DATASET CLASSIFICATION USING SEMI-SUPERVISED SUBSPACE CLUSTERING MOHAMMAD SALIM AHMED, LATIFUR KHAN, NIKUNJ OZA, AND MANDAVA RAJESWARI Abstract....

  15. Incorporated City Boundaries in Iowa in 2010 as Derived from the Census Places Dataset

    Data.gov (United States)

    Iowa State University GIS Support and Research Facility — Incorporated Cities in Iowa in 2010, as derived from the Census Places dataset. Original abstract: The TIGER/Line Files are shapefiles and related database files...

  16. AFSC/REFM: Seabird food habits dataset of the North Pacific

    Data.gov (United States)

    National Oceanic and Atmospheric Administration, Department of Commerce — The seabird food habits dataset contains information on the stomach contents from seabird specimens that were collected under salvage and scientific collection...

  17. IPCC IS92 Emissions Scenarios (A, B, C, D, E, F) Dataset Version 1.1

    Data.gov (United States)

    National Aeronautics and Space Administration — The Intergovernmental Panel on Climate Change (IPCC) IS92 Emissions Scenarios (A, B, C, D, E, F) Dataset Version 1.1 consists of six global and regional greenhouse...

  18. Dataset for Testing Contamination Source Identification Methods for Water Distribution Networks

    Data.gov (United States)

    U.S. Environmental Protection Agency — This dataset includes the results of a simulation study using the source inversion techniques available in the Water Security Toolkit. The data was created to test...

  19. EPA Office of Water (OW): 2002 Impaired Waters Baseline NHDPlus Indexed Dataset

    Data.gov (United States)

    U.S. Environmental Protection Agency — This dataset consists of geospatial and attribute data identifying the spatial extent of state-reported impaired waters (EPA's Integrated Reporting categories 4a,...

  20. Datasets used in ORD-018902: Bisphenol A alternatives can effectively substitute for estradiol

    Data.gov (United States)

    U.S. Environmental Protection Agency — Gene Expression Omnibus numbers only. This dataset is associated with the following publication: Mesnage, R., A. Phedonos, M. Arno, S. Balu, C. Corton, and M....

  1. Evaluation of Modified Categorical Data Fuzzy Clustering Algorithm on the Wisconsin Breast Cancer Dataset

    Directory of Open Access Journals (Sweden)

    Amir Ahmad

    2016-01-01

    Full Text Available The early diagnosis of breast cancer is an important step in a fight against the disease. Machine learning techniques have shown promise in improving our understanding of the disease. As medical datasets consist of data points which cannot be precisely assigned to a class, fuzzy methods have been useful for studying of these datasets. Sometimes breast cancer datasets are described by categorical features. Many fuzzy clustering algorithms have been developed for categorical datasets. However, in most of these methods Hamming distance is used to define the distance between the two categorical feature values. In this paper, we use a probabilistic distance measure for the distance computation among a pair of categorical feature values. Experiments demonstrate that the distance measure performs better than Hamming distance for Wisconsin breast cancer data.

  2. Multivariate Analysis of Multiple Datasets: a Practical Guide for Chemical Ecology.

    Science.gov (United States)

    Hervé, Maxime R; Nicolè, Florence; Lê Cao, Kim-Anh

    2018-03-01

    Chemical ecology has strong links with metabolomics, the large-scale study of all metabolites detectable in a biological sample. Consequently, chemical ecologists are often challenged by the statistical analyses of such large datasets. This holds especially true when the purpose is to integrate multiple datasets to obtain a holistic view and a better understanding of a biological system under study. The present article provides a comprehensive resource to analyze such complex datasets using multivariate methods. It starts from the necessary pre-treatment of data including data transformations and distance calculations, to the application of both gold standard and novel multivariate methods for the integration of different omics data. We illustrate the process of analysis along with detailed results interpretations for six issues representative of the different types of biological questions encountered by chemical ecologists. We provide the necessary knowledge and tools with reproducible R codes and chemical-ecological datasets to practice and teach multivariate methods.

  3. New fuzzy support vector machine for the class imbalance problem in medical datasets classification.

    Science.gov (United States)

    Gu, Xiaoqing; Ni, Tongguang; Wang, Hongyuan

    2014-01-01

    In medical datasets classification, support vector machine (SVM) is considered to be one of the most successful methods. However, most of the real-world medical datasets usually contain some outliers/noise and data often have class imbalance problems. In this paper, a fuzzy support machine (FSVM) for the class imbalance problem (called FSVM-CIP) is presented, which can be seen as a modified class of FSVM by extending manifold regularization and assigning two misclassification costs for two classes. The proposed FSVM-CIP can be used to handle the class imbalance problem in the presence of outliers/noise, and enhance the locality maximum margin. Five real-world medical datasets, breast, heart, hepatitis, BUPA liver, and pima diabetes, from the UCI medical database are employed to illustrate the method presented in this paper. Experimental results on these datasets show the outperformed or comparable effectiveness of FSVM-CIP.

  4. New Fuzzy Support Vector Machine for the Class Imbalance Problem in Medical Datasets Classification

    Directory of Open Access Journals (Sweden)

    Xiaoqing Gu

    2014-01-01

    Full Text Available In medical datasets classification, support vector machine (SVM is considered to be one of the most successful methods. However, most of the real-world medical datasets usually contain some outliers/noise and data often have class imbalance problems. In this paper, a fuzzy support machine (FSVM for the class imbalance problem (called FSVM-CIP is presented, which can be seen as a modified class of FSVM by extending manifold regularization and assigning two misclassification costs for two classes. The proposed FSVM-CIP can be used to handle the class imbalance problem in the presence of outliers/noise, and enhance the locality maximum margin. Five real-world medical datasets, breast, heart, hepatitis, BUPA liver, and pima diabetes, from the UCI medical database are employed to illustrate the method presented in this paper. Experimental results on these datasets show the outperformed or comparable effectiveness of FSVM-CIP.

  5. A Dataset of Aerial Survey Counts of Harbor Seals in Iliamna Lake, Alaska: 1984-2013

    Data.gov (United States)

    National Oceanic and Atmospheric Administration, Department of Commerce — This dataset provides counts of harbor seals from aerial surveys over Iliamna Lake, Alaska, USA. The data have been collated from three previously published sources...

  6. Political Township and Incorporated City Boundaries in Iowa in 2010 as Derived from Census Datasets

    Data.gov (United States)

    Iowa State University GIS Support and Research Facility — Political Township and Incorporated City Boundaries in Iowa in 2010, as Derived from Census Datasets Original Abstract: The TIGER/Line Files are shapefiles and...

  7. Global-scale evaluation of 22 precipitation datasets using gauge observations and hydrological modeling

    Directory of Open Access Journals (Sweden)

    H. E. Beck

    2017-12-01

    Full Text Available We undertook a comprehensive evaluation of 22 gridded (quasi-global (sub-daily precipitation (P datasets for the period 2000–2016. Thirteen non-gauge-corrected P datasets were evaluated using daily P gauge observations from 76 086 gauges worldwide. Another nine gauge-corrected datasets were evaluated using hydrological modeling, by calibrating the HBV conceptual model against streamflow records for each of 9053 small to medium-sized ( <  50 000 km2 catchments worldwide, and comparing the resulting performance. Marked differences in spatio-temporal patterns and accuracy were found among the datasets. Among the uncorrected P datasets, the satellite- and reanalysis-based MSWEP-ng V1.2 and V2.0 datasets generally showed the best temporal correlations with the gauge observations, followed by the reanalyses (ERA-Interim, JRA-55, and NCEP-CFSR and the satellite- and reanalysis-based CHIRP V2.0 dataset, the estimates based primarily on passive microwave remote sensing of rainfall (CMORPH V1.0, GSMaP V5/6, and TMPA 3B42RT V7 or near-surface soil moisture (SM2RAIN-ASCAT, and finally, estimates based primarily on thermal infrared imagery (GridSat V1.0, PERSIANN, and PERSIANN-CCS. Two of the three reanalyses (ERA-Interim and JRA-55 unexpectedly obtained lower trend errors than the satellite datasets. Among the corrected P datasets, the ones directly incorporating daily gauge data (CPC Unified, and MSWEP V1.2 and V2.0 generally provided the best calibration scores, although the good performance of the fully gauge-based CPC Unified is unlikely to translate to sparsely or ungauged regions. Next best results were obtained with P estimates directly incorporating temporally coarser gauge data (CHIRPS V2.0, GPCP-1DD V1.2, TMPA 3B42 V7, and WFDEI-CRU, which in turn outperformed the one indirectly incorporating gauge data through another multi-source dataset (PERSIANN-CDR V1R1. Our results highlight large differences in estimation accuracy

  8. The Lunar Source Disk: Old Lunar Datasets on a New CD-ROM

    Science.gov (United States)

    Hiesinger, H.

    1998-01-01

    A compilation of previously published datasets on CD-ROM is presented. This Lunar Source Disk is intended to be a first step in the improvement/expansion of the Lunar Consortium Disk, in order to create an "image-cube"-like data pool that can be easily accessed and might be useful for a variety of future lunar investigations. All datasets were transformed to a standard map projection that allows direct comparison of different types of information on a pixel-by pixel basis. Lunar observations have a long history and have been important to mankind for centuries, notably since the work of Plutarch and Galileo. As a consequence of centuries of lunar investigations, knowledge of the characteristics and properties of the Moon has accumulated over time. However, a side effect of this accumulation is that it has become more and more complicated for scientists to review all the datasets obtained through different techniques, to interpret them properly, to recognize their weaknesses and strengths in detail, and to combine them synoptically in geologic interpretations. Such synoptic geologic interpretations are crucial for the study of planetary bodies through remote-sensing data in order to avoid misinterpretation. In addition, many of the modem datasets, derived from Earth-based telescopes as well as from spacecraft missions, are acquired at different geometric and radiometric conditions. These differences make it challenging to compare or combine datasets directly or to extract information from different datasets on a pixel-by-pixel basis. Also, as there is no convention for the presentation of lunar datasets, different authors choose different map projections, depending on the location of the investigated areas and their personal interests. Insufficient or incomplete information on the map parameters used by different authors further complicates the reprojection of these datasets to a standard geometry. The goal of our efforts was to transfer previously published lunar

  9. EVALUATION OF LAND USE/LAND COVER DATASETS FOR URBAN WATERSHED MODELING

    International Nuclear Information System (INIS)

    S.J. BURIAN; M.J. BROWN; T.N. MCPHERSON

    2001-01-01

    Land use/land cover (LULC) data are a vital component for nonpoint source pollution modeling. Most watershed hydrology and pollutant loading models use, in some capacity, LULC information to generate runoff and pollutant loading estimates. Simple equation methods predict runoff and pollutant loads using runoff coefficients or pollutant export coefficients that are often correlated to LULC type. Complex models use input variables and parameters to represent watershed characteristics and pollutant buildup and washoff rates as a function of LULC type. Whether using simple or complex models an accurate LULC dataset with an appropriate spatial resolution and level of detail is paramount for reliable predictions. The study presented in this paper compared and evaluated several LULC dataset sources for application in urban environmental modeling. The commonly used USGS LULC datasets have coarser spatial resolution and lower levels of classification than other LULC datasets. In addition, the USGS datasets do not accurately represent the land use in areas that have undergone significant land use change during the past two decades. We performed a watershed modeling analysis of three urban catchments in Los Angeles, California, USA to investigate the relative difference in average annual runoff volumes and total suspended solids (TSS) loads when using the USGS LULC dataset versus using a more detailed and current LULC dataset. When the two LULC datasets were aggregated to the same land use categories, the relative differences in predicted average annual runoff volumes and TSS loads from the three catchments were 8 to 14% and 13 to 40%, respectively. The relative differences did not have a predictable relationship with catchment size

  10. RE-Europe, a large-scale dataset for modeling a highly renewable European electricity system

    DEFF Research Database (Denmark)

    Jensen, Tue Vissing; Pinson, Pierre

    2017-01-01

    , we describe a dedicated large-scale dataset for a renewable electric power system. The dataset combines a transmission network model, as well as information for generation and demand. Generation includes conventional generators with their technical and economic characteristics, as well as weather-driven...... to the evaluation, scaling analysis and replicability check of a wealth of proposals in, e.g., market design, network actor coordination and forecastingof renewable power generation....

  11. Do Higher Government Wages Reduce Corruption? Evidence Based on a Novel Dataset

    OpenAIRE

    Le, Van-Ha; de Haan, Jakob; Dietzenbacher, Erik

    2013-01-01

    This paper employs a novel dataset on government wages to investigate the relationship between government remuneration policy and corruption. Our dataset, as derived from national household or labor surveys, is more reliable than the data on government wages as used in previous research. When the relationship between government wages and corruption is modeled to vary with the level of income, we find that the impact of government wages on corruption is strong at relatively low-income levels.

  12. An integrated pan-tropical biomass map using multiple reference datasets

    OpenAIRE

    Avitabile, V.; Herold, M.; Heuvelink, G. B. M.; Lewis, S. L.; Phillips, O. L.; Asner, G. P.; Armston, J.; Ashton, P. S.; Banin, L.; Bayol, N.; Berry, N. J.; Boeckx, P.; de Jong, B. H. J.; DeVries, B.; Girardin, C. A. J.

    2016-01-01

    We combined two existing datasets of vegetation aboveground biomass (AGB) (Proceedings of the National Academy of Sciences of the United States of America, 108, 2011, 9899; Nature Climate Change, 2, 2012, 182) into a pan-tropical AGB map at 1-km resolution using an independent reference dataset of field observations and locally calibrated high-resolution biomass maps, harmonized and upscaled to 14 477 1-km AGB estimates. Our data fusion approach uses bias removal and weighted linear averaging...

  13. Dataset of Phenology of Mediterranean high-mountain meadows flora (Sierra Nevada, Spain)

    OpenAIRE

    Antonio Jesús Pérez-Luque; Cristina Patricia Sánchez-Rojas; Regino Zamora; Ramón Pérez-Pérez; Francisco Javier Bonet

    2015-01-01

    Abstract Sierra Nevada mountain range (southern Spain) hosts a high number of endemic plant species, being one of the most important biodiversity hotspots in the Mediterranean basin. The high-mountain meadow ecosystems (borreguiles) harbour a large number of endemic and threatened plant species. In this data paper, we describe a dataset of the flora inhabiting this threatened ecosystem in this Mediterranean mountain. The dataset includes occurrence data for flora collected in those ecosystems...

  14. Creating a Regional MODIS Satellite-Driven Net Primary Production Dataset for European Forests

    OpenAIRE

    Neumann, Mathias; Moreno, Adam; Thurnher, Christopher; Mues, Volker; Härkönen, Sanna; Mura, Matteo; Bouriaud, Olivier; Lang, Mait; Cardellini, Giuseppe; Thivolle-Cazat, Alain; Bronisz, Karol; Merganic, Jan; Alberdi, Iciar; Astrup, Rasmus; Mohren, Frits

    2016-01-01

    Net primary production (NPP) is an important ecological metric for studying forest ecosystems and their carbon sequestration, for assessing the potential supply of food or timber and quantifying the impacts of climate change on ecosystems. The global MODIS NPP dataset using the MOD17 algorithm provides valuable information for monitoring NPP at 1-km resolution. Since coarse-resolution global climate data are used, the global dataset may contain uncertainties for Europe. We used a 1-km daily g...

  15. Creating a Regional MODIS Satellite-Driven Net Primary Production Dataset for European Forests

    Directory of Open Access Journals (Sweden)

    Mathias Neumann

    2016-06-01

    Full Text Available Net primary production (NPP is an important ecological metric for studying forest ecosystems and their carbon sequestration, for assessing the potential supply of food or timber and quantifying the impacts of climate change on ecosystems. The global MODIS NPP dataset using the MOD17 algorithm provides valuable information for monitoring NPP at 1-km resolution. Since coarse-resolution global climate data are used, the global dataset may contain uncertainties for Europe. We used a 1-km daily gridded European climate data set with the MOD17 algorithm to create the regional NPP dataset MODIS EURO. For evaluation of this new dataset, we compare MODIS EURO with terrestrial driven NPP from analyzing and harmonizing forest inventory data (NFI from 196,434 plots in 12 European countries as well as the global MODIS NPP dataset for the years 2000 to 2012. Comparing these three NPP datasets, we found that the global MODIS NPP dataset differs from NFI NPP by 26%, while MODIS EURO only differs by 7%. MODIS EURO also agrees with NFI NPP across scales (from continental, regional to country and gradients (elevation, location, tree age, dominant species, etc.. The agreement is particularly good for elevation, dominant species or tree height. This suggests that using improved climate data allows the MOD17 algorithm to provide realistic NPP estimates for Europe. Local discrepancies between MODIS EURO and NFI NPP can be related to differences in stand density due to forest management and the national carbon estimation methods. With this study, we provide a consistent, temporally continuous and spatially explicit productivity dataset for the years 2000 to 2012 on a 1-km resolution, which can be used to assess climate change impacts on ecosystems or the potential biomass supply of the European forests for an increasing bio-based economy. MODIS EURO data are made freely available at ftp://palantir.boku.ac.at/Public/MODIS_EURO.

  16. Bayesian Method for Building Frequent Landsat-Like NDVI Datasets by Integrating MODIS and Landsat NDVI

    OpenAIRE

    Limin Liao; Jinling Song; Jindi Wang; Zhiqiang Xiao; Jian Wang

    2016-01-01

    Studies related to vegetation dynamics in heterogeneous landscapes often require Normalized Difference Vegetation Index (NDVI) datasets with both high spatial resolution and frequent coverage, which cannot be satisfied by a single sensor due to technical limitations. In this study, we propose a new method called NDVI-Bayesian Spatiotemporal Fusion Model (NDVI-BSFM) for accurately and effectively building frequent high spatial resolution Landsat-like NDVI datasets by integrating Moderate Resol...

  17. Calculation of climatic reference values and its use for automatic outlier detection in meteorological datasets

    Directory of Open Access Journals (Sweden)

    B. Téllez

    2008-04-01

    Full Text Available The climatic reference values for monthly and annual average air temperature and total precipitation in Catalonia – northeast of Spain – are calculated using a combination of statistical methods and geostatistical techniques of interpolation. In order to estimate the uncertainty of the method, the initial dataset is split into two parts that are, respectively, used for estimation and validation. The resulting maps are then used in the automatic outlier detection in meteorological datasets.

  18. Self-Reported Juvenile Firesetting: Results from Two National Survey Datasets

    OpenAIRE

    Howell Bowling, Carrie; Merrick, Joav; Omar, Hatim A.

    2013-01-01

    The main purpose of this study was to address gaps in existing research by examining the relationship between academic performance and attention problems with juvenile firesetting. Two datasets from the Achenbach System for Empirically Based Assessment (ASEBA) were used. The Factor Analysis Dataset (N = 975) was utilized and results indicated that adolescents who report lower academic performance are more likely to set fires. Additionally, adolescents who report a poor attitude toward school ...

  19. Self-reported juvenile firesetting: Results from two national survey datasets

    OpenAIRE

    Carrie Howell Bowling; Joav eMerrick; Joav eMerrick; Joav eMerrick; Joav eMerrick; Hatim A Omar

    2013-01-01

    The main purpose of this study was to address gaps in existing research by examining the relationship between academic performance and attention problems with juvenile firesetting. Two datasets from the Achenbach System for Empirically Based Assessment (ASEBA) were used. The Factor Analysis Dataset (N = 975) was utilized and results indicated that adolescents who report lower academic performance are more likely to set fires. Additionally, adolescents who report a poor attitude toward school...

  20. Exudate-based diabetic macular edema detection in fundus images using publicly available datasets

    Energy Technology Data Exchange (ETDEWEB)

    Giancardo, Luca [ORNL; Meriaudeau, Fabrice [ORNL; Karnowski, Thomas Paul [ORNL; Li, Yaquin [University of Tennessee, Knoxville (UTK); Garg, Seema [University of North Carolina; Tobin Jr, Kenneth William [ORNL; Chaum, Edward [University of Tennessee, Knoxville (UTK)

    2011-01-01

    Diabetic macular edema (DME) is a common vision threatening complication of diabetic retinopathy. In a large scale screening environment DME can be assessed by detecting exudates (a type of bright lesions) in fundus images. In this work, we introduce a new methodology for diagnosis of DME using a novel set of features based on colour, wavelet decomposition and automatic lesion segmentation. These features are employed to train a classifier able to automatically diagnose DME through the presence of exudation. We present a new publicly available dataset with ground-truth data containing 169 patients from various ethnic groups and levels of DME. This and other two publicly available datasets are employed to evaluate our algorithm. We are able to achieve diagnosis performance comparable to retina experts on the MESSIDOR (an independently labelled dataset with 1200 images) with cross-dataset testing (e.g., the classifier was trained on an independent dataset and tested on MESSIDOR). Our algorithm obtained an AUC between 0.88 and 0.94 depending on the dataset/features used. Additionally, it does not need ground truth at lesion level to reject false positives and is computationally efficient, as it generates a diagnosis on an average of 4.4 s (9.3 s, considering the optic nerve localization) per image on an 2.6 GHz platform with an unoptimized Matlab implementation.

  1. ClimateNet: A Machine Learning dataset for Climate Science Research

    Science.gov (United States)

    Prabhat, M.; Biard, J.; Ganguly, S.; Ames, S.; Kashinath, K.; Kim, S. K.; Kahou, S.; Maharaj, T.; Beckham, C.; O'Brien, T. A.; Wehner, M. F.; Williams, D. N.; Kunkel, K.; Collins, W. D.

    2017-12-01

    Deep Learning techniques have revolutionized commercial applications in Computer vision, speech recognition and control systems. The key for all of these developments was the creation of a curated, labeled dataset ImageNet, for enabling multiple research groups around the world to develop methods, benchmark performance and compete with each other. The success of Deep Learning can be largely attributed to the broad availability of this dataset. Our empirical investigations have revealed that Deep Learning is similarly poised to benefit the task of pattern detection in climate science. Unfortunately, labeled datasets, a key pre-requisite for training, are hard to find. Individual research groups are typically interested in specialized weather patterns, making it hard to unify, and share datasets across groups and institutions. In this work, we are proposing ClimateNet: a labeled dataset that provides labeled instances of extreme weather patterns, as well as associated raw fields in model and observational output. We develop a schema in NetCDF to enumerate weather pattern classes/types, store bounding boxes, and pixel-masks. We are also working on a TensorFlow implementation to natively import such NetCDF datasets, and are providing a reference convolutional architecture for binary classification tasks. Our hope is that researchers in Climate Science, as well as ML/DL, will be able to use (and extend) ClimateNet to make rapid progress in the application of Deep Learning for Climate Science research.

  2. A robust post-processing workflow for datasets with motion artifacts in diffusion kurtosis imaging.

    Science.gov (United States)

    Li, Xianjun; Yang, Jian; Gao, Jie; Luo, Xue; Zhou, Zhenyu; Hu, Yajie; Wu, Ed X; Wan, Mingxi

    2014-01-01

    The aim of this study was to develop a robust post-processing workflow for motion-corrupted datasets in diffusion kurtosis imaging (DKI). The proposed workflow consisted of brain extraction, rigid registration, distortion correction, artifacts rejection, spatial smoothing and tensor estimation. Rigid registration was utilized to correct misalignments. Motion artifacts were rejected by using local Pearson correlation coefficient (LPCC). The performance of LPCC in characterizing relative differences between artifacts and artifact-free images was compared with that of the conventional correlation coefficient in 10 randomly selected DKI datasets. The influence of rejected artifacts with information of gradient directions and b values for the parameter estimation was investigated by using mean square error (MSE). The variance of noise was used as the criterion for MSEs. The clinical practicality of the proposed workflow was evaluated by the image quality and measurements in regions of interest on 36 DKI datasets, including 18 artifact-free (18 pediatric subjects) and 18 motion-corrupted datasets (15 pediatric subjects and 3 essential tremor patients). The relative difference between artifacts and artifact-free images calculated by LPCC was larger than that of the conventional correlation coefficient (pworkflow improved the image quality and reduced the measurement biases significantly on motion-corrupted datasets (pworkflow was reliable to improve the image quality and the measurement precision of the derived parameters on motion-corrupted DKI datasets. The workflow provided an effective post-processing method for clinical applications of DKI in subjects with involuntary movements.

  3. Box photosynthesis modeling results for WRF/CMAQ LSM

    Data.gov (United States)

    U.S. Environmental Protection Agency — Box Photosynthesis model simulations for latent heat and ozone at 6 different FLUXNET sites. This dataset is associated with the following publication: Ran, L., J....

  4. Dataset of Building and Environment Publication in 2016, A reference method for measuring emissions of SVOCs in small chambers

    Data.gov (United States)

    U.S. Environmental Protection Agency — The data presented in this data file is a product of a journal publication. The dataset contains DEHP air concentrations in the emission test chamber. This dataset...

  5. Data-Driven Decision Support for Radiologists: Re-using the National Lung Screening Trial Dataset for Pulmonary Nodule Management

    OpenAIRE

    Morrison, James J.; Hostetter, Jason; Wang, Kenneth; Siegel, Eliot L.

    2014-01-01

    Real-time mining of large research trial datasets enables development of case-based clinical decision support tools. Several applicable research datasets exist including the National Lung Screening Trial (NLST), a dataset unparalleled in size and scope for studying population-based lung cancer screening. Using these data, a clinical decision support tool was developed which matches patient demographics and lung nodule characteristics to a cohort of similar patients. The NLST dataset was conve...

  6. Benchmarking of Typical Meteorological Year datasets dedicated to Concentrated-PV systems

    Science.gov (United States)

    Realpe, Ana Maria; Vernay, Christophe; Pitaval, Sébastien; Blanc, Philippe; Wald, Lucien; Lenoir, Camille

    2016-04-01

    Accurate analysis of meteorological and pyranometric data for long-term analysis is the basis of decision-making for banks and investors, regarding solar energy conversion systems. This has led to the development of methodologies for the generation of Typical Meteorological Years (TMY) datasets. The most used method for solar energy conversion systems was proposed in 1978 by the Sandia Laboratory (Hall et al., 1978) considering a specific weighted combination of different meteorological variables with notably global, diffuse horizontal and direct normal irradiances, air temperature, wind speed, relative humidity. In 2012, a new approach was proposed in the framework of the European project FP7 ENDORSE. It introduced the concept of "driver" that is defined by the user as an explicit function of the pyranometric and meteorological relevant variables to improve the representativeness of the TMY datasets with respect the specific solar energy conversion system of interest. The present study aims at comparing and benchmarking different TMY datasets considering a specific Concentrated-PV (CPV) system as the solar energy conversion system of interest. Using long-term (15+ years) time-series of high quality meteorological and pyranometric ground measurements, three types of TMY datasets generated by the following methods: the Sandia method, a simplified driver with DNI as the only representative variable and a more sophisticated driver. The latter takes into account the sensitivities of the CPV system with respect to the spectral distribution of the solar irradiance and wind speed. Different TMY datasets from the three methods have been generated considering different numbers of years in the historical dataset, ranging from 5 to 15 years. The comparisons and benchmarking of these TMY datasets are conducted considering the long-term time series of simulated CPV electric production as a reference. The results of this benchmarking clearly show that the Sandia method is not

  7. REM-3D Reference Datasets: Reconciling large and diverse compilations of travel-time observations

    Science.gov (United States)

    Moulik, P.; Lekic, V.; Romanowicz, B. A.

    2017-12-01

    A three-dimensional Reference Earth model (REM-3D) should ideally represent the consensus view of long-wavelength heterogeneity in the Earth's mantle through the joint modeling of large and diverse seismological datasets. This requires reconciliation of datasets obtained using various methodologies and identification of consistent features. The goal of REM-3D datasets is to provide a quality-controlled and comprehensive set of seismic observations that would not only enable construction of REM-3D, but also allow identification of outliers and assist in more detailed studies of heterogeneity. The community response to data solicitation has been enthusiastic with several groups across the world contributing recent measurements of normal modes, (fundamental mode and overtone) surface waves, and body waves. We present results from ongoing work with body and surface wave datasets analyzed in consultation with a Reference Dataset Working Group. We have formulated procedures for reconciling travel-time datasets that include: (1) quality control for salvaging missing metadata; (2) identification of and reasons for discrepant measurements; (3) homogenization of coverage through the construction of summary rays; and (4) inversions of structure at various wavelengths to evaluate inter-dataset consistency. In consultation with the Reference Dataset Working Group, we retrieved the station and earthquake metadata in several legacy compilations and codified several guidelines that would facilitate easy storage and reproducibility. We find strong agreement between the dispersion measurements of fundamental-mode Rayleigh waves, particularly when made using supervised techniques. The agreement deteriorates substantially in surface-wave overtones, for which discrepancies vary with frequency and overtone number. A half-cycle band of discrepancies is attributed to reversed instrument polarities at a limited number of stations, which are not reflected in the instrument response history

  8. FLUXCOM - Overview and First Synthesis

    Science.gov (United States)

    Jung, M.; Ichii, K.; Tramontana, G.; Camps-Valls, G.; Schwalm, C. R.; Papale, D.; Reichstein, M.; Gans, F.; Weber, U.

    2015-12-01

    We present a community effort aiming at generating an ensemble of global gridded flux products by upscaling FLUXNET data using an array of different machine learning methods including regression/model tree ensembles, neural networks, and kernel machines. We produced products for gross primary production, terrestrial ecosystem respiration, net ecosystem exchange, latent heat, sensible heat, and net radiation for two experimental protocols: 1) at a high spatial and 8-daily temporal resolution (5 arc-minute) using only remote sensing based inputs for the MODIS era; 2) 30 year records of daily, 0.5 degree spatial resolution by incorporating meteorological driver data. Within each set-up, all machine learning methods were trained with the same input data for carbon and energy fluxes respectively. Sets of input driver variables were derived using an extensive formal variable selection exercise. The performance of the extrapolation capacities of the approaches is assessed with a fully internally consistent cross-validation. We perform cross-consistency checks of the gridded flux products with independent data streams from atmospheric inversions (NEE), sun-induced fluorescence (GPP), catchment water balances (LE, H), satellite products (Rn), and process-models. We analyze the uncertainties of the gridded flux products and for example provide a breakdown of the uncertainty of mean annual GPP originating from different machine learning methods, different climate input data sets, and different flux partitioning methods. The FLUXCOM archive will provide an unprecedented source of information for water, energy, and carbon cycle studies.

  9. ORBDA: An openEHR benchmark dataset for performance assessment of electronic health record servers.

    Directory of Open Access Journals (Sweden)

    Douglas Teodoro

    Full Text Available The openEHR specifications are designed to support implementation of flexible and interoperable Electronic Health Record (EHR systems. Despite the increasing number of solutions based on the openEHR specifications, it is difficult to find publicly available healthcare datasets in the openEHR format that can be used to test, compare and validate different data persistence mechanisms for openEHR. To foster research on openEHR servers, we present the openEHR Benchmark Dataset, ORBDA, a very large healthcare benchmark dataset encoded using the openEHR formalism. To construct ORBDA, we extracted and cleaned a de-identified dataset from the Brazilian National Healthcare System (SUS containing hospitalisation and high complexity procedures information and formalised it using a set of openEHR archetypes and templates. Then, we implemented a tool to enrich the raw relational data and convert it into the openEHR model using the openEHR Java reference model library. The ORBDA dataset is available in composition, versioned composition and EHR openEHR representations in XML and JSON formats. In total, the dataset contains more than 150 million composition records. We describe the dataset and provide means to access it. Additionally, we demonstrate the usage of ORBDA for evaluating inserting throughput and query latency performances of some NoSQL database management systems. We believe that ORBDA is a valuable asset for assessing storage models for openEHR-based information systems during the software engineering process. It may also be a suitable component in future standardised benchmarking of available openEHR storage platforms.

  10. ORBDA: An openEHR benchmark dataset for performance assessment of electronic health record servers

    Science.gov (United States)

    Sundvall, Erik; João Junior, Mario; Ruch, Patrick; Miranda Freire, Sergio

    2018-01-01

    The openEHR specifications are designed to support implementation of flexible and interoperable Electronic Health Record (EHR) systems. Despite the increasing number of solutions based on the openEHR specifications, it is difficult to find publicly available healthcare datasets in the openEHR format that can be used to test, compare and validate different data persistence mechanisms for openEHR. To foster research on openEHR servers, we present the openEHR Benchmark Dataset, ORBDA, a very large healthcare benchmark dataset encoded using the openEHR formalism. To construct ORBDA, we extracted and cleaned a de-identified dataset from the Brazilian National Healthcare System (SUS) containing hospitalisation and high complexity procedures information and formalised it using a set of openEHR archetypes and templates. Then, we implemented a tool to enrich the raw relational data and convert it into the openEHR model using the openEHR Java reference model library. The ORBDA dataset is available in composition, versioned composition and EHR openEHR representations in XML and JSON formats. In total, the dataset contains more than 150 million composition records. We describe the dataset and provide means to access it. Additionally, we demonstrate the usage of ORBDA for evaluating inserting throughput and query latency performances of some NoSQL database management systems. We believe that ORBDA is a valuable asset for assessing storage models for openEHR-based information systems during the software engineering process. It may also be a suitable component in future standardised benchmarking of available openEHR storage platforms. PMID:29293556

  11. [Research on developping the spectral dataset for Dunhuang typical colors based on color constancy].

    Science.gov (United States)

    Liu, Qiang; Wan, Xiao-Xia; Liu, Zhen; Li, Chan; Liang, Jin-Xing

    2013-11-01

    The present paper aims at developping a method to reasonably set up the typical spectral color dataset for different kinds of Chinese cultural heritage in color rendering process. The world famous wall paintings dating from more than 1700 years ago in Dunhuang Mogao Grottoes was taken as typical case in this research. In order to maintain the color constancy during the color rendering workflow of Dunhuang culture relics, a chromatic adaptation based method for developping the spectral dataset of typical colors for those wall paintings was proposed from the view point of human vision perception ability. Under the help and guidance of researchers in the art-research institution and protection-research institution of Dunhuang Academy and according to the existing research achievement of Dunhuang Research in the past years, 48 typical known Dunhuang pigments were chosen and 240 representative color samples were made with reflective spectral ranging from 360 to 750 nm was acquired by a spectrometer. In order to find the typical colors of the above mentioned color samples, the original dataset was devided into several subgroups by clustering analysis. The grouping number, together with the most typical samples for each subgroup which made up the firstly built typical color dataset, was determined by wilcoxon signed rank test according to the color inconstancy index comprehensively calculated under 6 typical illuminating conditions. Considering the completeness of gamut of Dunhuang wall paintings, 8 complementary colors was determined and finally the typical spectral color dataset was built up which contains 100 representative spectral colors. The analytical calculating results show that the median color inconstancy index of the built dataset in 99% confidence level by wilcoxon signed rank test was 3.28 and the 100 colors are distributing in the whole gamut uniformly, which ensures that this dataset can provide reasonable reference for choosing the color with highest

  12. A procedure to validate and correct the {sup 13}C chemical shift calibration of RNA datasets

    Energy Technology Data Exchange (ETDEWEB)

    Aeschbacher, Thomas; Schubert, Mario, E-mail: schubert@mol.biol.ethz.ch; Allain, Frederic H.-T., E-mail: allain@mol.biol.ethz.ch [ETH Zuerich, Institute for Molecular Biology and Biophysics (Switzerland)

    2012-02-15

    Chemical shifts reflect the structural environment of a certain nucleus and can be used to extract structural and dynamic information. Proper calibration is indispensable to extract such information from chemical shifts. Whereas a variety of procedures exist to verify the chemical shift calibration for proteins, no such procedure is available for RNAs to date. We present here a procedure to analyze and correct the calibration of {sup 13}C NMR data of RNAs. Our procedure uses five {sup 13}C chemical shifts as a reference, each of them found in a narrow shift range in most datasets deposited in the Biological Magnetic Resonance Bank. In 49 datasets we could evaluate the {sup 13}C calibration and detect errors or inconsistencies in RNA {sup 13}C chemical shifts based on these chemical shift reference values. More than half of the datasets (27 out of those 49) were found to be improperly referenced or contained inconsistencies. This large inconsistency rate possibly explains that no clear structure-{sup 13}C chemical shift relationship has emerged for RNA so far. We were able to recalibrate or correct 17 datasets resulting in 39 usable {sup 13}C datasets. 6 new datasets from our lab were used to verify our method increasing the database to 45 usable datasets. We can now search for structure-chemical shift relationships with this improved list of {sup 13}C chemical shift data. This is demonstrated by a clear relationship between ribose {sup 13}C shifts and the sugar pucker, which can be used to predict a C2 Prime - or C3 Prime -endo conformation of the ribose with high accuracy. The improved quality of the chemical shift data allows statistical analysis with the potential to facilitate assignment procedures, and the extraction of restraints for structure calculations of RNA.

  13. Large Scale Flood Risk Analysis using a New Hyper-resolution Population Dataset

    Science.gov (United States)

    Smith, A.; Neal, J. C.; Bates, P. D.; Quinn, N.; Wing, O.

    2017-12-01

    Here we present the first national scale flood risk analyses, using high resolution Facebook Connectivity Lab population data and data from a hyper resolution flood hazard model. In recent years the field of large scale hydraulic modelling has been transformed by new remotely sensed datasets, improved process representation, highly efficient flow algorithms and increases in computational power. These developments have allowed flood risk analysis to be undertaken in previously unmodeled territories and from continental to global scales. Flood risk analyses are typically conducted via the integration of modelled water depths with an exposure dataset. Over large scales and in data poor areas, these exposure data typically take the form of a gridded population dataset, estimating population density using remotely sensed data and/or locally available census data. The local nature of flooding dictates that for robust flood risk analysis to be undertaken both hazard and exposure data should sufficiently resolve local scale features. Global flood frameworks are enabling flood hazard data to produced at 90m resolution, resulting in a mis-match with available population datasets which are typically more coarsely resolved. Moreover, these exposure data are typically focused on urban areas and struggle to represent rural populations. In this study we integrate a new population dataset with a global flood hazard model. The population dataset was produced by the Connectivity Lab at Facebook, providing gridded population data at 5m resolution, representing a resolution increase over previous countrywide data sets of multiple orders of magnitude. Flood risk analysis undertaken over a number of developing countries are presented, along with a comparison of flood risk analyses undertaken using pre-existing population datasets.

  14. ORBDA: An openEHR benchmark dataset for performance assessment of electronic health record servers.

    Science.gov (United States)

    Teodoro, Douglas; Sundvall, Erik; João Junior, Mario; Ruch, Patrick; Miranda Freire, Sergio

    2018-01-01

    The openEHR specifications are designed to support implementation of flexible and interoperable Electronic Health Record (EHR) systems. Despite the increasing number of solutions based on the openEHR specifications, it is difficult to find publicly available healthcare datasets in the openEHR format that can be used to test, compare and validate different data persistence mechanisms for openEHR. To foster research on openEHR servers, we present the openEHR Benchmark Dataset, ORBDA, a very large healthcare benchmark dataset encoded using the openEHR formalism. To construct ORBDA, we extracted and cleaned a de-identified dataset from the Brazilian National Healthcare System (SUS) containing hospitalisation and high complexity procedures information and formalised it using a set of openEHR archetypes and templates. Then, we implemented a tool to enrich the raw relational data and convert it into the openEHR model using the openEHR Java reference model library. The ORBDA dataset is available in composition, versioned composition and EHR openEHR representations in XML and JSON formats. In total, the dataset contains more than 150 million composition records. We describe the dataset and provide means to access it. Additionally, we demonstrate the usage of ORBDA for evaluating inserting throughput and query latency performances of some NoSQL database management systems. We believe that ORBDA is a valuable asset for assessing storage models for openEHR-based information systems during the software engineering process. It may also be a suitable component in future standardised benchmarking of available openEHR storage platforms.

  15. Evolving hard problems: Generating human genetics datasets with a complex etiology

    Directory of Open Access Journals (Sweden)

    Himmelstein Daniel S

    2011-07-01

    Full Text Available Abstract Background A goal of human genetics is to discover genetic factors that influence individuals' susceptibility to common diseases. Most common diseases are thought to result from the joint failure of two or more interacting components instead of single component failures. This greatly complicates both the task of selecting informative genetic variants and the task of modeling interactions between them. We and others have previously developed algorithms to detect and model the relationships between these genetic factors and disease. Previously these methods have been evaluated with datasets simulated according to pre-defined genetic models. Results Here we develop and evaluate a model free evolution strategy to generate datasets which display a complex relationship between individual genotype and disease susceptibility. We show that this model free approach is capable of generating a diverse array of datasets with distinct gene-disease relationships for an arbitrary interaction order and sample size. We specifically generate eight-hundred Pareto fronts; one for each independent run of our algorithm. In each run the predictiveness of single genetic variation and pairs of genetic variants have been minimized, while the predictiveness of third, fourth, or fifth-order combinations is maximized. Two hundred runs of the algorithm are further dedicated to creating datasets with predictive four or five order interactions and minimized lower-level effects. Conclusions This method and the resulting datasets will allow the capabilities of novel methods to be tested without pre-specified genetic models. This allows researchers to evaluate which methods will succeed on human genetics problems where the model is not known in advance. We further make freely available to the community the entire Pareto-optimal front of datasets from each run so that novel methods may be rigorously evaluated. These 76,600 datasets are available from http://discovery.dartmouth.edu/model_free_data/.

  16. Automatic registration method for multisensor datasets adopted for dimensional measurements on cutting tools

    International Nuclear Information System (INIS)

    Shaw, L; Mehari, F; Weckenmann, A; Ettl, S; Häusler, G

    2013-01-01

    Multisensor systems with optical 3D sensors are frequently employed to capture complete surface information by measuring workpieces from different views. During coarse and fine registration the resulting datasets are afterward transformed into one common coordinate system. Automatic fine registration methods are well established in dimensional metrology, whereas there is a deficit in automatic coarse registration methods. The advantage of a fully automatic registration procedure is twofold: it enables a fast and contact-free alignment and further a flexible application to datasets of any kind of optical 3D sensor. In this paper, an algorithm adapted for a robust automatic coarse registration is presented. The method was originally developed for the field of object reconstruction or localization. It is based on a segmentation of planes in the datasets to calculate the transformation parameters. The rotation is defined by the normals of three corresponding segmented planes of two overlapping datasets, while the translation is calculated via the intersection point of the segmented planes. First results have shown that the translation is strongly shape dependent: 3D data of objects with non-orthogonal planar flanks cannot be registered with the current method. In the novel supplement for the algorithm, the translation is additionally calculated via the distance between centroids of corresponding segmented planes, which results in more than one option for the transformation. A newly introduced measure considering the distance between the datasets after coarse registration evaluates the best possible transformation. Results of the robust automatic registration method are presented on the example of datasets taken from a cutting tool with a fringe-projection system and a focus-variation system. The successful application in dimensional metrology is proven with evaluations of shape parameters based on the registered datasets of a calibrated workpiece. (paper)

  17. ADAPT Dataset

    Data.gov (United States)

    National Aeronautics and Space Administration — Advanced Diagnostics and Prognostics Testbed (ADAPT) Project Lead: Scott Poll Subject Fault diagnosis in electrical power systems Description The Advanced...

  18. LIN Dataset

    Data.gov (United States)

    Department of Homeland Security — This data is captured and helps to coordinate technical and scientific testing in the area of Trade Enforcement, WMD, Intellectual Property Rights, and Narcotics...

  19. Dataset Akkerweb

    NARCIS (Netherlands)

    Baltissen, A.H.M.C.; Molendijk, L.P.G.

    2016-01-01

    This is a zipped file with various files that are distributed to be used in conjunction with a number of crop rotation decision support systems for Dutch arable farming that can be found at www.akkerweb.nl. They come in different formats including Microsoft Excel files, Microsoft Powerpoint fies,

  20. Introducing a Web API for Dataset Submission into a NASA Earth Science Data Center

    Science.gov (United States)

    Moroni, D. F.; Quach, N.; Francis-Curley, W.

    2016-12-01

    As the landscape of data becomes increasingly more diverse in the domain of Earth Science, the challenges of managing and preserving data become more onerous and complex, particularly for data centers on fixed budgets and limited staff. Many solutions already exist to ease the cost burden for the downstream component of the data lifecycle, yet most archive centers are still racing to keep up with the influx of new data that still needs to find a quasi-permanent resting place. For instance, having well-defined metadata that is consistent across the entire data landscape provides for well-managed and preserved datasets throughout the latter end of the data lifecycle. Translators between different metadata dialects are already in operational use, and facilitate keeping older datasets relevant in today's world of rapidly evolving metadata standards. However, very little is done to address the first phase of the lifecycle, which deals with the entry of both data and the corresponding metadata into a system that is traditionally opaque and closed off to external data producers, thus resulting in a significant bottleneck to the dataset submission process. The ATRAC system was the NOAA NCEI's answer to this previously obfuscated barrier to scientists wishing to find a home for their climate data records, providing a web-based entry point to submit timely and accurate metadata and information about a very specific dataset. A couple of NASA's Distributed Active Archive Centers (DAACs) have implemented their own versions of a web-based dataset and metadata submission form including the ASDC and the ORNL DAAC. The Physical Oceanography DAAC is the most recent in the list of NASA-operated DAACs who have begun to offer their own web-based dataset and metadata submission services to data producers. What makes the PO.DAAC dataset and metadata submission service stand out from these pre-existing services is the option of utilizing both a web browser GUI and a RESTful API to

  1. Hamiltonian Algorithm Sound Synthesis

    OpenAIRE

    大矢, 健一

    2013-01-01

    Hamiltonian Algorithm (HA) is an algorithm for searching solutions is optimization problems. This paper introduces a sound synthesis technique using Hamiltonian Algorithm and shows a simple example. "Hamiltonian Algorithm Sound Synthesis" uses phase transition effect in HA. Because of this transition effect, totally new waveforms are produced.

  2. Synthesis of Mechanisms

    DEFF Research Database (Denmark)

    Hansen, John Michael

    1999-01-01

    These notes describe an automated procedure for analysis and synthesis of mechanisms. The analysis method is based on the body coordinate formulation, and the synthesis is based on applying optimization methods, used to minimize the difference between an actual and a desired behaviour...

  3. Synthesis of oligonucleotide phosphorodithioates

    DEFF Research Database (Denmark)

    Beaton, G.; Brill, W. K D; Grandas, A.

    1991-01-01

    The synthesis of DNA containing sulfur at the two nonbonding internucleotide valencies is reported. Several different routes using either tervalent or pentavalent mononucleotide synthons are described.......The synthesis of DNA containing sulfur at the two nonbonding internucleotide valencies is reported. Several different routes using either tervalent or pentavalent mononucleotide synthons are described....

  4. Synthesis of Isoiminosugars

    DEFF Research Database (Denmark)

    Hyldtoft, Lene; Godskesen, Michael Anders; Lundt, Inge

    1998-01-01

    A short synthesis of isoiminosugars have been developed. Bromolactones are diastereoselectively alkylated at C-2 followed by ring closure to the corresponding lactams. Reduction of these then gives isoiminosugars......A short synthesis of isoiminosugars have been developed. Bromolactones are diastereoselectively alkylated at C-2 followed by ring closure to the corresponding lactams. Reduction of these then gives isoiminosugars...

  5. Probabilistic and machine learning-based retrieval approaches for biomedical dataset retrieval

    Science.gov (United States)

    Karisani, Payam; Qin, Zhaohui S; Agichtein, Eugene

    2018-01-01

    Abstract The bioCADDIE dataset retrieval challenge brought together different approaches to retrieval of biomedical datasets relevant to a user’s query, expressed as a text description of a needed dataset. We describe experiments in applying a data-driven, machine learning-based approach to biomedical dataset retrieval as part of this challenge. We report on a series of experiments carried out to evaluate the performance of both probabilistic and machine learning-driven techniques from information retrieval, as applied to this challenge. Our experiments with probabilistic information retrieval methods, such as query term weight optimization, automatic query expansion and simulated user relevance feedback, demonstrate that automatically boosting the weights of important keywords in a verbose query is more effective than other methods. We also show that although there is a rich space of potential representations and features available in this domain, machine learning-based re-ranking models are not able to improve on probabilistic information retrieval techniques with the currently available training data. The models and algorithms presented in this paper can serve as a viable implementation of a search engine to provide access to biomedical datasets. The retrieval performance is expected to be further improved by using additional training data that is created by expert annotation, or gathered through usage logs, clicks and other processes during natural operation of the system. Database URL: https://github.com/emory-irlab/biocaddie

  6. A method for generating large datasets of organ geometries for radiotherapy treatment planning studies

    International Nuclear Information System (INIS)

    Hu, Nan; Cerviño, Laura; Segars, Paul; Lewis, John; Shan, Jinlu; Jiang, Steve; Zheng, Xiaolin; Wang, Ge

    2014-01-01

    With the rapidly increasing application of adaptive radiotherapy, large datasets of organ geometries based on the patient’s anatomy are desired to support clinical application or research work, such as image segmentation, re-planning, and organ deformation analysis. Sometimes only limited datasets are available in clinical practice. In this study, we propose a new method to generate large datasets of organ geometries to be utilized in adaptive radiotherapy. Given a training dataset of organ shapes derived from daily cone-beam CT, we align them into a common coordinate frame and select one of the training surfaces as reference surface. A statistical shape model of organs was constructed, based on the establishment of point correspondence between surfaces and non-uniform rational B-spline (NURBS) representation. A principal component analysis is performed on the sampled surface points to capture the major variation modes of each organ. A set of principal components and their respective coefficients, which represent organ surface deformation, were obtained, and a statistical analysis of the coefficients was performed. New sets of statistically equivalent coefficients can be constructed and assigned to the principal components, resulting in a larger geometry dataset for the patient’s organs. These generated organ geometries are realistic and statistically representative

  7. Anonymising the Sparse Dataset: A New Privacy Preservation Approach while Predicting Diseases

    Directory of Open Access Journals (Sweden)

    V. Shyamala Susan

    2016-09-01

    Full Text Available Data mining techniques analyze the medical dataset with the intention of enhancing patient’s health and privacy. Most of the existing techniques are properly suited for low dimensional medical dataset. The proposed methodology designs a model for the representation of sparse high dimensional medical dataset with the attitude of protecting the patient’s privacy from an adversary and additionally to predict the disease’s threat degree. In a sparse data set many non-zero values are randomly spread in the entire data space. Hence, the challenge is to cluster the correlated patient’s record to predict the risk degree of the disease earlier than they occur in patients and to keep privacy. The first phase converts the sparse dataset right into a band matrix through the Genetic algorithm along with Cuckoo Search (GCS.This groups the correlated patient’s record together and arranges them close to the diagonal. The next segment dissociates the patient’s disease, which is a sensitive value (SA with the parameters that determine the disease normally Quasi Identifier (QI.Finally, density based clustering technique is used over the underlying data to  create anonymized groups to maintain privacy and to predict the risk level of disease. Empirical assessments on actual health care data corresponding to V.A.Medical Centre heart disease dataset reveal the efficiency of this model pertaining to information loss, utility and privacy.

  8. Analysis of Naïve Bayes Algorithm for Email Spam Filtering across Multiple Datasets

    Science.gov (United States)

    Fitriah Rusland, Nurul; Wahid, Norfaradilla; Kasim, Shahreen; Hafit, Hanayanti

    2017-08-01

    E-mail spam continues to become a problem on the Internet. Spammed e-mail may contain many copies of the same message, commercial advertisement or other irrelevant posts like pornographic content. In previous research, different filtering techniques are used to detect these e-mails such as using Random Forest, Naïve Bayesian, Support Vector Machine (SVM) and Neutral Network. In this research, we test Naïve Bayes algorithm for e-mail spam filtering on two datasets and test its performance, i.e., Spam Data and SPAMBASE datasets [8]. The performance of the datasets is evaluated based on their accuracy, recall, precision and F-measure. Our research use WEKA tool for the evaluation of Naïve Bayes algorithm for e-mail spam filtering on both datasets. The result shows that the type of email and the number of instances of the dataset has an influence towards the performance of Naïve Bayes.

  9. Review of ATLAS Open Data 8 TeV datasets, tools and activities

    CERN Document Server

    The ATLAS collaboration

    2018-01-01

    The ATLAS Collaboration has released two 8 TeV datasets and relevant simulated samples to the public for educational use. A number of groups within ATLAS have used these ATLAS Open Data 8 TeV datasets, developing tools and educational material to promote particle physics. The general aim of these activities is to provide simple and user-friendly interactive interfaces to simulate the procedures used by high-energy physics researchers. International Masterclasses introduce particle physics to high school students and have been studying 8 TeV ATLAS Open Data since 2015. Inspired by this success, a new ATLAS Open Data initiative was launched in 2016 for university students. A comprehensive educational platform was thus developed featuring a second 8 TeV dataset and a new set of educational tools. The 8 TeV datasets and associated tools are presented and discussed here, as well as a selection of activities studying the ATLAS Open Data 8 TeV datasets.

  10. A multi-environment dataset for activity of daily living recognition in video streams.

    Science.gov (United States)

    Borreo, Alessandro; Onofri, Leonardo; Soda, Paolo

    2015-08-01

    Public datasets played a key role in the increasing level of interest that vision-based human action recognition has attracted in last years. While the production of such datasets has been influenced by the variability introduced by various actors performing the actions, the different modalities of interactions with the environment introduced by the variation of the scenes around the actors has been scarcely took into account. As a consequence, public datasets do not provide a proper test-bed for recognition algorithms that aim at achieving high accuracy, irrespective of the environment where actions are performed. This is all the more so, when systems are designed to recognize activities of daily living (ADL), which are characterized by a high level of human-environment interaction. For that reason, we present in this manuscript the MEA dataset, a new multi-environment ADL dataset, which permitted us to show how the change of scenario can affect the performances of state-of-the-art approaches for action recognition.

  11. MicroRNA Array Normalization: An Evaluation Using a Randomized Dataset as the Benchmark

    Science.gov (United States)

    Qin, Li-Xuan; Zhou, Qin

    2014-01-01

    MicroRNA arrays possess a number of unique data features that challenge the assumption key to many normalization methods. We assessed the performance of existing normalization methods using two microRNA array datasets derived from the same set of tumor samples: one dataset was generated using a blocked randomization design when assigning arrays to samples and hence was free of confounding array effects; the second dataset was generated without blocking or randomization and exhibited array effects. The randomized dataset was assessed for differential expression between two tumor groups and treated as the benchmark. The non-randomized dataset was assessed for differential expression after normalization and compared against the benchmark. Normalization improved the true positive rate significantly in the non-randomized data but still possessed a false discovery rate as high as 50%. Adding a batch adjustment step before normalization further reduced the number of false positive markers while maintaining a similar number of true positive markers, which resulted in a false discovery rate of 32% to 48%, depending on the specific normalization method. We concluded the paper with some insights on possible causes of false discoveries to shed light on how to improve normalization for microRNA arrays. PMID:24905456

  12. An innovative privacy preserving technique for incremental datasets on cloud computing.

    Science.gov (United States)

    Aldeen, Yousra Abdul Alsahib S; Salleh, Mazleena; Aljeroudi, Yazan

    2016-08-01

    Cloud computing (CC) is a magnificent service-based delivery with gigantic computer processing power and data storage across connected communications channels. It imparted overwhelming technological impetus in the internet (web) mediated IT industry, where users can easily share private data for further analysis and mining. Furthermore, user affable CC services enable to deploy sundry applications economically. Meanwhile, simple data sharing impelled various phishing attacks and malware assisted security threats. Some privacy sensitive applications like health services on cloud that are built with several economic and operational benefits necessitate enhanced security. Thus, absolute cyberspace security and mitigation against phishing blitz became mandatory to protect overall data privacy. Typically, diverse applications datasets are anonymized with better privacy to owners without providing all secrecy requirements to the newly added records. Some proposed techniques emphasized this issue by re-anonymizing the datasets from the scratch. The utmost privacy protection over incremental datasets on CC is far from being achieved. Certainly, the distribution of huge datasets volume across multiple storage nodes limits the privacy preservation. In this view, we propose a new anonymization technique to attain better privacy protection with high data utility over distributed and incremental datasets on CC. The proficiency of data privacy preservation and improved confidentiality requirements is demonstrated through performance evaluation. Copyright © 2016 Elsevier Inc. All rights reserved.

  13. A public dataset of overground and treadmill walking kinematics and kinetics in healthy individuals

    Directory of Open Access Journals (Sweden)

    Claudiane A. Fukuchi

    2018-04-01

    Full Text Available In a typical clinical gait analysis, the gait patterns of pathological individuals are commonly compared with the typically faster, comfortable pace of healthy subjects. However, due to potential bias related to gait speed, this comparison may not be valid. Publicly available gait datasets have failed to address this issue. Therefore, the goal of this study was to present a publicly available dataset of 42 healthy volunteers (24 young adults and 18 older adults who walked both overground and on a treadmill at a range of gait speeds. Their lower-extremity and pelvis kinematics were measured using a three-dimensional (3D motion-capture system. The external forces during both overground and treadmill walking were collected using force plates and an instrumented treadmill, respectively. The results include both raw and processed kinematic and kinetic data in different file formats: c3d and ASCII files. In addition, a metadata file is provided that contain demographic and anthropometric data and data related to each file in the dataset. All data are available at Figshare (DOI: 10.6084/m9.figshare.5722711. We foresee several applications of this public dataset, including to examine the influences of speed, age, and environment (overground vs. treadmill on gait biomechanics, to meet educational needs, and, with the inclusion of additional participants, to use as a normative dataset.

  14. Integrated biofuels process synthesis

    DEFF Research Database (Denmark)

    Torres-Ortega, Carlo Edgar; Rong, Ben-Guang

    2017-01-01

    Second and third generation bioethanol and biodiesel are more environmentally friendly fuels than gasoline and petrodiesel, andmore sustainable than first generation biofuels. However, their production processes are more complex and more expensive. In this chapter, we describe a two-stage synthesis......% used for bioethanol process), and steam and electricity from combustion (54%used as electricity) in the bioethanol and biodiesel processes. In the second stage, we saved about 5% in equipment costs and 12% in utility costs for bioethanol separation. This dual synthesis methodology, consisting of a top......-level screening task followed by a down-level intensification task, proved to be an efficient methodology for integrated biofuel process synthesis. The case study illustrates and provides important insights into the optimal synthesis and intensification of biofuel production processes with the proposed synthesis...

  15. VHDL for logic synthesis

    CERN Document Server

    Rushton, Andrew

    2011-01-01

    Many engineers encountering VHDL (very high speed integrated circuits hardware description language) for the first time can feel overwhelmed by it. This book bridges the gap between the VHDL language and the hardware that results from logic synthesis with clear organisation, progressing from the basics of combinational logic, types, and operators; through special structures such as tristate buses, register banks and memories, to advanced themes such as developing your own packages, writing test benches and using the full range of synthesis types. This third edition has been substantially rewritten to include the new VHDL-2008 features that enable synthesis of fixed-point and floating-point hardware. Extensively updated throughout to reflect modern logic synthesis usage, it also contains a complete case study to demonstrate the updated features. Features to this edition include: * a common VHDL subset which will work across a range of different synthesis systems, targeting a very wide range of technologies...

  16. Immersive Interaction, Manipulation and Analysis of Large 3D Datasets for Planetary and Earth Sciences

    Science.gov (United States)

    Pariser, O.; Calef, F.; Manning, E. M.; Ardulov, V.

    2017-12-01

    We will present implementation and study of several use-cases of utilizing Virtual Reality (VR) for immersive display, interaction and analysis of large and complex 3D datasets. These datasets have been acquired by the instruments across several Earth, Planetary and Solar Space Robotics Missions. First, we will describe the architecture of the common application framework that was developed to input data, interface with VR display devices and program input controllers in various computing environments. Tethered and portable VR technologies will be contrasted and advantages of each highlighted. We'll proceed to presenting experimental immersive analytics visual constructs that enable augmentation of 3D datasets with 2D ones such as images and statistical and abstract data. We will conclude by presenting comparative analysis with traditional visualization applications and share the feedback provided by our users: scientists and engineers.

  17. Common integration sites of published datasets identified using a graph-based framework

    Directory of Open Access Journals (Sweden)

    Alessandro Vasciaveo

    2016-01-01

    Full Text Available With next-generation sequencing, the genomic data available for the characterization of integration sites (IS has dramatically increased. At present, in a single experiment, several thousand viral integration genome targets can be investigated to define genomic hot spots. In a previous article, we renovated a formal CIS analysis based on a rigid fixed window demarcation into a more stretchy definition grounded on graphs. Here, we present a selection of supporting data related to the graph-based framework (GBF from our previous article, in which a collection of common integration sites (CIS was identified on six published datasets. In this work, we will focus on two datasets, ISRTCGD and ISHIV, which have been previously discussed. Moreover, we show in more detail the workflow design that originates the datasets.

  18. Visual Comparison of Multiple Gene Expression Datasets in a Genomic Context

    Directory of Open Access Journals (Sweden)

    Borowski Krzysztof

    2008-06-01

    Full Text Available The need for novel methods of visualizing microarray data is growing. New perspectives are beneficial to finding patterns in expression data. The Bluejay genome browser provides an integrative way of visualizing gene expression datasets in a genomic context. We have now developed the functionality to display multiple microarray datasets simultaneously in Bluejay, in order to provide researchers with a comprehensive view of their datasets linked to a graphical representation of gene function. This will enable biologists to obtain valuable insights on expression patterns, by allowing them to analyze the expression values in relation to the gene locations as well as to compare expression profiles of related genomes or of di erent experiments for the same genome.

  19. Check your biosignals here: a new dataset for off-the-person ECG biometrics.

    Science.gov (United States)

    da Silva, Hugo Plácido; Lourenço, André; Fred, Ana; Raposo, Nuno; Aires-de-Sousa, Marta

    2014-02-01

    The Check Your Biosignals Here initiative (CYBHi) was developed as a way of creating a dataset and consistently repeatable acquisition framework, to further extend research in electrocardiographic (ECG) biometrics. In particular, our work targets the novel trend towards off-the-person data acquisition, which opens a broad new set of challenges and opportunities both for research and industry. While datasets with ECG signals collected using medical grade equipment at the chest can be easily found, for off-the-person ECG data the solution is generally for each team to collect their own corpus at considerable expense of resources. In this paper we describe the context, experimental considerations, methods, and preliminary findings of two public datasets created by our team, one for short-term and another for long-term assessment, with ECG data collected at the hand palms and fingers. Copyright © 2013 Elsevier Ireland Ltd. All rights reserved.

  20. Boundary expansion algorithm of a decision tree induction for an imbalanced dataset

    Directory of Open Access Journals (Sweden)

    Kesinee Boonchuay

    2017-10-01

    Full Text Available A decision tree is one of the famous classifiers based on a recursive partitioning algorithm. This paper introduces the Boundary Expansion Algorithm (BEA to improve a decision tree induction that deals with an imbalanced dataset. BEA utilizes all attributes to define non-splittable ranges. The computed means of all attributes for minority instances are used to find the nearest minority instance, which will be expanded along all attributes to cover a minority region. As a result, BEA can successfully cope with an imbalanced dataset comparing with C4.5, Gini, asymmetric entropy, top-down tree, and Hellinger distance decision tree on 25 imbalanced datasets from the UCI Repository.

  1. Long Term Cloud Property Datasets From MODIS and AVHRR Using the CERES Cloud Algorithm

    Science.gov (United States)

    Minnis, Patrick; Bedka, Kristopher M.; Doelling, David R.; Sun-Mack, Sunny; Yost, Christopher R.; Trepte, Qing Z.; Bedka, Sarah T.; Palikonda, Rabindra; Scarino, Benjamin R.; Chen, Yan; hide

    2015-01-01

    Cloud properties play a critical role in climate change. Monitoring cloud properties over long time periods is needed to detect changes and to validate and constrain models. The Clouds and the Earth's Radiant Energy System (CERES) project has developed several cloud datasets from Aqua and Terra MODIS data to better interpret broadband radiation measurements and improve understanding of the role of clouds in the radiation budget. The algorithms applied to MODIS data have been adapted to utilize various combinations of channels on the Advanced Very High Resolution Radiometer (AVHRR) on the long-term time series of NOAA and MetOp satellites to provide a new cloud climate data record. These datasets can be useful for a variety of studies. This paper presents results of the MODIS and AVHRR analyses covering the period from 1980-2014. Validation and comparisons with other datasets are also given.

  2. Long-term dataset on aquatic responses to concurrent climate change and recovery from acidification

    Science.gov (United States)

    Leach, Taylor H.; Winslow, Luke A.; Acker, Frank W.; Bloomfield, Jay A.; Boylen, Charles W.; Bukaveckas, Paul A.; Charles, Donald F.; Daniels, Robert A.; Driscoll, Charles T.; Eichler, Lawrence W.; Farrell, Jeremy L.; Funk, Clara S.; Goodrich, Christine A.; Michelena, Toby M.; Nierzwicki-Bauer, Sandra A.; Roy, Karen M.; Shaw, William H.; Sutherland, James W.; Swinton, Mark W.; Winkler, David A.; Rose, Kevin C.

    2018-04-01

    Concurrent regional and global environmental changes are affecting freshwater ecosystems. Decadal-scale data on lake ecosystems that can describe processes affected by these changes are important as multiple stressors often interact to alter the trajectory of key ecological phenomena in complex ways. Due to the practical challenges associated with long-term data collections, the majority of existing long-term data sets focus on only a small number of lakes or few response variables. Here we present physical, chemical, and biological data from 28 lakes in the Adirondack Mountains of northern New York State. These data span the period from 1994-2012 and harmonize multiple open and as-yet unpublished data sources. The dataset creation is reproducible and transparent; R code and all original files used to create the dataset are provided in an appendix. This dataset will be useful for examining ecological change in lakes undergoing multiple stressors.

  3. Advanced Neuropsychological Diagnostics Infrastructure (ANDI): A Normative Database Created from Control Datasets.

    Science.gov (United States)

    de Vent, Nathalie R; Agelink van Rentergem, Joost A; Schmand, Ben A; Murre, Jaap M J; Huizenga, Hilde M

    2016-01-01

    In the Advanced Neuropsychological Diagnostics Infrastructure (ANDI), datasets of several research groups are combined into a single database, containing scores on neuropsychological tests from healthy participants. For most popular neuropsychological tests the quantity, and range of these data surpasses that of traditional normative data, thereby enabling more accurate neuropsychological assessment. Because of the unique structure of the database, it facilitates normative comparison methods that were not feasible before, in particular those in which entire profiles of scores are evaluated. In this article, we describe the steps that were necessary to combine the separate datasets into a single database. These steps involve matching variables from multiple datasets, removing outlying values, determining the influence of demographic variables, and finding appropriate transformations to normality. Also, a brief description of the current contents of the ANDI database is given.

  4. Microscopy Image Browser: A Platform for Segmentation and Analysis of Multidimensional Datasets.

    Directory of Open Access Journals (Sweden)

    Ilya Belevich

    2016-01-01

    Full Text Available Understanding the structure-function relationship of cells and organelles in their natural context requires multidimensional imaging. As techniques for multimodal 3-D imaging have become more accessible, effective processing, visualization, and analysis of large datasets are posing a bottleneck for the workflow. Here, we present a new software package for high-performance segmentation and image processing of multidimensional datasets that improves and facilitates the full utilization and quantitative analysis of acquired data, which is freely available from a dedicated website. The open-source environment enables modification and insertion of new plug-ins to customize the program for specific needs. We provide practical examples of program features used for processing, segmentation and analysis of light and electron microscopy datasets, and detailed tutorials to enable users to rapidly and thoroughly learn how to use the program.

  5. Synthetic ALSPAC longitudinal datasets for the Big Data VR project [version 1; referees: 2 approved

    Directory of Open Access Journals (Sweden)

    Demetris Avraam

    2017-08-01

    Full Text Available Three synthetic datasets - of observation size 15,000, 155,000 and 1,555,000 participants, respectively - were created by simulating eleven cardiac and anthropometric variables from nine collection ages of the ALSAPC birth cohort study. The synthetic datasets retain similar data properties to the ALSPAC study data they are simulated from (co-variance matrices, as well as the mean and variance values of the variables without including the original data itself or disclosing participant information.  In this instance, the three synthetic datasets have been utilised in an academia-industry collaboration to build a prototype virtual reality data analysis software, but they could have a broader use in method and software development projects where sensitive data cannot be freely shared.

  6. RE-Europe, a large-scale dataset for modeling a highly renewable European electricity system

    Science.gov (United States)

    Jensen, Tue V.; Pinson, Pierre

    2017-11-01

    Future highly renewable energy systems will couple to complex weather and climate dynamics. This coupling is generally not captured in detail by the open models developed in the power and energy system communities, where such open models exist. To enable modeling such a future energy system, we describe a dedicated large-scale dataset for a renewable electric power system. The dataset combines a transmission network model, as well as information for generation and demand. Generation includes conventional generators with their technical and economic characteristics, as well as weather-driven forecasts and corresponding realizations for renewable energy generation for a period of 3 years. These may be scaled according to the envisioned degrees of renewable penetration in a future European energy system. The spatial coverage, completeness and resolution of this dataset, open the door to the evaluation, scaling analysis and replicability check of a wealth of proposals in, e.g., market design, network actor coordination and forecasting of renewable power generation.

  7. Correction of elevation offsets in multiple co-located lidar datasets

    Science.gov (United States)

    Thompson, David M.; Dalyander, P. Soupy; Long, Joseph W.; Plant, Nathaniel G.

    2017-04-07

    IntroductionTopographic elevation data collected with airborne light detection and ranging (lidar) can be used to analyze short- and long-term changes to beach and dune systems. Analysis of multiple lidar datasets at Dauphin Island, Alabama, revealed systematic, island-wide elevation differences on the order of 10s of centimeters (cm) that were not attributable to real-world change and, therefore, were likely to represent systematic sampling offsets. These offsets vary between the datasets, but appear spatially consistent within a given survey. This report describes a method that was developed to identify and correct offsets between lidar datasets collected over the same site at different times so that true elevation changes over time, associated with sediment accumulation or erosion, can be analyzed.

  8. Advanced Neuropsychological Diagnostics Infrastructure (ANDI: A Normative Database Created from Control Datasets.

    Directory of Open Access Journals (Sweden)

    Nathalie R. de Vent

    2016-10-01

    Full Text Available In the Advanced Neuropsychological Diagnostics Infrastructure (ANDI, datasets of several research groups are combined into a single database, containing scores on neuropsychological tests from healthy participants. For most popular neuropsychological tests the quantity and range of these data surpasses that of traditional normative data, thereby enabling more accurate neuropsychological assessment. Because of the unique structure of the database, it facilitates normative comparison methods that were not feasible before, in particular those in which entire profiles of scores are evaluated. In this article, we describe the steps that were necessary to combine the separate datasets into a single database. These steps involve matching variables from multiple datasets, removing outlying values, determining the influence of demographic variables, and finding appropriate transformations to normality. Also, a brief description of the current contents of the ANDI database is given.

  9. Supervised Variational Relevance Learning, An Analytic Geometric Feature Selection with Applications to Omic Datasets.

    Science.gov (United States)

    Boareto, Marcelo; Cesar, Jonatas; Leite, Vitor B P; Caticha, Nestor

    2015-01-01

    We introduce Supervised Variational Relevance Learning (Suvrel), a variational method to determine metric tensors to define distance based similarity in pattern classification, inspired in relevance learning. The variational method is applied to a cost function that penalizes large intraclass distances and favors small interclass distances. We find analytically the metric tensor that minimizes the cost function. Preprocessing the patterns by doing linear transformations using the metric tensor yields a dataset which can be more efficiently classified. We test our methods using publicly available datasets, for some standard classifiers. Among these datasets, two were tested by the MAQC-II project and, even without the use of further preprocessing, our results improve on their performance.

  10. Dataset on information strategies for energy conservation: A field experiment in India.

    Science.gov (United States)

    Chen, Victor L; Delmas, Magali A; Locke, Stephen L; Singh, Amarjeet

    2018-02-01

    The data presented in this article are related to the research article entitled: "Information strategies for energy conservation: a field experiment in India" (Chen et al., 2017) [1]. The availability of high-resolution electricity data offers benefits to both utilities and consumers to understand the dynamics of energy consumption for example, between billing periods or times of peak demand. However, few public datasets with high-temporal resolution have been available to researchers on electricity use, especially at the appliance-level. This article describes data collected in a residential field experiment for 19 apartments at an Indian faculty housing complex during the period from August 1, 2013 to May 12, 2014. The dataset includes detailed information about electricity consumption. It also includes information on apartment characteristics and hourly weather variation to enable further studies of energy performance. These data can be used by researchers as training datasets to evaluate electricity usage consumption.

  11. RE-Europe, a large-scale dataset for modeling a highly renewable European electricity system.

    Science.gov (United States)

    Jensen, Tue V; Pinson, Pierre

    2017-11-28

    Future highly renewable energy systems will couple to complex weather and climate dynamics. This coupling is generally not captured in detail by the open models developed in the power and energy system communities, where such open models exist. To enable modeling such a future energy system, we describe a dedicated large-scale dataset for a renewable electric power system. The dataset combines a transmission network model, as well as information for generation and demand. Generation includes conventional generators with their technical and economic characteristics, as well as weather-driven forecasts and corresponding realizations for renewable energy generation for a period of 3 years. These may be scaled according to the envisioned degrees of renewable penetration in a future European energy system. The spatial coverage, completeness and resolution of this dataset, open the door to the evaluation, scaling analysis and replicability check of a wealth of proposals in, e.g., market design, network actor coordination and forecasting of renewable power generation.

  12. Dataset Preservation for the Long Term: Results of the DareLux Project

    Directory of Open Access Journals (Sweden)

    Eugène Dürr

    2008-08-01

    Full Text Available The purpose of the DareLux (Data Archiving River Environment Luxembourg Project was the preservation of unique and irreplaceable datasets, for which we chose hydrology data that will be required to be used in future climatic models. The results are: an operational archive built with XML containers, the OAI-PMH protocol and an architecture based upon web services. Major conclusions are: quality control on ingest is important; digital rights management demands attention; and cost aspects of ingest and retrieval cannot be underestimated. We propose a new paradigm for information retrieval of this type of dataset. We recommend research into visualisation tools for the search and retrieval of this type of dataset.

  13. A first dataset toward a standardized community-driven global mapping of the human immunopeptidome

    Directory of Open Access Journals (Sweden)

    Pouya Faridi

    2016-06-01

    Full Text Available We present the first standardized HLA peptidomics dataset generated by the immunopeptidomics community. The dataset is composed of native HLA class I peptides as well as synthetic HLA class II peptides that were acquired in data-dependent acquisition mode using multiple types of mass spectrometers. All laboratories used the spiked-in landmark iRT peptides for retention time normalization and data analysis. The mass spectrometric data were deposited to the ProteomeXchange Consortium via the PRIDE partner repository with the dataset identifier http://www.ebi.ac.uk/pride/archive/projects/PXD001872. The generated data were used to build HLA allele-specific peptide spectral and assay libraries, which were stored in the SWATHAtlas database. Data presented here are described in more detail in the original eLife article entitled ‘An open-source computational and data resource to analyze digital maps of immunopeptidomes’.

  14. The Problem with Big Data: Operating on Smaller Datasets to Bridge the Implementation Gap.

    Science.gov (United States)

    Mann, Richard P; Mushtaq, Faisal; White, Alan D; Mata-Cervantes, Gabriel; Pike, Tom; Coker, Dalton; Murdoch, Stuart; Hiles, Tim; Smith, Clare; Berridge, David; Hinchliffe, Suzanne; Hall, Geoff; Smye, Stephen; Wilkie, Richard M; Lodge, J Peter A; Mon-Williams, Mark

    2016-01-01

    Big datasets have the potential to revolutionize public health. However, there is a mismatch between the political and scientific optimism surrounding big data and the public's perception of its benefit. We suggest a systematic and concerted emphasis on developing models derived from smaller datasets to illustrate to the public how big data can produce tangible benefits in the long term. In order to highlight the immediate value of a small data approach, we produced a proof-of-concept model predicting hospital length of stay. The results demonstrate that existing small datasets can be used to create models that generate a reasonable prediction, facilitating health-care delivery. We propose that greater attention (and funding) needs to be directed toward the utilization of existing information resources in parallel with current efforts to create and exploit "big data."

  15. fCCAC: functional canonical correlation analysis to evaluate covariance between nucleic acid sequencing datasets.

    Science.gov (United States)

    Madrigal, Pedro

    2017-03-01

    Computational evaluation of variability across DNA or RNA sequencing datasets is a crucial step in genomic science, as it allows both to evaluate reproducibility of biological or technical replicates, and to compare different datasets to identify their potential correlations. Here we present fCCAC, an application of functional canonical correlation analysis to assess covariance of nucleic acid sequencing datasets such as chromatin immunoprecipitation followed by deep sequencing (ChIP-seq). We show how this method differs from other measures of correlation, and exemplify how it can reveal shared covariance between histone modifications and DNA binding proteins, such as the relationship between the H3K4me3 chromatin mark and its epigenetic writers and readers. An R/Bioconductor package is available at http://bioconductor.org/packages/fCCAC/ . pmb59@cam.ac.uk. Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press.

  16. Full-Scale Approximations of Spatio-Temporal Covariance Models for Large Datasets

    KAUST Repository

    Zhang, Bohai

    2014-01-01

    Various continuously-indexed spatio-temporal process models have been constructed to characterize spatio-temporal dependence structures, but the computational complexity for model fitting and predictions grows in a cubic order with the size of dataset and application of such models is not feasible for large datasets. This article extends the full-scale approximation (FSA) approach by Sang and Huang (2012) to the spatio-temporal context to reduce computational complexity. A reversible jump Markov chain Monte Carlo (RJMCMC) algorithm is proposed to select knots automatically from a discrete set of spatio-temporal points. Our approach is applicable to nonseparable and nonstationary spatio-temporal covariance models. We illustrate the effectiveness of our method through simulation experiments and application to an ozone measurement dataset.

  17. Multimedia Content Development as a Facial Expression Datasets for Recognition of Human Emotions

    Science.gov (United States)

    Mamonto, N. E.; Maulana, H.; Liliana, D. Y.; Basaruddin, T.

    2018-02-01

    Datasets that have been developed before contain facial expression from foreign people. The development of multimedia content aims to answer the problems experienced by the research team and other researchers who will conduct similar research. The method used in the development of multimedia content as facial expression datasets for human emotion recognition is the Villamil-Molina version of the multimedia development method. Multimedia content developed with 10 subjects or talents with each talent performing 3 shots with each capturing talent having to demonstrate 19 facial expressions. After the process of editing and rendering, tests are carried out with the conclusion that the multimedia content can be used as a facial expression dataset for recognition of human emotions.

  18. Valuation of large variable annuity portfolios: Monte Carlo simulation and synthetic datasets

    Directory of Open Access Journals (Sweden)

    Gan Guojun

    2017-12-01

    Full Text Available Metamodeling techniques have recently been proposed to address the computational issues related to the valuation of large portfolios of variable annuity contracts. However, it is extremely diffcult, if not impossible, for researchers to obtain real datasets frominsurance companies in order to test their metamodeling techniques on such real datasets and publish the results in academic journals. To facilitate the development and dissemination of research related to the effcient valuation of large variable annuity portfolios, this paper creates a large synthetic portfolio of variable annuity contracts based on the properties of real portfolios of variable annuities and implements a simple Monte Carlo simulation engine for valuing the synthetic portfolio. In addition, this paper presents fair market values and Greeks for the synthetic portfolio of variable annuity contracts that are important quantities for managing the financial risks associated with variable annuities. The resulting datasets can be used by researchers to test and compare the performance of various metamodeling techniques.

  19. Genome-wide gene expression dataset used to identify potential therapeutic targets in androgenetic alopecia

    Directory of Open Access Journals (Sweden)

    R. Dey-Rao

    2017-08-01

    Full Text Available The microarray dataset attached to this report is related to the research article with the title: “A genomic approach to susceptibility and pathogenesis leads to identifying potential novel therapeutic targets in androgenetic alopecia” (Dey-Rao and Sinha, 2017 [1]. Male-pattern hair loss that is induced by androgens (testosterone in genetically predisposed individuals is known as androgenetic alopecia (AGA. The raw dataset is being made publicly available to enable critical and/or extended analyses. Our related research paper utilizes the attached raw dataset, for genome-wide gene-expression associated investigations. Combined with several in silico bioinformatics-based analyses we were able to delineate five strategic molecular elements as potential novel targets towards future AGA-therapy.

  20. Handling limited datasets with neural networks in medical applications: A small-data approach.

    Science.gov (United States)

    Shaikhina, Torgyn; Khovanova, Natalia A

    2017-01-01

    Single-centre studies in medical domain are often characterised by limited samples due to the complexity and high costs of patient data collection. Machine learning methods for regression modelling of small datasets (less than 10 observations per predictor variable) remain scarce. Our work bridges this gap by developing a novel framework for application of artificial neural networks (NNs) for regression tasks involving small medical datasets. In order to address the sporadic fluctuations and validation issues that appear in regression NNs trained on small datasets, the method of multiple runs and surrogate data analysis were proposed in this work. The approach was compared to the state-of-the-art ensemble NNs; the effect of dataset size on NN performance was also investigated. The proposed framework was applied for the prediction of compressive strength (CS) of femoral trabecular bone in patients suffering from severe osteoarthritis. The NN model was able to estimate the CS of osteoarthritic trabecular bone from its structural and biological properties with a standard error of 0.85MPa. When evaluated on independent test samples, the NN achieved accuracy of 98.3%, outperforming an ensemble NN model by 11%. We reproduce this result on CS data of another porous solid (concrete) and demonstrate that the proposed framework allows for an NN modelled with as few as 56 samples to generalise on 300 independent test samples with 86.5% accuracy, which is comparable to the performance of an NN developed with 18 times larger dataset (1030 samples). The significance of this work is two-fold: the practical application allows for non-destructive prediction of bone fracture risk, while the novel methodology extends beyond the task considered in this study and provides a general framework for application of regression NNs to medical problems characterised by limited dataset sizes. Copyright © 2017 The Authors. Published by Elsevier B.V. All rights reserved.

  1. Annotating spatio-temporal datasets for meaningful analysis in the Web

    Science.gov (United States)

    Stasch, Christoph; Pebesma, Edzer; Scheider, Simon

    2014-05-01

    More and more environmental datasets that vary in space and time are available in the Web. This comes along with an advantage of using the data for other purposes than originally foreseen, but also with the danger that users may apply inappropriate analysis procedures due to lack of important assumptions made during the data collection process. In order to guide towards a meaningful (statistical) analysis of spatio-temporal datasets available in the Web, we have developed a Higher-Order-Logic formalism that captures some relevant assumptions in our previous work [1]. It allows to proof on meaningful spatial prediction and aggregation in a semi-automated fashion. In this poster presentation, we will present a concept for annotating spatio-temporal datasets available in the Web with concepts defined in our formalism. Therefore, we have defined a subset of the formalism as a Web Ontology Language (OWL) pattern. It allows capturing the distinction between the different spatio-temporal variable types, i.e. point patterns, fields, lattices and trajectories, that in turn determine whether a particular dataset can be interpolated or aggregated in a meaningful way using a certain procedure. The actual annotations that link spatio-temporal datasets with the concepts in the ontology pattern are provided as Linked Data. In order to allow data producers to add the annotations to their datasets, we have implemented a Web portal that uses a triple store at the backend to store the annotations and to make them available in the Linked Data cloud. Furthermore, we have implemented functions in the statistical environment R to retrieve the RDF annotations and, based on these annotations, to support a stronger typing of spatio-temporal datatypes guiding towards a meaningful analysis in R. [1] Stasch, C., Scheider, S., Pebesma, E., Kuhn, W. (2014): "Meaningful spatial prediction and aggregation", Environmental Modelling & Software, 51, 149-165.

  2. Comparing the accuracy of food outlet datasets in an urban environment

    Directory of Open Access Journals (Sweden)

    Michelle S. Wong

    2017-05-01

    Full Text Available Studies that investigate the relationship between the retail food environment and health outcomes often use geospatial datasets. Prior studies have identified challenges of using the most common data sources. Retail food environment datasets created through academic-government partnership present an alternative, but their validity (retail existence, type, location has not been assessed yet. In our study, we used ground-truth data to compare the validity of two datasets, a 2015 commercial dataset (InfoUSA and data collected from 2012 to 2014 through the Maryland Food Systems Mapping Project (MFSMP, an academic-government partnership, on the retail food environment in two low-income, inner city neighbourhoods in Baltimore City. We compared sensitivity and positive predictive value (PPV of the commercial and academic-government partnership data to ground-truth data for two broad categories of unhealthy food retailers: small food retailers and quick-service restaurants. Ground-truth data was collected in 2015 and analysed in 2016. Compared to the ground-truth data, MFSMP and InfoUSA generally had similar sensitivity that was greater than 85%. MFSMP had higher PPV compared to InfoUSA for both small food retailers (MFSMP: 56.3% vs InfoUSA: 40.7% and quick-service restaurants (MFSMP: 58.6% vs InfoUSA: 36.4%. We conclude that data from academic-government partnerships like MFSMP might be an attractive alternative option and improvement to relying only on commercial data. Other research institutes or cities might consider efforts to create and maintain such an environmental dataset. Even if these datasets cannot be updated on an annual basis, they are likely more accurate than commercial data.

  3. Comparing the accuracy of food outlet datasets in an urban environment.

    Science.gov (United States)

    Wong, Michelle S; Peyton, Jennifer M; Shields, Timothy M; Curriero, Frank C; Gudzune, Kimberly A

    2017-05-11

    Studies that investigate the relationship between the retail food environment and health outcomes often use geospatial datasets. Prior studies have identified challenges of using the most common data sources. Retail food environment datasets created through academic-government partnership present an alternative, but their validity (retail existence, type, location) has not been assessed yet. In our study, we used ground-truth data to compare the validity of two datasets, a 2015 commercial dataset (InfoUSA) and data collected from 2012 to 2014 through the Maryland Food Systems Mapping Project (MFSMP), an academic-government partnership, on the retail food environment in two low-income, inner city neighbourhoods in Baltimore City. We compared sensitivity and positive predictive value (PPV) of the commercial and academic-government partnership data to ground-truth data for two broad categories of unhealthy food retailers: small food retailers and quick-service restaurants. Ground-truth data was collected in 2015 and analysed in 2016. Compared to the ground-truth data, MFSMP and InfoUSA generally had similar sensitivity that was greater than 85%. MFSMP had higher PPV compared to InfoUSA for both small food retailers (MFSMP: 56.3% vs InfoUSA: 40.7%) and quick-service restaurants (MFSMP: 58.6% vs InfoUSA: 36.4%). We conclude that data from academic-government partnerships like MFSMP might be an attractive alternative option and improvement to relying only on commercial data. Other research institutes or cities might consider efforts to create and maintain such an environmental dataset. Even if these datasets cannot be updated on an annual basis, they are likely more accurate than commercial data.

  4. The impact of the resolution of meteorological datasets on catchment-scale drought studies

    Science.gov (United States)

    Hellwig, Jost; Stahl, Kerstin

    2017-04-01

    Gridded meteorological datasets provide the basis to study drought at a range of scales, including catchment scale drought studies in hydrology. They are readily available to study past weather conditions and often serve real time monitoring as well. As these datasets differ in spatial/temporal coverage and spatial/temporal resolution, for most studies there is a tradeoff between these features. Our investigation examines whether biases occur when studying drought on catchment scale with low resolution input data. For that, a comparison among the datasets HYRAS (covering Central Europe, 1x1 km grid, daily data, 1951 - 2005), E-OBS (Europe, 0.25° grid, daily data, 1950-2015) and GPCC (whole world, 0.5° grid, monthly data, 1901 - 2013) is carried out. Generally, biases in precipitation increase with decreasing resolution. Most important variations are found during summer. In low mountain range of Central Europe the datasets of sparse resolution (E-OBS, GPCC) overestimate dry days and underestimate total precipitation since they are not able to describe high spatial variability. However, relative measures like the correlation coefficient reveal good consistencies of dry and wet periods, both for absolute precipitation values and standardized indices like the Standardized Precipitation Index (SPI) or Standardized Precipitation Evaporation Index (SPEI). Particularly the most severe droughts derived from the different datasets match very well. These results indicate that absolute values of sparse resolution datasets applied to catchment scale might be critical to use for an assessment of the hydrological drought at catchment scale, whereas relative measures for determining periods of drought are more trustworthy. Therefore, studies on drought, that downscale meteorological data, should carefully consider their data needs and focus on relative measures for dry periods if sufficient for the task.

  5. Validity and reliability of stillbirth data using linked self-reported and administrative datasets.

    Science.gov (United States)

    Hure, Alexis J; Chojenta, Catherine L; Powers, Jennifer R; Byles, Julie E; Loxton, Deborah

    2015-01-01

    A high rate of stillbirth was previously observed in the Australian Longitudinal Study of Women's Health (ALSWH). Our primary objective was to test the validity and reliability of self-reported stillbirth data linked to state-based administrative datasets. Self-reported data, collected as part of the ALSWH cohort born in 1973-1978, were linked to three administrative datasets for women in New South Wales, Australia (n = 4374): the Midwives Data Collection; Admitted Patient Data Collection; and Perinatal Death Review Database. Linkages were obtained from the Centre for Health Record Linkage for the period 1996-2009. True cases of stillbirth were defined by being consistently recorded in two or more independent data sources. Sensitivity, specificity, positive predictive value, negative predictive value, percent agreement, and kappa statistics were calculated for each dataset. Forty-nine women reported 53 stillbirths. No dataset was 100% accurate. The administrative datasets performed better than self-reported data, with high accuracy and agreement. Self-reported data showed high sensitivity (100%) but low specificity (30%), meaning women who had a stillbirth always reported it, but there was also over-reporting of stillbirths. About half of the misreported cases in the ALSWH were able to be removed by identifying inconsistencies in longitudinal data. Data linkage provides great opportunity to assess the validity and reliability of self-reported study data. Conversely, self-reported study data can help to resolve inconsistencies in administrative datasets. Quantifying the strengths and limitations of both self-reported and administrative data can improve epidemiological research, especially by guiding methods and interpretation of findings.

  6. Datasets collected in general practice: an international comparison using the example of obesity.

    Science.gov (United States)

    Sturgiss, Elizabeth; van Boven, Kees

    2018-06-04

    International datasets from general practice enable the comparison of how conditions are managed within consultations in different primary healthcare settings. The Australian Bettering the Evaluation and Care of Health (BEACH) and TransHIS from the Netherlands collect in-consultation general practice data that have been used extensively to inform local policy and practice. Obesity is a global health issue with different countries applying varying approaches to management. The objective of the present paper is to compare the primary care management of obesity in Australia and the Netherlands using data collected from consultations. Despite the different prevalence in obesity in the two countries, the number of patients per 1000 patient-years seen with obesity is similar. Patients in Australia with obesity are referred to allied health practitioners more often than Dutch patients. Without quality general practice data, primary care researchers will not have data about the management of conditions within consultations. We use obesity to highlight the strengths of these general practice data sources and to compare their differences. What is known about the topic? Australia had one of the longest-running consecutive datasets about general practice activity in the world, but it has recently lost government funding. The Netherlands has a longitudinal general practice dataset of information collected within consultations since 1985. What does this paper add? We discuss the benefits of general practice-collected data in two countries. Using obesity as a case example, we compare management in general practice between Australia and the Netherlands. This type of analysis should start all international collaborations of primary care management of any health condition. Having a national general practice dataset allows international comparisons of the management of conditions with primary care. Without a current, quality general practice dataset, primary care researchers will not

  7. Associating uncertainty with datasets using Linked Data and allowing propagation via provenance chains

    Science.gov (United States)

    Car, Nicholas; Cox, Simon; Fitch, Peter

    2015-04-01

    With earth-science datasets increasingly being published to enable re-use in projects disassociated from the original data acquisition or generation, there is an urgent need for associated metadata to be connected, in order to guide their application. In particular, provenance traces should support the evaluation of data quality and reliability. However, while standards for describing provenance are emerging (e.g. PROV-O), these do not include the necessary statistical descriptors and confidence assessments. UncertML has a mature conceptual model that may be used to record uncertainty metadata. However, by itself UncertML does not support the representation of uncertainty of multi-part datasets, and provides no direct way of associating the uncertainty information - metadata in relation to a dataset - with dataset objects.We present a method to address both these issues by combining UncertML with PROV-O, and delivering resulting uncertainty-enriched provenance traces through the Linked Data API. UncertProv extends the PROV-O provenance ontology with an RDF formulation of the UncertML conceptual model elements, adds further elements to support uncertainty representation without a conceptual model and the integration of UncertML through links to documents. The Linked ID API provides a systematic way of navigating from dataset objects to their UncertProv metadata and back again. The Linked Data API's 'views' capability enables access to UncertML and non-UncertML uncertainty metadata representations for a dataset. With this approach, it is possible to access and navigate the uncertainty metadata associated with a published dataset using standard semantic web tools, such as SPARQL queries. Where the uncertainty data follows the UncertML model it can be automatically interpreted and may also support automatic uncertainty propagation . Repositories wishing to enable uncertainty propagation for all datasets must ensure that all elements that are associated with uncertainty

  8. Megastudies, crowdsourcing, and large datasets in psycholinguistics: An overview of recent developments.

    Science.gov (United States)

    Keuleers, Emmanuel; Balota, David A

    2015-01-01

    This paper introduces and summarizes the special issue on megastudies, crowdsourcing, and large datasets in psycholinguistics. We provide a brief historical overview and show how the papers in this issue have extended the field by compiling new databases and making important theoretical contributions. In addition, we discuss several studies that use text corpora to build distributional semantic models to tackle various interesting problems in psycholinguistics. Finally, as is the case across the papers, we highlight some methodological issues that are brought forth via the analyses of such datasets.

  9. Multilayered complex network datasets for three supply chain network archetypes on an urban road grid.

    Science.gov (United States)

    Viljoen, Nadia M; Joubert, Johan W

    2018-02-01

    This article presents the multilayered complex network formulation for three different supply chain network archetypes on an urban road grid and describes how 500 instances were randomly generated for each archetype. Both the supply chain network layer and the urban road network layer are directed unweighted networks. The shortest path set is calculated for each of the 1 500 experimental instances. The datasets are used to empirically explore the impact that the supply chain's dependence on the transport network has on its vulnerability in Viljoen and Joubert (2017) [1]. The datasets are publicly available on Mendeley (Joubert and Viljoen, 2017) [2].

  10. NGO Presence and Activity in Afghanistan, 2000–2014: A Provincial-Level Dataset

    Directory of Open Access Journals (Sweden)

    David F. Mitchell

    2017-06-01

    Full Text Available This article introduces a new provincial-level dataset on non-governmental organizations (NGOs in Afghanistan. The data—which are freely available for download—provide information on the locations and sectors of activity of 891 international and local (Afghan NGOs that operated in the country between 2000 and 2014. A summary and visualization of the data is presented in the article following a brief historical overview of NGOs in Afghanistan. Links to download the full dataset are provided in the conclusion.

  11. Preconditioned dynamic mode decomposition and mode selection algorithms for large datasets using incremental proper orthogonal decomposition

    Science.gov (United States)

    Ohmichi, Yuya

    2017-07-01

    In this letter, we propose a simple and efficient framework of dynamic mode decomposition (DMD) and mode selection for large datasets. The proposed framework explicitly introduces a preconditioning step using an incremental proper orthogonal decomposition (POD) to DMD and mode selection algorithms. By performing the preconditioning step, the DMD and mode selection can be performed with low memory consumption and therefore can be applied to large datasets. Additionally, we propose a simple mode selection algorithm based on a greedy method. The proposed framework is applied to the analysis of three-dimensional flow around a circular cylinder.

  12. The Planetary Science Archive (PSA): Exploration and discovery of scientific datasets from ESA's planetary missions

    Science.gov (United States)

    Vallat, C.; Besse, S.; Barbarisi, I.; Arviset, C.; De Marchi, G.; Barthelemy, M.; Coia, D.; Costa, M.; Docasal, R.; Fraga, D.; Heather, D. J.; Lim, T.; Macfarlane, A.; Martinez, S.; Rios, C.; Vallejo, F.; Said, J.

    2017-09-01

    The Planetary Science Archive (PSA) is the European Space Agency's (ESA) repository of science data from all planetary science and exploration missions. The PSA provides access to scientific datasets through various interfaces at http://psa.esa.int. All datasets are scientifically peer-reviewed by independent scientists, and are compliant with the Planetary Data System (PDS) standards. The PSA has started to implement a number of significant improvements, mostly driven by the evolution of the PDS standards, and the growing need for better interfaces and advanced applications to support science exploitation.

  13. Error characterisation of global active and passive microwave soil moisture datasets

    Directory of Open Access Journals (Sweden)

    W. A. Dorigo

    2010-12-01

    Full Text Available Understanding the error structures of remotely sensed soil moisture observations is essential for correctly interpreting observed variations and trends in the data or assimilating them in hydrological or numerical weather prediction models. Nevertheless, a spatially coherent assessment of the quality of the various globally available datasets is often hampered by the limited availability over space and time of reliable in-situ measurements. As an alternative, this study explores the triple collocation error estimation technique for assessing the relative quality of several globally available soil moisture products from active (ASCAT and passive (AMSR-E and SSM/I microwave sensors. The triple collocation is a powerful statistical tool to estimate the root mean square error while simultaneously solving for systematic differences in the climatologies of a set of three linearly related data sources with independent error structures. Prerequisite for this technique is the availability of a sufficiently large number of timely corresponding observations. In addition to the active and passive satellite-based datasets, we used the ERA-Interim and GLDAS-NOAH reanalysis soil moisture datasets as a third, independent reference. The prime objective is to reveal trends in uncertainty related to different observation principles (passive versus active, the use of different frequencies (C-, X-, and Ku-band for passive microwave observations, and the choice of the independent reference dataset (ERA-Interim versus GLDAS-NOAH. The results suggest that the triple collocation method provides realistic error estimates. Observed spatial trends agree well with the existing theory and studies on the performance of different observation principles and frequencies with respect to land cover and vegetation density. In addition, if all theoretical prerequisites are fulfilled (e.g. a sufficiently large number of common observations is available and errors of the different

  14. Phylogenetic factorization of compositional data yields lineage-level associations in microbiome datasets.

    Science.gov (United States)

    Washburne, Alex D; Silverman, Justin D; Leff, Jonathan W; Bennett, Dominic J; Darcy, John L; Mukherjee, Sayan; Fierer, Noah; David, Lawrence A

    2017-01-01

    Marker gene sequencing of microbial communities has generated big datasets of microbial relative abundances varying across environmental conditions, sample sites and treatments. These data often come with putative phylogenies, providing unique opportunities to investigate how shared evolutionary history affects microbial abundance patterns. Here, we present a method to identify the phylogenetic factors driving patterns in microbial community composition. We use the method, "phylofactorization," to re-analyze datasets from the human body and soil microbial communities, demonstrating how phylofactorization is a dimensionality-reducing tool, an ordination-visualization tool, and an inferential tool for identifying edges in the phylogeny along which putative functional ecological traits may have arisen.

  15. Creation of the Naturalistic Engagement in Secondary Tasks (NEST) distracted driving dataset.

    Science.gov (United States)

    Owens, Justin M; Angell, Linda; Hankey, Jonathan M; Foley, James; Ebe, Kazutoshi

    2015-09-01

    Distracted driving has become a topic of critical importance to driving safety research over the past several decades. Naturalistic driving data offer a unique opportunity to study how drivers engage with secondary tasks in real-world driving; however, the complexities involved with identifying and coding relevant epochs of naturalistic data have limited its accessibility to the general research community. This project was developed to help address this problem by creating an accessible dataset of driver behavior and situational factors observed during distraction-related safety-critical events and baseline driving epochs, using the Strategic Highway Research Program 2 (SHRP2) naturalistic dataset. The new NEST (Naturalistic Engagement in Secondary Tasks) dataset was created using crashes and near-crashes from the SHRP2 dataset that were identified as including secondary task engagement as a potential contributing factor. Data coding included frame-by-frame video analysis of secondary task and hands-on-wheel activity, as well as summary event information. In addition, information about each secondary task engagement within the trip prior to the crash/near-crash was coded at a higher level. Data were also coded for four baseline epochs and trips per safety-critical event. 1,180 events and baseline epochs were coded, and a dataset was constructed. The project team is currently working to determine the most useful way to allow broad public access to the dataset. We anticipate that the NEST dataset will be extraordinarily useful in allowing qualified researchers access to timely, real-world data concerning how drivers interact with secondary tasks during safety-critical events and baseline driving. The coded dataset developed for this project will allow future researchers to have access to detailed data on driver secondary task engagement in the real world. It will be useful for standalone research, as well as for integration with additional SHRP2 data to enable the

  16. Elsevier’s approach to the bioCADDIE 2016 Dataset Retrieval Challenge

    Science.gov (United States)

    Scerri, Antony; Kuriakose, John; Deshmane, Amit Ajit; Stanger, Mark; Moore, Rebekah; Naik, Raj; de Waard, Anita

    2017-01-01

    Abstract We developed a two-stream, Apache Solr-based information retrieval system in response to the bioCADDIE 2016 Dataset Retrieval Challenge. One stream was based on the principle of word embeddings, the other was rooted in ontology based indexing. Despite encountering several issues in the data, the evaluation procedure and the technologies used, the system performed quite well. We provide some pointers towards future work: in particular, we suggest that more work in query expansion could benefit future biomedical search engines. Database URL: https://data.mendeley.com/datasets/zd9dxpyybg/1 PMID:29220454

  17. Dataset of anomalies and malicious acts in a cyber-physical subsystem.

    Science.gov (United States)

    Laso, Pedro Merino; Brosset, David; Puentes, John

    2017-10-01

    This article presents a dataset produced to investigate how data and information quality estimations enable to detect aNomalies and malicious acts in cyber-physical systems. Data were acquired making use of a cyber-physical subsystem consisting of liquid containers for fuel or water, along with its automated control and data acquisition infrastructure. Described data consist of temporal series representing five operational scenarios - Normal, aNomalies, breakdown, sabotages, and cyber-attacks - corresponding to 15 different real situations. The dataset is publicly available in the .zip file published with the article, to investigate and compare faulty operation detection and characterization methods for cyber-physical systems.

  18. Datasets of mung bean proteins and metabolites from four different cultivars

    Directory of Open Access Journals (Sweden)

    Akiko Hashiguchi

    2017-08-01

    Full Text Available Plants produce a wide array of nutrients that exert synergistic interaction among whole combinations of nutrients. Therefore comprehensive nutrient profiling is required to evaluate their nutritional/nutraceutical value and health promoting effect. In order to obtain such datasets for mung bean, which is known as a medicinal plant with heat alleviating effect, proteomic and metabolomic analyses were performed using four cultivars from China, Thailand, and Myanmar. In total, 449 proteins and 210 metabolic compounds were identified in seed coat; whereas 480 proteins and 217 metabolic compounds were detected in seed flesh, establishing the first comprehensive dataset of mung bean for nutraceutical evaluation.

  19. The effects of spatial population dataset choice on estimates of population at risk of disease

    Directory of Open Access Journals (Sweden)

    Gething Peter W

    2011-02-01

    Full Text Available Abstract Background The spatial modeling of infectious disease distributions and dynamics is increasingly being undertaken for health services planning and disease control monitoring, implementation, and evaluation. Where risks are heterogeneous in space or dependent on person-to-person transmission, spatial data on human population distributions are required to estimate infectious disease risks, burdens, and dynamics. Several different modeled human population distribution datasets are available and widely used, but the disparities among them and the implications for enumerating disease burdens and populations at risk have not been considered systematically. Here, we quantify some of these effects using global estimates of populations at risk (PAR of P. falciparum malaria as an example. Methods The recent construction of a global map of P. falciparum malaria endemicity enabled the testing of different gridded population datasets for providing estimates of PAR by endemicity class. The estimated population numbers within each class were calculated for each country using four different global gridded human population datasets: GRUMP (~1 km spatial resolution, LandScan (~1 km, UNEP Global Population Databases (~5 km, and GPW3 (~5 km. More detailed assessments of PAR variation and accuracy were conducted for three African countries where census data were available at a higher administrative-unit level than used by any of the four gridded population datasets. Results The estimates of PAR based on the datasets varied by more than 10 million people for some countries, even accounting for the fact that estimates of population totals made by different agencies are used to correct national totals in these datasets and can vary by more than 5% for many low-income countries. In many cases, these variations in PAR estimates comprised more than 10% of the total national population. The detailed country-level assessments suggested that none of the datasets was

  20. Asia-MIP: Multi Model-data Synthesis of Terrestrial Carbon Cycles in Asia

    Science.gov (United States)

    Ichii, K.; Kondo, M.; Ito, A.; Kang, M.; Sasai, T.; SATO, H.; Ueyama, M.; Kobayashi, H.; Saigusa, N.; Kim, J.

    2013-12-01

    Asia, which is characterized by monsoon climate and intense human activities, is one of the prominent understudied regions in terms of terrestrial carbon budgets and mechanisms of carbon exchange. To better understand terrestrial carbon cycle in Asia, we initiated multi-model and data intercomparison project in Asia (Asia-MIP). We analyzed outputs from multiple approaches: satellite-based observations (AVHRR and MODIS) and related products, empirically upscaled estimations (Support Vector Regression) using eddy-covariance observation network in Asia (AsiaFlux, CarboEastAsia, FLUXNET), ~10 terrestrial biosphere models (e.g. BEAMS, Biome-BGC, LPJ, SEIB-DGVM, TRIFFID, VISIT models), and atmospheric inversion analysis (e.g. TransCom models). We focused on the two difference temporal coverage: long-term (30 years; 1982-2011) and decadal (10 years; 2001-2010; data intensive period) scales. The regions of covering Siberia, Far East Asia, East Asia, Southeast Asia and South Asia (60-80E, 10S-80N), was analyzed in this study for assessing the magnitudes, interannual variability, and key driving factors of carbon cycles. We will report the progress of synthesis effort to quantify terrestrial carbon budget in Asia. First, we analyzed the recent trends in Gross Primary Productivities (GPP) using satellite-based observation (AVHRR) and multiple terrestrial biosphere models. We found both model outputs and satellite-based observation consistently show an increasing trend in GPP in most of the regions in Asia. Mechanisms of the GPP increase were analyzed using models, and changes in temperature and precipitation play dominant roles in GPP increase in boreal and temperate regions, whereas changes in atmospheric CO2 and precipitation are important in tropical regions. However, their relative contributions were different. Second, in the decadal analysis (2001-2010), we found that the negative GPP and carbon uptake anomalies in 2003 summer in Far East Asia is one of the largest