WorldWideScience

Sample records for potential distribution dataset

  1. Spatially-explicit estimation of geographical representation in large-scale species distribution datasets.

    Science.gov (United States)

    Kalwij, Jesse M; Robertson, Mark P; Ronk, Argo; Zobel, Martin; Pärtel, Meelis

    2014-01-01

    Much ecological research relies on existing multispecies distribution datasets. Such datasets, however, can vary considerably in quality, extent, resolution or taxonomic coverage. We provide a framework for a spatially-explicit evaluation of geographical representation within large-scale species distribution datasets, using the comparison of an occurrence atlas with a range atlas dataset as a working example. Specifically, we compared occurrence maps for 3773 taxa from the widely-used Atlas Florae Europaeae (AFE) with digitised range maps for 2049 taxa of the lesser-known Atlas of North European Vascular Plants. We calculated the level of agreement at a 50-km spatial resolution using average latitudinal and longitudinal species range, and area of occupancy. Agreement in species distribution was calculated and mapped using Jaccard similarity index and a reduced major axis (RMA) regression analysis of species richness between the entire atlases (5221 taxa in total) and between co-occurring species (601 taxa). We found no difference in distribution ranges or in the area of occupancy frequency distribution, indicating that atlases were sufficiently overlapping for a valid comparison. The similarity index map showed high levels of agreement for central, western, and northern Europe. The RMA regression confirmed that geographical representation of AFE was low in areas with a sparse data recording history (e.g., Russia, Belarus and the Ukraine). For co-occurring species in south-eastern Europe, however, the Atlas of North European Vascular Plants showed remarkably higher richness estimations. Geographical representation of atlas data can be much more heterogeneous than often assumed. Level of agreement between datasets can be used to evaluate geographical representation within datasets. Merging atlases into a single dataset is worthwhile in spite of methodological differences, and helps to fill gaps in our knowledge of species distribution ranges. Species distribution
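
    The per-cell agreement measure described above can be sketched as a Jaccard index over sets of occupied grid cells. This is an illustrative reconstruction, not the authors' code; the cell identifiers and the two occupancy sets below are hypothetical.

```python
def jaccard(cells_a, cells_b):
    """Jaccard similarity between two sets of occupied 50-km grid cells."""
    a, b = set(cells_a), set(cells_b)
    if not (a | b):
        return 0.0
    return len(a & b) / len(a | b)

# Hypothetical occupied-cell IDs for one taxon in each atlas
afe = {"10N20E", "10N21E", "11N20E"}
anevp = {"10N20E", "11N20E", "11N21E"}
print(jaccard(afe, anevp))  # 2 shared cells / 4 total = 0.5
```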

  2. Sensitivity of Distributions of Climate System Properties to Surface Temperature Datasets

    Science.gov (United States)

    Libardoni, A. G.; Forest, C. E.

    2011-12-01

    Predictions of climate change from models depend strongly on the representation of climate system properties emerging from the processes and feedbacks in the models. The quality of any model prediction can be evaluated by determining how well its output reproduces the observed climate system. With this evaluation, the reliability of climate projections derived from the model and provided for policy makers can be assessed and quantified. In this study, surface temperature, upper-air temperature, and ocean heat content data are used to constrain the distributions of the parameters that define three climate system properties in the MIT Integrated Global Systems Model: climate sensitivity, the rate of ocean heat uptake into the deep ocean, and net anthropogenic aerosol forcing. In particular, we explore the sensitivity of the distributions to the surface temperature dataset used to estimate the likelihood of model output given the observed climate records. In total, five different reconstructions of past surface temperatures are used, and the resulting parameter distribution functions differ from each other. Differences in estimates of the climate sensitivity mode and mean are as great as 1 K between the datasets, with an overall range of 1.2 to 5.3 K using the 5-95% confidence intervals. Ocean effective diffusivity is poorly constrained regardless of which dataset is used: all five distributions are broad, and only three show signs of a mode, which, when present, tends to lie at low diffusivity values. Distributions for the net aerosol forcing have similar shapes and cluster into two groups that are shifted by approximately 0.1 watts per square meter. However, the overall spread of forcing values from the 5-95% confidence interval, -0.19 to -0.83 watts per square meter, is small compared to other uncertainties in climate forcings. Transient climate response estimates derived from these distributions range between 0.87 and 2.41 K. Similar to the
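
    The 5-95% intervals quoted above can be read off a sorted sample of posterior draws. A minimal stdlib sketch, assuming a linearly interpolated empirical percentile; the sample values are invented for illustration, not the study's data.

```python
def percentile(samples, q):
    """Linear-interpolation empirical percentile (q in [0, 1])."""
    s = sorted(samples)
    pos = q * (len(s) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)

# Illustrative climate-sensitivity draws (K); not the paper's data
samples = [1.5, 2.0, 2.2, 2.5, 2.8, 3.0, 3.3, 3.9, 4.5, 5.1]
print(percentile(samples, 0.05), percentile(samples, 0.95))  # 5th and 95th percentiles
```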

  3. Global distribution of urban parameters derived from high-resolution global datasets for weather modelling

    Science.gov (United States)

    Kawano, N.; Varquez, A. C. G.; Dong, Y.; Kanda, M.

    2016-12-01

    Numerical models such as the Weather Research and Forecasting model coupled with a single-layer Urban Canopy Model (WRF-UCM) are powerful tools for investigating the urban heat island. Urban parameters such as average building height (Have), plan area index (λp) and frontal area index (λf) are necessary inputs for the model. In general, these parameters are assumed uniform in WRF-UCM, which leads to an unrealistic urban representation. Distributed urban parameters can also be incorporated into WRF-UCM to capture detailed urban effects. The problem is that distributed building information is not readily available for most megacities, especially in developing countries; furthermore, acquiring real building parameters often requires a huge amount of time and money. In this study, we investigated the potential of using globally available satellite-captured datasets to estimate the parameters Have, λp, and λf. The global datasets comprised a high-spatial-resolution population dataset (LandScan, by Oak Ridge National Laboratory), nighttime lights (NOAA), and vegetation fraction (NASA). True samples of Have, λp, and λf were acquired from actual building footprints derived from satellite images and from 3D building databases of Tokyo, New York, Paris, Melbourne, Istanbul, Jakarta and other cities. Regression equations were then derived from block-averaged spatial pairs of the real parameters and the global datasets. Results show that two regression curves are necessary to estimate Have and λf from the combination of population and nightlights, depending on a city's level of development; Gross Domestic Product (GDP) can serve as the index for deciding which equation to use for a given city. On the other hand, λp has little dependence on GDP but shows a negative relationship to vegetation fraction. Finally, this simplified but precise approximation of urban parameters through readily available, high-resolution global datasets and our derived regressions can be utilized to estimate a
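
    The block-averaged regression described above pairs a globally available predictor with a sample of true parameters and fits a curve. A minimal ordinary-least-squares sketch for a single predictor; the predictor (log population density) and response (average building height) values are hypothetical, not the study's data.

```python
def linreg(xs, ys):
    """Ordinary least squares fit of y = a + b*x (returns intercept a, slope b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

# Hypothetical block-averaged pairs: log population density vs. building height (m)
logpop = [1.0, 2.0, 3.0, 4.0]
have_m = [5.0, 7.0, 9.0, 11.0]
a, b = linreg(logpop, have_m)
print(a, b)  # intercept 3.0, slope 2.0
```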

  4. Parton Distributions based on a Maximally Consistent Dataset

    Science.gov (United States)

    Rojo, Juan

    2016-04-01

    The choice of data that enters a global QCD analysis can have a substantial impact on the resulting parton distributions and their predictions for collider observables. One of the main reasons for this is the possible presence of inconsistencies, either internal within an experiment or external between different experiments. In order to assess the robustness of the global fit, different definitions of a conservative PDF set, that is, a PDF set based on a maximally consistent dataset, have been introduced. However, these approaches are typically affected by theory biases in the selection of the dataset. In this contribution, after a brief overview of recent NNPDF developments, we propose a new, fully objective definition of a conservative PDF set, based on the Bayesian reweighting approach. Using the new NNPDF3.0 framework, we produce various conservative sets, which turn out to be in mutual agreement within the respective PDF uncertainties, as well as with the global fit. We explore some of their implications for LHC phenomenology, again finding good consistency with the global fit result. These results provide a non-trivial validation test of the new NNPDF3.0 fitting methodology, and indicate that possible inconsistencies in the fitted dataset do not substantially affect the global fit PDFs.

  5. Research on social communication network evolution based on topology potential distribution

    Science.gov (United States)

    Zhao, Dongjie; Jiang, Jian; Li, Deyi; Zhang, Haisu; Chen, Guisheng

    2011-12-01

    Aiming at the problem of social communication network evolution, topology potential is first introduced to measure the local influence among nodes in networks. Second, from the perspective of topology potential distribution, a method for describing network evolution based on topology potential distribution is presented, which takes artificial intelligence with uncertainty as its basic theory and local influence among nodes as its essence. A social communication network is then constructed from the Enron email dataset, and the presented method is used to analyze the characteristics of its evolution. Some useful conclusions are obtained, implying that the method is effective and showing that topology potential distribution can effectively describe sociological characteristics and detect local changes in a social communication network.
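
    Topology potential is commonly defined as a Gaussian-decay field over network distance, φ(v_i) = Σ_j exp(-(d_ij/σ)²). A small stdlib sketch under that assumption (the paper's exact mass terms and choice of σ are not given here), using hop distance on a toy adjacency list.

```python
import math
from collections import deque

def hop_distances(graph, src):
    """BFS hop counts from src in an adjacency-list graph."""
    dist, queue = {src: 0}, deque([src])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def topology_potential(graph, node, sigma=1.0):
    """Gaussian-decay potential: sum over the other nodes of exp(-(d/sigma)^2)."""
    d = hop_distances(graph, node)
    return sum(math.exp(-(h / sigma) ** 2) for n, h in d.items() if n != node)

# Toy network; node "c" is the most central
g = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"], "d": ["c"]}
print(round(topology_potential(g, "c"), 3))
```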

  6. Reconstructing missing information on precipitation datasets: impact of tails on adopted statistical distributions.

    Science.gov (United States)

    Pedretti, Daniele; Beckie, Roger Daniel

    2014-05-01

    Missing data are ubiquitous in practical hydrological time-series databases, yet it is of fundamental importance to make educated decisions in problems that require exhaustive time-series knowledge. This includes precipitation datasets, since recording or human failures can produce gaps in these time series. For applications directly involving the ratio between precipitation and some other quantity, the lack of complete information can result in poor understanding of basic physical and chemical dynamics involving precipitated water. For instance, the ratio between precipitation (recharge) and outflow rates at a discharge point of an aquifer (e.g. rivers, pumping wells, lysimeters) can be used to obtain aquifer parameters and thus to constrain model-based predictions. We tested a suite of methodologies to reconstruct missing information in rainfall datasets. The goal was to obtain a suitable and versatile method that reduces the errors caused by the lack of data in specific time windows. Our analyses included both a classical chronological pairing approach between rainfall stations and a probability-based approach, which accounted for the probability of exceedance of rain depths measured at two or more stations. Our analyses showed that no method is a priori the best; rather, the selection should be based on the specific statistical properties of the rainfall dataset. In this presentation, our emphasis is on the effects of a few typical parametric distributions used to model the behavior of rainfall. Specifically, we analyzed the role of distributional "tails", which exert an important control on the occurrence of extreme rainfall events. The latter strongly affect several hydrological applications, including recharge-discharge relationships. The heavy-tailed distributions we considered were the parametric Log-Normal, Generalized Pareto, Generalized Extreme Value and Gamma distributions.
The methods were
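
    One probability-based reconstruction consistent with the approach described above is empirical quantile mapping between a gappy station and a complete neighbour: read off the observed value's empirical quantile at station B and take the value at the same quantile in station A's record. A hedged sketch, not the authors' implementation; the two historical records are toy data.

```python
import bisect

def quantile_map(value_b, hist_b, hist_a):
    """Map a reading at station B to station A via matched empirical quantiles."""
    b, a = sorted(hist_b), sorted(hist_a)
    # empirical non-exceedance probability of the observed value at B
    q = bisect.bisect_right(b, value_b) / len(b)
    # value at the same quantile position in A's record
    idx = min(int(q * len(a)), len(a) - 1)
    return a[idx]

# Toy records: station A is systematically wetter than station B
hist_b = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
hist_a = [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
print(quantile_map(4, hist_b, hist_a))  # median at B maps to median at A: 10
```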

  7. A Multi-Resolution Spatial Model for Large Datasets Based on the Skew-t Distribution

    KAUST Repository

    Tagle, Felipe

    2017-12-06

    Large, non-Gaussian spatial datasets pose a considerable modeling challenge, as the dependence structure implied by the model needs to be captured at different scales while retaining feasible inference. Skew-normal and skew-t distributions have only recently begun to appear in the spatial statistics literature, without much consideration, however, for the ability to capture dependence at multiple resolutions and simultaneously achieve feasible inference for increasingly large datasets. This article presents the first multi-resolution spatial model inspired by the skew-t distribution, where a large-scale effect follows a multivariate normal distribution and the fine-scale effects follow a multivariate skew-normal distribution. The resulting marginal distribution for each region is skew-t, thereby allowing for greater flexibility in capturing the skewness and heavy tails characterizing many environmental datasets. Likelihood-based inference is performed using a Monte Carlo EM algorithm. The model is applied as a stochastic generator of daily wind speeds over Saudi Arabia.
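
    The hierarchical construction described above (Gaussian large-scale effect plus skew-normal fine-scale effect) can be sketched with the standard Azzalini |Z| representation of the skew-normal. This is an illustrative univariate sampler, not the paper's multivariate model; the parameter values are arbitrary.

```python
import math
import random

def skew_normal(alpha, rng=random):
    """Draw from a standard skew-normal via the Azzalini |Z| construction."""
    delta = alpha / math.sqrt(1 + alpha ** 2)
    u0, u1 = rng.gauss(0, 1), rng.gauss(0, 1)
    return delta * abs(u0) + math.sqrt(1 - delta ** 2) * u1

def two_scale_sample(alpha=4.0, coarse_sd=1.0, fine_sd=0.5):
    """Gaussian large-scale effect plus a skew-normal fine-scale effect."""
    return random.gauss(0, coarse_sd) + fine_sd * skew_normal(alpha)
```

With positive `alpha` the fine-scale term is right-skewed, so the sum departs from a symmetric Gaussian in the way the skew-t marginal does.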

  8. Distributed solar photovoltaic array location and extent dataset for remote sensing object identification

    Science.gov (United States)

    Bradbury, Kyle; Saboo, Raghav; Johnson, Timothy L.; Malof, Jordan M.; Devarajan, Arjun; Zhang, Wuming; Collins, Leslie M.; Newell, Richard G.

    2016-12-01

    Earth-observing remote sensing data, including aerial photography and satellite imagery, offer a snapshot of the world from which we can learn about the state of natural resources and the built environment. The components of energy systems that are visible from above can be automatically assessed with these remote sensing data when processed with machine learning methods. Here, we focus on the information gap concerning distributed solar photovoltaic (PV) arrays, for which there is limited public data on deployments at small geographic scales. We created a dataset of solar PV arrays to initiate and develop the process of automatically identifying solar PV locations using remote sensing imagery. This dataset contains the geospatial coordinates and border vertices for over 19,000 solar panels across 601 high-resolution images from four cities in California. Dataset applications include training object detection and other machine learning algorithms that use remote sensing imagery, developing specific algorithms for predictive detection of distributed PV systems, estimating installed PV capacity, and analysis of the socioeconomic correlates of PV deployment.
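
    Given the border vertices the dataset provides, a panel's footprint area follows from the shoelace formula. A small sketch; the example vertices are hypothetical image coordinates, not records from the dataset.

```python
def shoelace_area(vertices):
    """Polygon area from an ordered list of border vertices (shoelace formula)."""
    n = len(vertices)
    s = 0.0
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        s += x1 * y2 - x2 * y1
    return abs(s) / 2.0

# Hypothetical panel footprint in image coordinates (pixels)
panel = [(0, 0), (4, 0), (4, 3), (0, 3)]
print(shoelace_area(panel))  # 12.0
```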

  9. Productivity Levels in Distributive Trades : A New ICOP Dataset for OECD Countries

    NARCIS (Netherlands)

    Timmer, Marcel P.; Ypma, Gerard

    2006-01-01

    This study provides a new dataset for international comparisons of labour productivity levels in distributive trade (retail and wholesale trade) between OECD countries. The productivity level comparisons are based on a harmonised set of Purchasing Power Parities (PPPs) for 1997 using the

  10. SciSpark's SRDD : A Scientific Resilient Distributed Dataset for Multidimensional Data

    Science.gov (United States)

    Palamuttam, R. S.; Wilson, B. D.; Mogrovejo, R. M.; Whitehall, K. D.; Mattmann, C. A.; McGibbney, L. J.; Ramirez, P.

    2015-12-01

    Remote sensing data and climate model output are multi-dimensional arrays of massive size locked away in heterogeneous file formats (HDF5/4, NetCDF 3/4) and metadata models (HDF-EOS, CF), making it difficult to perform multi-stage, iterative science processing, since each stage requires writing and reading data to and from disk. We have developed SciSpark, a robust Big Data framework that extends Apache Spark for scaling scientific computations. Apache Spark improves on the map-reduce implementation in Apache Hadoop for parallel computing on a cluster by emphasizing in-memory computation, "spilling" to disk only as needed, and relying on lazy evaluation. Central to Spark is the Resilient Distributed Dataset (RDD), an in-memory distributed data structure that extends the functional paradigm provided by the Scala programming language. However, RDDs are suited to tabular or unstructured data, not to highly dimensional data. The SciSpark project introduces the Scientific Resilient Distributed Dataset (sRDD), a distributed-computing array structure which supports iterative scientific algorithms for multidimensional data. SciSpark processes data stored in NetCDF and HDF files by partitioning them across time or space and distributing the partitions among a cluster of compute nodes. We show the usability and extensibility of SciSpark by implementing distributed algorithms for geospatial operations on large collections of multi-dimensional grids. In particular we address the problem of scaling an automated method for finding Mesoscale Convective Complexes. SciSpark provides a tensor interface to support the pluggability of different matrix libraries, and we evaluate the performance of various matrix libraries, such as Nd4j and Breeze, in distributed pipelines. We detail the architecture and design of SciSpark, our efforts to integrate climate science algorithms, and parallel ingest and partitioning (sharding) of A-Train satellite observations and model grids. These
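
    The sRDD idea of partitioning an array along the time axis and mapping/reducing over the partitions can be caricatured in a few lines of plain Python (no Spark involved); the toy data cube and partition sizes are invented.

```python
def partition_by_time(cube, n_parts):
    """Split a (time, lat, lon)-like nested list along the time axis."""
    step = max(1, len(cube) // n_parts)
    return [cube[i:i + step] for i in range(0, len(cube), step)]

def grid_mean(part):
    """Mean over every grid value in a time partition."""
    vals = [v for t in part for row in t for v in row]
    return sum(vals) / len(vals)

# Toy 4-step, 2x2 data cube; each time slice has rows [t, t+1]
cube = [[[t + r for r in range(2)] for _ in range(2)] for t in range(4)]
parts = partition_by_time(cube, 2)
means = list(map(grid_mean, parts))   # "map" over distributed partitions
overall = sum(means) / len(means)     # "reduce" to a single statistic
print(means, overall)
```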

  11. Spatiotemporal dataset on Chinese population distribution and its driving factors from 1949 to 2013

    Science.gov (United States)

    Wang, Lizhe; Chen, Lajiao

    2016-07-01

    Spatio-temporal data on human population and its driving factors are critical to understanding and responding to population problems. Unfortunately, such data on a large scale and over the long term are often difficult to obtain. Here, we present a dataset on Chinese population distribution and its driving factors over a remarkably long period, from 1949 to 2013. Driving factors of population distribution were selected according to the push-pull migration laws and summarized into four categories: natural environment, natural resources, economic factors and social factors. Natural environment and natural resources indicators were calculated using Geographic Information System (GIS) and Remote Sensing (RS) techniques, whereas economic and social factors from 1949 to 2013 were collected from the China Statistical Yearbook and the China Compendium of Statistics (1949 to 2008). All of the data were quality controlled and unified into a single dataset with the same spatial scope and time period. The dataset is expected to be useful for understanding how population responds to and impacts environmental change.

  12. Genome-wide gene expression dataset used to identify potential therapeutic targets in androgenetic alopecia

    Directory of Open Access Journals (Sweden)

    R. Dey-Rao

    2017-08-01

    The microarray dataset attached to this report is related to the research article titled "A genomic approach to susceptibility and pathogenesis leads to identifying potential novel therapeutic targets in androgenetic alopecia" (Dey-Rao and Sinha, 2017 [1]). Male-pattern hair loss induced by androgens (testosterone) in genetically predisposed individuals is known as androgenetic alopecia (AGA). The raw dataset is being made publicly available to enable critical and/or extended analyses. Our related research paper uses the attached raw dataset for genome-wide gene-expression investigations. Combined with several in silico bioinformatics-based analyses, we were able to delineate five strategic molecular elements as potential novel targets for future AGA therapy.

  13. Global heating distributions for January 1979 calculated from GLA assimilated and simulated model-based datasets

    Science.gov (United States)

    Schaack, Todd K.; Lenzen, Allen J.; Johnson, Donald R.

    1991-01-01

    This study surveys the large-scale distribution of heating for January 1979 obtained from five sources of information. Through intercomparison of these distributions, with emphasis on satellite-derived information, we investigate the global distribution of atmospheric heating and the impact of observations on diagnostic estimates of heating derived from assimilated datasets. The results indicate a substantial impact of satellite information on diagnostic estimates of heating in regions where conventional observations are scarce. The addition of satellite data provides information on the atmosphere's temperature and wind structure that is important for estimating the global distribution of heating and energy exchange.

  14. Modelling the potential distribution of Betula utilis in the Himalaya

    Directory of Open Access Journals (Sweden)

    Maria Bobrowski

    2017-07-01

    Developing sustainable adaptation pathways under climate change conditions in mountain regions requires accurate predictions of treeline shifts and future distribution ranges of treeline species. Here, we model for the first time the potential distribution of Betula utilis, a principal Himalayan treeline species, to provide a basis for the analysis of future range shifts. Our target species is widespread at alpine treelines, and its distribution extends across the entire Himalayan mountain range. Our objective is to model the potential distribution of B. utilis in relation to current climate conditions. We generated a dataset of 590 occurrence records and used 24 variables for ecological niche modelling. We calibrated Generalized Linear Models using the Akaike Information Criterion (AIC) and evaluated model performance using threshold-independent (AUC, Area Under the Curve) and threshold-dependent (TSS, True Skill Statistic) characteristics as well as visual assessments of projected distribution maps. We found two temperature-related variables (Mean Temperature of the Wettest Quarter, Temperature Annual Range) and three precipitation-related variables (Precipitation of the Coldest Quarter, Average Precipitation of March, April and May, and Precipitation Seasonality) to be useful for predicting the potential distribution of B. utilis. All models had high predictive power (AUC ≥ 0.98 and TSS ≥ 0.89). The projected suitable area in the Himalayan mountains varies considerably, with the most extensive distribution in the western and central Himalayan region. A substantial difference between the potential and the real distribution in the eastern Himalaya points to decreasing competitiveness of B. utilis under the more oceanic conditions in the eastern part of the mountain system. A comparison between the vegetation map of Schweinfurth (1957) and our current predictions suggests that B. utilis does not reach the upper elevational limit in
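
    The threshold-dependent TSS reported above is sensitivity plus specificity minus one, computed from a confusion matrix of presence/absence predictions. A minimal sketch with invented counts.

```python
def tss(tp, fp, fn, tn):
    """True Skill Statistic: sensitivity + specificity - 1."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity + specificity - 1

# Invented confusion-matrix counts for illustration
print(tss(tp=45, fp=5, fn=5, tn=45))  # 0.9 + 0.9 - 1 = 0.8
```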

  15. One click dataset transfer: toward efficient coupling of distributed storage resources and CPUs

    Czech Academy of Sciences Publication Activity Database

    Zerola, Michal; Lauret, J.; Barták, R.; Šumbera, Michal

    2012-01-01

    Roč. 368, 012022 (2012), s. 1-10 ISSN 1742-6588. [14th International Workshop on Advanced Computing and Analysis Techniques in Physics Research (ACAT). Uxbridge, 05.09.2011-09.09.2011] R&D Projects: GA MŠk LC07048; GA MŠk LA09013 Institutional support: RVO:61389005 Keywords : distributed storage * Grid computing * dataset transfer Subject RIV: BG - Nuclear, Atomic and Molecular Physics, Colliders http://iopscience.iop.org/1742-6596/368/1/012022/pdf/1742-6596_368_1_012022.pdf

  16. A dataset mapping the potential biophysical effects of vegetation cover change

    Science.gov (United States)

    Duveiller, Gregory; Hooker, Josh; Cescatti, Alessandro

    2018-02-01

    Changing the vegetation cover of the Earth has impacts on the biophysical properties of the surface and ultimately on the local climate. Depending on the specific type of vegetation change and on the background climate, the resulting competing biophysical processes can have a net warming or cooling effect, which can further vary both spatially and seasonally. Due to uncertain climate impacts and the lack of robust observations, biophysical effects are not yet considered in land-based climate policies. Here we present a dataset based on satellite remote sensing observations that provides the potential changes (i) of the full surface energy balance, (ii) at the global scale, and (iii) for multiple vegetation transitions, as is now required for the comprehensive evaluation of land-based mitigation plans. We anticipate that this dataset will provide valuable information to benchmark Earth system models, to assess future scenarios of land cover change, and to develop the monitoring, reporting and verification guidelines required for the implementation of mitigation plans that account for biophysical land processes.

  17. The Geometry of Finite Equilibrium Datasets

    DEFF Research Database (Denmark)

    Balasko, Yves; Tvede, Mich

    We investigate the geometry of finite datasets defined by equilibrium prices, income distributions, and total resources. We show that the equilibrium condition imposes no restrictions if total resources are collinear, a property that is robust to small perturbations. We also show that the set of equilibrium datasets is path-connected when the equilibrium condition does impose restrictions on datasets, as for example when total resources are widely non-collinear.

  18. HARVESTING, INTEGRATING AND DISTRIBUTING LARGE OPEN GEOSPATIAL DATASETS USING FREE AND OPEN-SOURCE SOFTWARE

    Directory of Open Access Journals (Sweden)

    R. Oliveira

    2016-06-01

    Federal, state and local government agencies in the USA are investing heavily in the dissemination of Open Data sets produced by each of them. The main driver behind this thrust is to increase agencies' transparency and accountability, as well as to improve citizens' awareness. However, not all Open Data sets are easy to access and integrate with other Open Data sets, even those available from the same agency. The City and County of Denver Open Data Portal distributes several types of geospatial datasets, one of which is the city parcel information layer, containing 224,256 records. Although this data layer contains many pieces of information, it is incomplete for some custom purposes. Open-source software was used first to collect data from diverse City of Denver Open Data sets, then to upload them to a repository in the cloud, where they were processed using a PostgreSQL installation and Python scripts. Our method was able to extract non-spatial information from a 'not-ready-to-download' source that could then be combined with the initial dataset to enhance its potential use.

  19. Large Survey Database: A Distributed Framework for Storage and Analysis of Large Datasets

    Science.gov (United States)

    Juric, Mario

    2011-01-01

    The Large Survey Database (LSD) is a Python framework and DBMS for distributed storage, cross-matching and querying of large survey catalogs (>10^9 rows, >1 TB). The primary driver behind its development is the analysis of Pan-STARRS PS1 data. It is specifically optimized for fast queries and parallel sweeps of positionally and temporally indexed datasets. It transparently scales to more than 10^2 nodes, and can be made to function in "shared nothing" architectures. An LSD database consists of a set of vertically and horizontally partitioned tables, physically stored as compressed HDF5 files. Vertically, we partition the tables into groups of related columns ("column groups"), storing together logically related data (e.g., astrometry, photometry). Horizontally, the tables are partitioned into partially overlapping "cells" by position in space (lon, lat) and time (t). This organization allows for fast lookups based on spatial and temporal coordinates, as well as data and task distribution. The design was inspired by the success of Google BigTable (Chang et al., 2006). Our programming model is a pipelined extension of MapReduce (Dean and Ghemawat, 2004). An SQL-like query language is used to access data. For complex tasks, map-reduce "kernels" that operate on query results on a per-cell basis can be written, with the framework taking care of scheduling and execution. The combination leverages users' familiarity with SQL, while offering a fully distributed computing environment. LSD adds little overhead compared to direct Python file I/O. In tests, we swept through 1.1 billion rows of Pan-STARRS+SDSS data (220 GB) in less than 15 minutes on a dual-CPU machine. In a cluster environment, we achieved bandwidths of 17 Gbit/s (I/O limited). Based on current experience, we believe LSD should scale to be useful for the analysis and storage of LSST-scale datasets. It can be downloaded from http://mwscience.net/lsd.
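
    The horizontal partitioning described above assigns each row to a cell keyed by position and time. A toy sketch of such a binning; the bin widths are arbitrary illustrations and do not reflect LSD's actual cell scheme.

```python
import math

def cell_key(lon, lat, t, dlon=1.0, dlat=1.0, dt=30.0):
    """Integer (lon, lat, time) cell index for an observation; illustrative binning only."""
    return (math.floor(lon / dlon), math.floor(lat / dlat), math.floor(t / dt))

# Rows with the same key land in the same cell, so per-cell kernels see them together
print(cell_key(12.3, -45.6, 100.0))  # (12, -46, 3)
```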

  20. Data-driven mapping of the potential mountain permafrost distribution.

    Science.gov (United States)

    Deluigi, Nicola; Lambiel, Christophe; Kanevski, Mikhail

    2017-07-15

    Existing mountain permafrost distribution models generally offer a good overview of the potential extent of this phenomenon at a regional scale. They are, however, not always able to reproduce the high spatial discontinuity of permafrost at the micro-scale (the scale of a specific landform; ten to several hundred meters). To overcome this limitation, we tested an alternative modelling approach using three classification algorithms from statistics and machine learning: logistic regression, support vector machines and random forests. These supervised learning techniques infer a classification function from labelled training data (pixels of permafrost absence and presence) with the aim of predicting the permafrost occurrence where it is unknown. The research was carried out in a 588 km² area of the Western Swiss Alps. Permafrost evidence was mapped from ortho-image interpretation (rock glacier inventorying) and field data (mainly geoelectrical and thermal data). The relationship between the selected permafrost evidence and permafrost controlling factors was computed with the mentioned techniques. Classification performance, assessed with the AUROC, ranges from 0.81 for logistic regression to 0.85 for support vector machines and 0.88 for random forests. The adopted machine learning algorithms proved to be effective for permafrost distribution modelling, giving results consistent with the field reality. The high resolution of the input dataset (10 m) allows elaborating maps at the micro-scale, with a modelled permafrost spatial distribution less optimistic than that of classic spatial models. Moreover, the probability output of the adopted algorithms offers a more precise overview of the potential distribution of mountain permafrost than simple indexes of permafrost favorability. These encouraging results also open the way to new possibilities for permafrost data analysis and mapping. Copyright © 2017 Elsevier B.V. All rights reserved.
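
    The AUROC scores quoted above can be computed directly as the probability that a randomly chosen presence pixel receives a higher predicted score than a randomly chosen absence pixel (ties counted as one half). A brute-force sketch on invented scores, not the study's predictions.

```python
def auroc(scores_pos, scores_neg):
    """AUROC as the probability a positive outscores a negative (ties = 0.5)."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# Invented permafrost-probability scores for presence and absence pixels
print(auroc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2]))
```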

  1. Comparison of CORA and EN4 in-situ datasets validation methods, toward a better quality merged dataset.

    Science.gov (United States)

    Szekely, Tanguy; Killick, Rachel; Gourrion, Jerome; Reverdin, Gilles

    2017-04-01

    CORA and EN4 are both global delayed-mode validated in-situ ocean temperature and salinity datasets, distributed by the Met Office (http://www.metoffice.gov.uk/) and Copernicus (www.marine.copernicus.eu) respectively. A large part of the profiles distributed by CORA and EN4 in recent years are Argo profiles from the Argo DAC, but profiles are also extracted from the World Ocean Database, along with TESAC profiles from GTSPP. In the case of CORA, data from the EuroGOOS Regional Operational Observing Systems (ROOS), operated by European institutes not managed by National Data Centres, and other profile datasets provided by scientific sources can also be found (sea mammal profiles from MEOP, XBT datasets from cruises, ...). EN4 also takes data from the ASBO dataset to supplement observations in the Arctic. The first advantage of this new merged product is to enhance the space and time coverage at global and European scales for the period from 1950 until a year before the current year. This product is updated once a year, and T&S gridded fields are also generated for the period from 1990 to year n-1. The enhancement compared to the previous CORA product will be presented. Despite the fact that the profiles distributed by both datasets are mostly the same, the quality control procedures developed by the Met Office and Copernicus teams differ, sometimes leading to different quality control flags for the same profile. A new study started in 2016 that aims to compare both validation procedures, to move towards a Copernicus Marine Service dataset with the best features of CORA and EN4 validation. A reference dataset composed of the full set of in-situ temperature and salinity measurements collected by Coriolis during 2015 is used. These measurements were made with a wide range of instruments (XBTs, CTDs, Argo floats, instrumented sea mammals, ...), covering the global ocean. The reference dataset has been validated simultaneously by both teams. An exhaustive comparison of the
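
    The flag-by-flag comparison of the two validation chains reduces to finding profiles to which CORA and EN4 assign different quality-control flags. A toy sketch; the profile IDs and flag values are invented.

```python
def flag_disagreements(flags_a, flags_b):
    """IDs of shared profiles flagged differently by the two validation chains."""
    shared = flags_a.keys() & flags_b.keys()
    return {pid for pid in shared if flags_a[pid] != flags_b[pid]}

# Invented QC flags per profile ID for each dataset
cora = {"p1": "good", "p2": "bad", "p3": "good"}
en4 = {"p1": "good", "p2": "good", "p4": "bad"}
print(flag_disagreements(cora, en4))  # {'p2'}
```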

  2. Knowledge Evolution in Distributed Geoscience Datasets and the Role of Semantic Technologies

    Science.gov (United States)

    Ma, X.

    2014-12-01

    Knowledge evolves in geoscience, and the evolution is reflected in datasets. In a context with distributed data sources, the evolution of knowledge may cause considerable challenges to data management and re-use. For example, a short news published in 2009 (Mascarelli, 2009) revealed the geoscience community's concern that the International Commission on Stratigraphy's change to the definition of Quaternary may bring heavy reworking of geologic maps. Now we are in the era of the World Wide Web, and geoscience knowledge is increasingly modeled and encoded in the form of ontologies and vocabularies by using semantic technologies. Accordingly, knowledge evolution leads to a consequence called ontology dynamics. Flouris et al. (2008) summarized 10 topics of general ontology changes/dynamics such as: ontology mapping, morphism, evolution, debugging and versioning, etc. Ontology dynamics makes impacts at several stages of a data life cycle and causes challenges, such as: the request for reworking of the extant data in a data center, semantic mismatch among data sources, differentiated understanding of a same piece of dataset between data providers and data users, as well as error propagation in cross-discipline data discovery and re-use (Ma et al., 2014). This presentation will analyze the best practices in the geoscience community so far and summarize a few recommendations to reduce the negative impacts of ontology dynamics in a data life cycle, including: communities of practice and collaboration on ontology and vocabulary building, link data records to standardized terms, and methods for (semi-)automatic reworking of datasets using semantic technologies. References: Flouris, G., Manakanatas, D., Kondylakis, H., Plexousakis, D., Antoniou, G., 2008. Ontology change: classification and survey. The Knowledge Engineering Review 23 (2), 117-152. Ma, X., Fox, P., Rozell, E., West, P., Zednik, S., 2014. Ontology dynamics in a data life cycle: Challenges and recommendations

  3. A Dataset for Three-Dimensional Distribution of 39 Elements Including Plant Nutrients and Other Metals and Metalloids in the Soils of a Forested Headwater Catchment.

    Science.gov (United States)

    Wu, B; Wiekenkamp, I; Sun, Y; Fisher, A S; Clough, R; Gottselig, N; Bogena, H; Pütz, T; Brüggemann, N; Vereecken, H; Bol, R

    2017-11-01

    Quantification and evaluation of elemental distribution in forested ecosystems are key requirements to understand element fluxes and their relationship with hydrological and biogeochemical processes in the system. However, datasets supporting such a study on the catchment scale are still limited. Here we provide a dataset comprising spatially highly resolved distributions of 39 elements in soil profiles of a small forested headwater catchment in western Germany () to gain a holistic picture of the state and fluxes of elements in the catchment. The elements include both plant nutrients and other metals and metalloids that were predominantly derived from lithospheric or anthropogenic inputs, thereby allowing us not only to capture the nutrient status of the catchment but also to estimate the functional development of the ecosystem. Soil samples were collected at high lateral resolution (≤60 m), and element concentrations were determined vertically for four soil horizons (L/Of, Oh, A, B). From this, a three-dimensional view of the distribution of these elements could be established with high spatial resolution on the catchment scale in a temperate natural forested ecosystem. The dataset can be combined with other datasets and studies of the TERENO (Terrestrial Environmental Observatories) Data Discovery Portal () to reveal elemental fluxes, establish relations between elements and other soil properties, and/or as input for modeling elemental cycling in temperate forested ecosystems. Copyright © by the American Society of Agronomy, Crop Science Society of America, and Soil Science Society of America, Inc.

  4. On sample size and different interpretations of snow stability datasets

    Science.gov (United States)

    Schirmer, M.; Mitterer, C.; Schweizer, J.

    2009-04-01

    aspect distributions to the large dataset. We used 100 different subsets for each sample size. Statistical variations obtained in the complete dataset were also tested on the smaller subsets using the Mann-Whitney or the Kruskal-Wallis test. For each subset size, we counted the number of subsets in which the significance level was reached. For these tests no nominal data scale was assumed. (iii) For the same subsets described above, the distribution of the aspect median was determined, and we counted how often this distribution differed substantially from the distribution obtained with the complete dataset. Since two valid stability interpretations were available (an objective and a subjective interpretation, as described above), the effect of this arbitrary choice of interpretation on spatial variability results was tested. In over one third of the cases the two interpretations came to different results. The effect of these differences was studied with a method similar to that described in (iii): the distribution of the aspect median was determined for subsets of the complete dataset using both interpretations and compared against each other as well as against the results for the complete dataset. For the complete dataset the two interpretations showed mainly identical results. Therefore the subset size was determined from the point at which the results of the two interpretations converged. A universal result for the optimal subset size cannot be presented, since results differed between the situations contained in the dataset. The optimal subset size thus depends on the stability variation in a given situation, which is unknown initially. There are indications that for some situations even the complete dataset might not be large enough. At a subset size of approximately 25, the significant differences between aspect groups (as determined using the whole dataset) were obtained in only one out of five situations. In some situations, up to 20% of the subsets showed a
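The repeated-subsampling significance test described in (ii) can be sketched as follows; the stability scores, group means, and sample sizes are invented for illustration and are not the study's data:

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
# hypothetical stability scores for two aspect groups (invented values)
north = rng.normal(2.5, 1.0, 400)
south = rng.normal(3.2, 1.0, 400)

def significant_fraction(a, b, subset_size, n_subsets=100, alpha=0.05):
    """Draw repeated random subsets of each group and count how often the
    Mann-Whitney test still reaches the significance level."""
    hits = 0
    for _ in range(n_subsets):
        sub_a = rng.choice(a, subset_size, replace=False)
        sub_b = rng.choice(b, subset_size, replace=False)
        if mannwhitneyu(sub_a, sub_b).pvalue < alpha:
            hits += 1
    return hits / n_subsets
```

Plotting `significant_fraction` against `subset_size` shows how quickly statistical power decays as the sample shrinks, which is the question the abstract addresses.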

  5. Data Mining for Imbalanced Datasets: An Overview

    Science.gov (United States)

    Chawla, Nitesh V.

    A dataset is imbalanced if the classification categories are not approximately equally represented. Recent years brought increased interest in applying machine learning techniques to difficult "real-world" problems, many of which are characterized by imbalanced data. Additionally the distribution of the testing data may differ from that of the training data, and the true misclassification costs may be unknown at learning time. Predictive accuracy, a popular choice for evaluating performance of a classifier, might not be appropriate when the data is imbalanced and/or the costs of different errors vary markedly. In this Chapter, we discuss some of the sampling techniques used for balancing the datasets, and the performance measures more appropriate for mining imbalanced datasets.
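One of the simplest sampling techniques for balancing a dataset, random oversampling of the minority class, can be sketched as follows; this is a minimal illustration, not the chapter's full treatment:

```python
import random

def random_oversample(X, y, minority_label, seed=0):
    """Duplicate randomly chosen minority-class samples until the
    two classes are equally represented."""
    rnd = random.Random(seed)
    minority = [(x, t) for x, t in zip(X, y) if t == minority_label]
    majority = [(x, t) for x, t in zip(X, y) if t != minority_label]
    while len(minority) < len(majority):
        minority.append(rnd.choice(minority))
    data = minority + majority
    rnd.shuffle(data)
    Xr, yr = zip(*data)
    return list(Xr), list(yr)
```

Undersampling the majority class, or synthesizing new minority examples, follows the same interface with a different balancing rule.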

  6. Background qualitative analysis of the European reference life cycle database (ELCD) energy datasets - part II: electricity datasets.

    Science.gov (United States)

    Garraín, Daniel; Fazio, Simone; de la Rúa, Cristina; Recchioni, Marco; Lechón, Yolanda; Mathieux, Fabrice

    2015-01-01

    The aim of this paper is to identify areas of potential improvement in the European Reference Life Cycle Database (ELCD) electricity datasets. The revision is based on the data quality indicators described by the International Life Cycle Data system (ILCD) Handbook, applied on a sectorial basis. These indicators evaluate the technological, geographical and time-related representativeness of a dataset and its appropriateness in terms of completeness, precision and methodology. Results show that the ELCD electricity datasets are of very good quality in general terms; nevertheless, some findings and recommendations for improving the quality of the Life Cycle Inventories have been derived. Moreover, these results confirm the quality of the electricity-related datasets for any LCA practitioner, and provide insights into the limitations and assumptions underlying the datasets' modelling. Given this information, an LCA practitioner will be able to decide whether the use of the ELCD electricity datasets is appropriate for the goal and scope of the analysis to be conducted. The methodological approach would also be useful for dataset developers and reviewers seeking to improve the overall Data Quality Requirements of databases.

  7. Framework for Interactive Parallel Dataset Analysis on the Grid

    Energy Technology Data Exchange (ETDEWEB)

    Alexander, David A.; Ananthan, Balamurali; /Tech-X Corp.; Johnson, Tony; Serbo, Victor; /SLAC

    2007-01-10

    We present a framework for use at a typical Grid site to facilitate custom interactive parallel dataset analysis targeting terabyte-scale datasets of the type typically produced by large multi-institutional science experiments. We summarize the needs for interactive analysis and show a prototype solution that satisfies those needs. The solution consists of a desktop client tool and a set of Web Services that allow scientists to sign onto a Grid site, compose analysis script code to carry out physics analysis on datasets, distribute the code and datasets to worker nodes, collect the results back at the client, and construct professional-quality visualizations of the results.

  8. Modeling the Hydrological Regime of Turkana Lake (Kenya, Ethiopia) by Combining Spatially Distributed Hydrological Modeling and Remote Sensing Datasets

    Science.gov (United States)

    Anghileri, D.; Kaelin, A.; Peleg, N.; Fatichi, S.; Molnar, P.; Roques, C.; Longuevergne, L.; Burlando, P.

    2017-12-01

    Hydrological modeling in poorly gauged basins can benefit from the use of remote sensing datasets, although there are challenges associated with the mismatch in spatial and temporal scales between catchment-scale hydrological models and remote sensing products. We model the hydrological processes and long-term water budget of the Lake Turkana catchment, a transboundary basin between Kenya and Ethiopia, by integrating several remote sensing products into a spatially distributed and physically explicit model, Topkapi-ETH. Lake Turkana is the world's largest desert lake, draining a catchment of 145,500 km2. It has three main contributing rivers: the Omo River, which contributes most of the annual lake inflow, and the Turkwel and Kerio Rivers, which contribute the remaining part. The lake levels have shown great variations in the last decades due to long-term climate fluctuations and the regulation of three reservoirs, Gibe I, II, and III, which significantly alter the hydrological seasonality. Another large reservoir is planned and may be built in the next decade, generating concerns about the fate of Lake Turkana in the long run because of this additional anthropogenic pressure and increasing evaporation driven by climate change. We consider different remote sensing datasets, i.e., TRMM-V7 for precipitation and MERRA-2 for temperature, as inputs to the spatially distributed hydrological model. We validate the simulation results against other remote sensing datasets, i.e., GRACE for total water storage anomalies, GLDAS-NOAH for soil moisture, ERA-Interim/Land for surface runoff, and TOPEX/Poseidon for satellite altimetry. Results highlight how different remote sensing products can be integrated into a hydrological modeling framework accounting for their relative uncertainties. We also carried out simulations with the artificial reservoirs planned in the northern part of the catchment and without any reservoirs, to assess their impacts on the catchment hydrological

  9. Using Multiple Big Datasets and Machine Learning to Produce a New Global Particulate Dataset: A Technology Challenge Case Study

    Science.gov (United States)

    Lary, D. J.

    2013-12-01

    A BigData case study is described where multiple datasets from several satellites, high-resolution global meteorological data, social media and in-situ observations are combined using machine learning on a distributed cluster with an automated workflow. The global particulate dataset is relevant to global public health studies and would not be possible to produce without the use of the multiple big datasets, in-situ data and machine learning. To greatly reduce the development time and enhance the functionality, a high-level language capable of parallel processing (Matlab) has been used. Key considerations for the system are high-speed access due to the large data volume, persistence of the large data volumes, and a precise process time scheduling capability.

  10. Dataset for Testing Contamination Source Identification Methods for Water Distribution Networks

    Data.gov (United States)

    U.S. Environmental Protection Agency — This dataset includes the results of a simulation study using the source inversion techniques available in the Water Security Toolkit. The data was created to test...

  11. Robust computational analysis of rRNA hypervariable tag datasets.

    Directory of Open Access Journals (Sweden)

    Maksim Sipos

    Full Text Available Next-generation DNA sequencing is increasingly being utilized to probe microbial communities, such as gastrointestinal microbiomes, where it is important to be able to quantify measures of abundance and diversity. The fragmented nature of the 16S rRNA datasets obtained, coupled with their unprecedented size, has led to the recognition that the results of such analyses are potentially contaminated by a variety of artifacts, both experimental and computational. Here we quantify how multiple alignment and clustering errors contribute to overestimates of abundance and diversity, reflected by incorrect OTU assignment, corrupted phylogenies, inaccurate species diversity estimators, and rank abundance distribution functions. We show that straightforward procedural optimizations, combining preexisting tools, are effective in handling large (10^5-10^6) 16S rRNA datasets, and we describe metrics to measure the effectiveness and quality of the estimators obtained. We introduce two metrics to ascertain the quality of clustering of pyrosequenced rRNA data, and show that complete linkage clustering greatly outperforms other widely used methods.
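Complete-linkage clustering of the kind the authors favor is available in SciPy; the toy 2-D "feature vectors" below are invented stand-ins for sequence-derived distances:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# four hypothetical points forming two tight groups
points = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])

# complete linkage merges clusters by their *maximum* pairwise distance,
# which resists the chaining behavior of single linkage
Z = linkage(pdist(points), method="complete")
labels = fcluster(Z, t=1.0, criterion="distance")  # cut the dendrogram at 1.0
```

For real rRNA data the condensed distance matrix would come from pairwise sequence dissimilarities rather than Euclidean coordinates.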

  12. Opportunities for multivariate analysis of open spatial datasets to characterize urban flooding risks

    Science.gov (United States)

    Gaitan, S.; ten Veldhuis, J. A. E.

    2015-06-01

    Cities worldwide are challenged by increasing urban flood risks. Precise and realistic measures are required to reduce flooding impacts. However, currently implemented sewer and topographic models do not provide realistic predictions of local flooding occurrence during heavy rain events. Assessing other factors, such as spatially distributed rainfall, socioeconomic characteristics, and social sensing, may help to explain the probability and impacts of urban flooding. Several spatial datasets have recently been made available in the Netherlands, including rainfall-related incident reports made by citizens, spatially distributed rain depths, semi-distributed socioeconomic information, and building age. The potential of these data to explain the occurrence of rainfall-related incidents has not yet been examined. Multivariate analysis tools for describing communities and environmental patterns have previously been developed and used in the field of ecology. The objective of this paper is to outline opportunities for these tools to explore urban flooding risk patterns in the aforementioned datasets. To that end, a cluster analysis is performed. Results indicate that the incidence of rainfall-related impacts is higher in areas characterized by older infrastructure and higher population density.
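A cluster analysis of standardized district-level variables can be sketched as follows; the variables and values are simulated, and a plain 2-means procedure stands in for whatever multivariate method the study used:

```python
import numpy as np

rng = np.random.default_rng(1)
# hypothetical per-district variables: [incident rate, infrastructure age (yr), population density]
districts = np.vstack([
    rng.normal([3.0, 60.0, 8000.0], [0.3, 5.0, 500.0], (20, 3)),  # older, denser areas
    rng.normal([0.5, 15.0, 2000.0], [0.2, 5.0, 300.0], (20, 3)),  # newer, sparser areas
])
z = (districts - districts.mean(0)) / districts.std(0)  # standardize each variable first

def two_means(points, n_iter=20):
    """Plain 2-means clustering with a deterministic start (first and last point)."""
    centers = np.array([points[0], points[-1]])
    for _ in range(n_iter):
        d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        centers = np.array([points[labels == k].mean(axis=0) for k in (0, 1)])
    return labels

labels = two_means(z)
```

Standardizing before clustering keeps the population-density column (in the thousands) from dominating the Euclidean distances.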

  13. Extracting Prior Distributions from a Large Dataset of In-Situ Measurements to Support SWOT-based Estimation of River Discharge

    Science.gov (United States)

    Hagemann, M.; Gleason, C. J.

    2017-12-01

    The upcoming (2021) Surface Water and Ocean Topography (SWOT) NASA satellite mission aims, in part, to estimate discharge on major rivers worldwide using reach-scale measurements of stream width, slope, and height. Current formalizations of channel and floodplain hydraulics are insufficient to fully constrain this problem mathematically, resulting in an infinitely large solution set for any set of satellite observations. Recent work has reformulated this problem in a Bayesian statistical setting, in which the likelihood distributions derive directly from hydraulic flow-law equations. When coupled with prior distributions on unknown flow-law parameters, this formulation probabilistically constrains the parameter space, and results in a computationally tractable description of discharge. Using a curated dataset of over 200,000 in-situ acoustic Doppler current profiler (ADCP) discharge measurements from over 10,000 USGS gaging stations throughout the United States, we developed empirical prior distributions for flow-law parameters that are not observable by SWOT, but that are required in order to estimate discharge. This analysis quantified prior uncertainties on quantities including cross-sectional area, at-a-station hydraulic geometry width exponent, and discharge variability, that are dependent on SWOT-observable variables including reach-scale statistics of width and height. When compared against discharge estimation approaches that do not use this prior information, the Bayesian approach using ADCP-derived priors demonstrated consistently improved performance across a range of performance metrics. This Bayesian approach formally transfers information from in-situ gaging stations to remote-sensed estimation of discharge, in which the desired quantities are not directly observable. Further investigation using large in-situ datasets is therefore a promising way forward in improving satellite-based estimates of river discharge.
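The idea of distilling a large in-situ archive into an empirical prior can be sketched as follows; here the "measurements" are simulated, and the lognormal form for cross-sectional area is an assumption of this example, not a claim about the paper's chosen families:

```python
import numpy as np
from scipy.stats import lognorm

rng = np.random.default_rng(3)
# hypothetical ADCP cross-sectional areas (m^2) pooled across many gaging stations
areas = rng.lognormal(mean=5.0, sigma=1.2, size=10000)

# fit a lognormal prior by the method of moments in log space
log_a = np.log(areas)
mu_hat, sigma_hat = log_a.mean(), log_a.std()

# package the fit as a frozen distribution usable as a Bayesian prior
prior = lognorm(s=sigma_hat, scale=np.exp(mu_hat))
```

In the Bayesian discharge inversion, a density like `prior.pdf` would multiply the flow-law likelihood for each candidate parameter value.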

  14. Data-driven decision support for radiologists: re-using the National Lung Screening Trial dataset for pulmonary nodule management.

    Science.gov (United States)

    Morrison, James J; Hostetter, Jason; Wang, Kenneth; Siegel, Eliot L

    2015-02-01

    Real-time mining of large research trial datasets enables development of case-based clinical decision support tools. Several applicable research datasets exist including the National Lung Screening Trial (NLST), a dataset unparalleled in size and scope for studying population-based lung cancer screening. Using these data, a clinical decision support tool was developed which matches patient demographics and lung nodule characteristics to a cohort of similar patients. The NLST dataset was converted into Structured Query Language (SQL) tables hosted on a web server, and a web-based JavaScript application was developed which performs real-time queries. JavaScript is used for both the server-side and client-side language, allowing for rapid development of a robust client interface and server-side data layer. Real-time data mining of user-specified patient cohorts achieved a rapid return of cohort cancer statistics and lung nodule distribution information. This system demonstrates the potential of individualized real-time data mining using large high-quality clinical trial datasets to drive evidence-based clinical decision-making.
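The server-side pattern of real-time cohort queries over SQL tables can be sketched with Python's built-in sqlite3; the schema, column names, and values below are invented stand-ins, not the actual NLST tables:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE participants (
    age INTEGER, pack_years REAL, nodule_size_mm REAL, cancer INTEGER)""")
conn.executemany("INSERT INTO participants VALUES (?, ?, ?, ?)", [
    (61, 35.0, 8.0, 0), (66, 40.0, 9.5, 1), (58, 30.0, 7.5, 0), (70, 55.0, 12.0, 1),
])

# cohort query: participants similar to a 63-year-old with an 8 mm nodule
n, cancer_rate = conn.execute("""
    SELECT COUNT(*), AVG(cancer) FROM participants
    WHERE age BETWEEN 58 AND 68 AND nodule_size_mm BETWEEN 6 AND 10
""").fetchone()
```

The production system described in the abstract issues equivalent queries from a JavaScript server layer and returns the cohort statistics to the browser client.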

  15. Karna Particle Size Dataset for Tables and Figures

    Data.gov (United States)

    U.S. Environmental Protection Agency — This dataset contains 1) table of bulk Pb-XAS LCF results, 2) table of bulk As-XAS LCF results, 3) figure data of particle size distribution, and 4) figure data for...

  16. SIFlore, a dataset of geographical distribution of vascular plants covering five centuries of knowledge in France: Results of a collaborative project coordinated by the Federation of the National Botanical Conservatories.

    Science.gov (United States)

    Just, Anaïs; Gourvil, Johan; Millet, Jérôme; Boullet, Vincent; Milon, Thomas; Mandon, Isabelle; Dutrève, Bruno

    2015-01-01

    More than 20 years ago, the French Muséum National d'Histoire Naturelle (MNHN, Secretariat of the Fauna and Flora) published the first part of an atlas of the flora of France at a 20-km spatial resolution, accounting for 645 taxa (Dupont 1990). Since then, at the national level, there has not been any work on this scale relating to flora distribution, despite the obvious need for a better understanding. In 2011, in response to this need, the Fédération des Conservatoires Botaniques Nationaux (FCBN, http://www.fcbn.fr) launched an ambitious collaborative project involving the eleven national botanical conservatories of France. The project aims to establish a formal procedure and standardized system for data hosting, aggregation and publication in four areas: flora, fungi, vegetation and habitats. In 2014, the first phase of the project led to the development of the national flora dataset SIFlore. As it includes about 21 million records of flora occurrences, this is currently the most comprehensive dataset on the distribution of vascular plants (Tracheophyta) in French territory. SIFlore contains occurrence information for about 15,454 plant taxa (indigenous and alien) in metropolitan France and Reunion Island, from 1545 until 2014. The data records were originally collated from inventories, checklists, literature and herbarium records. SIFlore was developed by assembling flora datasets from the regional to the national level. At the regional level, source records are managed by the national botanical conservatories, which are responsible for flora data collection and validation. In order to present our results, the Fédération des Conservatoires Botaniques Nationaux developed a geoportal that allows the SIFlore dataset to be publicly viewed. This portal is available at: http://siflore.fcbn.fr. As the FCBN belongs to the Information System for Nature and Landscapes (SINP), a governmental program, the dataset is also accessible through the websites of

  17. Passive Containment DataSet

    Science.gov (United States)

    This data is for Figures 6 and 7 in the journal article. The data also includes the two EPANET input files used for the analysis described in the paper, one for the looped system and one for the block system.This dataset is associated with the following publication:Grayman, W., R. Murray , and D. Savic. Redesign of Water Distribution Systems for Passive Containment of Contamination. JOURNAL OF THE AMERICAN WATER WORKS ASSOCIATION. American Water Works Association, Denver, CO, USA, 108(7): 381-391, (2016).

  18. Opportunities for multivariate analysis of open spatial datasets to characterize urban flooding risks

    Directory of Open Access Journals (Sweden)

    S. Gaitan

    2015-06-01

    Full Text Available Cities worldwide are challenged by increasing urban flood risks. Precise and realistic measures are required to reduce flooding impacts. However, currently implemented sewer and topographic models do not provide realistic predictions of local flooding occurrence during heavy rain events. Assessing other factors, such as spatially distributed rainfall, socioeconomic characteristics, and social sensing, may help to explain the probability and impacts of urban flooding. Several spatial datasets have recently been made available in the Netherlands, including rainfall-related incident reports made by citizens, spatially distributed rain depths, semi-distributed socioeconomic information, and building age. The potential of these data to explain the occurrence of rainfall-related incidents has not yet been examined. Multivariate analysis tools for describing communities and environmental patterns have previously been developed and used in the field of ecology. The objective of this paper is to outline opportunities for these tools to explore urban flooding risk patterns in the aforementioned datasets. To that end, a cluster analysis is performed. Results indicate that the incidence of rainfall-related impacts is higher in areas characterized by older infrastructure and higher population density.

  19. BAYESIAN MODELS FOR SPECIES DISTRIBUTION MODELLING WITH ONLY-PRESENCE RECORDS

    Directory of Open Access Journals (Sweden)

    Bartolo de Jesús Villar-Hernández

    2015-08-01

    Full Text Available One of the central issues in ecology is the study of the geographical distribution of species of flora and fauna through Species Distribution Models (SDM). Recently, scientific interest has focused on presence-only records. Two recent approaches have been proposed for this problem: a model based on the maximum likelihood method (Maxlike) and an inhomogeneous Poisson process model (IPP). In this paper we discuss two Bayesian approaches, called MaxBayes and IPPBayes, based on the Maxlike and IPP models, respectively. To illustrate these proposals, we implemented two study examples: (1) both models were implemented on a simulated dataset, and (2) we modeled the potential distribution of the genus Dalea in the Tehuacan-Cuicatlán biosphere reserve with both models, and the results were compared with those of Maxent. The results show that both models, MaxBayes and IPPBayes, are viable alternatives when species distributions are modeled with presence-only records. For the simulated dataset, MaxBayes achieved prevalence estimation even when the number of records was small. In the real dataset example, both models predict potential distributions similar to those of Maxent.
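The core computation of an IPP model can be sketched as a log-linear intensity whose integral over the study region is approximated at weighted quadrature points; the function below is a generic sketch, not the authors' MaxBayes/IPPBayes implementation:

```python
import numpy as np

def ipp_loglik(beta, X_pres, X_quad, area_weights):
    """Log-likelihood of an inhomogeneous Poisson process with
    intensity lambda(s) = exp(x(s) . beta): the sum of log-intensities
    at presence points minus the intensity integrated over the region,
    approximated here by weighted quadrature (background) points."""
    log_lam_pres = X_pres @ beta
    integral = np.sum(area_weights * np.exp(X_quad @ beta))
    return log_lam_pres.sum() - integral

# toy check: constant unit intensity over a region of total area 1
ll = ipp_loglik(np.array([0.0]), np.zeros((2, 1)), np.zeros((4, 1)), np.full(4, 0.25))
```

A Bayesian variant like IPPBayes adds a prior over `beta` and samples the posterior instead of maximizing this function.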

  20. A dataset of forest biomass structure for Eurasia.

    Science.gov (United States)

    Schepaschenko, Dmitry; Shvidenko, Anatoly; Usoltsev, Vladimir; Lakyda, Petro; Luo, Yunjian; Vasylyshyn, Roman; Lakyda, Ivan; Myklush, Yuriy; See, Linda; McCallum, Ian; Fritz, Steffen; Kraxner, Florian; Obersteiner, Michael

    2017-05-16

    The most comprehensive dataset of in situ destructive sampling measurements of forest biomass in Eurasia has been compiled from a combination of experiments undertaken by the authors and from scientific publications. Biomass is reported as four components: live trees (stem, bark, branches, foliage, roots); understory (above- and below-ground); green forest floor (above- and below-ground); and coarse woody debris (snags, logs, dead branches of living trees and dead roots). The data consist of 10,351 unique records of sample plots and 9,613 sample trees from ca. 1,200 experiments for the period 1930-2014, with some overlap between these two sets. The dataset also contains other forest stand parameters, such as tree species composition, average age, tree height and growing stock volume, when available. Such a dataset can be used for the development of models of biomass structure, biomass expansion factors, change detection in biomass structure, investigations into biodiversity, species distribution and the biodiversity-productivity relationship, as well as the assessment of the carbon pool and its dynamics, among many others.

  1. Process mining in oncology using the MIMIC-III dataset

    Science.gov (United States)

    Prima Kurniati, Angelina; Hall, Geoff; Hogg, David; Johnson, Owen

    2018-03-01

    Process mining is a data analytics approach to discover and analyse process models based on the real activities captured in information systems. There is a growing body of literature on process mining in healthcare, including oncology, the study of cancer. In earlier work we found 37 peer-reviewed papers describing process mining research in oncology, with a regular complaint being the limited availability and accessibility of datasets with suitable information for process mining. Publicly available datasets are one option, and this paper describes the potential to use MIMIC-III for process mining in oncology. MIMIC-III is a large open-access dataset of de-identified patient records. There are 134 publications listed as using the MIMIC dataset, but none of them have used process mining. The MIMIC-III dataset has 16 event tables which are potentially useful for process mining, and this paper demonstrates the opportunities to use MIMIC-III for process mining in oncology. Our research applied the L* lifecycle method to provide a worked example showing how process mining can be used to analyse cancer pathways. The results and data quality limitations are discussed, along with opportunities for further work and reflection on the value of MIMIC-III for reproducible process mining research.
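As a minimal illustration of what process mining over event tables involves, the sketch below builds directly-follows counts from a toy pathway log; the activity names are invented, and this is a far simpler step than the L* lifecycle method the paper applies:

```python
from collections import Counter

def directly_follows(event_log):
    """Count how often activity b directly follows activity a within a case --
    the basic relation from which many process-discovery algorithms start."""
    pairs = Counter()
    for case in event_log:
        for a, b in zip(case, case[1:]):
            pairs[(a, b)] += 1
    return pairs

# two hypothetical cancer-pathway cases, each an ordered list of activities
log = [["referral", "biopsy", "MDT", "surgery"],
       ["referral", "biopsy", "MDT", "chemo"]]
```

In MIMIC-III, each "case" would be assembled by ordering a patient's rows from the event tables by timestamp before counting transitions.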

  2. Assessing Potential Wind Energy Resources in Saudi Arabia with a Skew-t Distribution

    KAUST Repository

    Tagle, Felipe

    2017-03-13

    Facing increasing domestic energy consumption from population growth and industrialization, Saudi Arabia is aiming to reduce its reliance on fossil fuels and to broaden its energy mix by expanding investment in renewable energy sources, including wind energy. A preliminary task in the development of wind energy infrastructure is the assessment of wind energy potential, a key aspect of which is the characterization of its spatio-temporal behavior. In this study we examine the impact of internal climate variability on seasonal wind power density fluctuations using 30 simulations from the Large Ensemble Project (LENS) developed at the National Center for Atmospheric Research. Furthermore, a spatio-temporal model for daily wind speed is proposed with neighbor-based cross-temporal dependence, and a multivariate skew-t distribution to capture the spatial patterns of higher order moments. The model can be used to generate synthetic time series over the entire spatial domain that adequately reproduces the internal variability of the LENS dataset.
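The skew-t idea can be illustrated with a univariate Azzalini-type density, a Student-t pdf reweighted by a skewing cdf term; this is a sketch of the general construction, not the authors' multivariate spatio-temporal model:

```python
import numpy as np
from scipy.stats import t as student_t

def skew_t_pdf(x, alpha, nu):
    """Azzalini-type skew-t density with skewness alpha and nu degrees
    of freedom: f(x) = 2 t_nu(x) T_{nu+1}(alpha x sqrt((nu+1)/(nu+x^2))).
    alpha = 0 recovers the symmetric Student-t."""
    x = np.asarray(x, dtype=float)
    w = alpha * x * np.sqrt((nu + 1.0) / (nu + x**2))
    return 2.0 * student_t.pdf(x, df=nu) * student_t.cdf(w, df=nu + 1.0)
```

Positive `alpha` shifts probability mass to the right tail while `nu` keeps the heavy tails that Gaussian models miss, which is what makes the family attractive for wind-speed data.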

  3. Assessing Potential Wind Energy Resources in Saudi Arabia with a Skew-t Distribution

    KAUST Repository

    Tagle, Felipe; Castruccio, Stefano; Crippa, Paola; Genton, Marc G.

    2017-01-01

    Facing increasing domestic energy consumption from population growth and industrialization, Saudi Arabia is aiming to reduce its reliance on fossil fuels and to broaden its energy mix by expanding investment in renewable energy sources, including wind energy. A preliminary task in the development of wind energy infrastructure is the assessment of wind energy potential, a key aspect of which is the characterization of its spatio-temporal behavior. In this study we examine the impact of internal climate variability on seasonal wind power density fluctuations using 30 simulations from the Large Ensemble Project (LENS) developed at the National Center for Atmospheric Research. Furthermore, a spatio-temporal model for daily wind speed is proposed with neighbor-based cross-temporal dependence, and a multivariate skew-t distribution to capture the spatial patterns of higher order moments. The model can be used to generate synthetic time series over the entire spatial domain that adequately reproduces the internal variability of the LENS dataset.

  4. An innovative privacy preserving technique for incremental datasets on cloud computing.

    Science.gov (United States)

    Aldeen, Yousra Abdul Alsahib S; Salleh, Mazleena; Aljeroudi, Yazan

    2016-08-01

    Cloud computing (CC) is a service-based delivery model with gigantic computer processing power and data storage across connected communication channels. It has given an overwhelming technological impetus to the internet-mediated IT industry, where users can easily share private data for further analysis and mining. Furthermore, user-friendly CC services enable sundry applications to be deployed economically. Meanwhile, simple data sharing has impelled various phishing attacks and malware-assisted security threats. Some privacy-sensitive applications, like health services on the cloud, that are built with several economic and operational benefits necessitate enhanced security. Thus, absolute cyberspace security and mitigation against phishing attacks became mandatory to protect overall data privacy. Typically, diverse application datasets are anonymized with better privacy for their owners, without providing all secrecy requirements for newly added records. Some proposed techniques have addressed this issue by re-anonymizing the datasets from scratch. The utmost privacy protection over incremental datasets on CC is far from being achieved. Certainly, the distribution of huge dataset volumes across multiple storage nodes limits privacy preservation. In this view, we propose a new anonymization technique to attain better privacy protection with high data utility over distributed and incremental datasets on CC. The proficiency of data privacy preservation and improved confidentiality requirements is demonstrated through performance evaluation. Copyright © 2016 Elsevier Inc. All rights reserved.
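A single generalization step of the kind used in anonymization pipelines can be sketched as follows; the field names and bin width are illustrative, and this is a textbook building block, not the technique proposed in the paper:

```python
def generalize_ages(records, bin_width=10):
    """Generalize the quasi-identifier 'age' into coarse bins so that
    individual records become less distinguishable before release."""
    out = []
    for rec in records:
        lo = (rec["age"] // bin_width) * bin_width
        anon = dict(rec)  # copy, leaving the original record untouched
        anon["age"] = f"{lo}-{lo + bin_width - 1}"
        out.append(anon)
    return out
```

The challenge the paper addresses is doing this kind of transformation incrementally, so that newly added records receive the same protection without re-anonymizing the entire distributed dataset.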

  5. Blood vessel-based liver segmentation through the portal phase of a CT dataset

    Science.gov (United States)

    Maklad, Ahmed S.; Matsuhiro, Mikio; Suzuki, Hidenobu; Kawata, Yoshiki; Niki, Noboru; Moriyama, Noriyuki; Utsunomiya, Toru; Shimada, Mitsuo

    2013-02-01

    Blood vessels are dispersed throughout the human body organs and carry unique information for each person. This information can be used to delineate organ boundaries. The proposed method relies on abdominal blood vessels (ABV) to segment the liver through the portal phase of a CT dataset, accounting for the potential presence of tumors. ABV are extracted and classified into hepatic (HBV) and nonhepatic (non-HBV) with a small number of interactions. HBV and non-HBV then guide an automatic segmentation of the liver: HBV are used to segment the core region of the liver, and this region together with non-HBV is used to construct a boundary surface separating the liver from the other organs. The core region is classified, based on posterior distributions extracted from its histogram, into low intensity tumor (LIT) and non-LIT core regions; the non-LIT region includes the normal part of the liver, HBV, and any high intensity tumors. Each core region is extended based on its corresponding posterior distribution, and extension is complete when it reaches either a variation in intensity or the constructed boundary surface. The method was applied to 80 datasets (30 Medical Image Computing and Computer Assisted Intervention (MICCAI) and 50 non-MICCAI datasets), including 60 datasets with tumors. Our results for the MICCAI test data were evaluated by SLIVER07 [1] with an overall score of 79.7, the seventh best on the site (December 2013). The approach appears promising for extracting liver volumetry across various liver shapes and sizes and in the presence of low intensity hepatic tumors.

  6. Potential geographic distribution and conservation of Audubon's Shearwater, Puffinus lherminieri in Brazil

    Directory of Open Access Journals (Sweden)

    Ana Cecília P.A. Lopes

    2014-01-01

    Full Text Available Audubon's Shearwater (Puffinus lherminieri Lesson, 1839) is a tropical seabird occurring mainly between southern Canada and the southeast coast of Brazil. Puffinus lherminieri is considered Critically Endangered on the Brazilian Red List because it only occurs in two known localities, both of which contain very small populations. However, many offshore islands along the Brazilian coast are poorly known, and the discovery of new colonies would be of considerable significance for the conservation of this species. The aim of this study was to estimate the potential geographic distribution of Audubon's Shearwater in Brazil, based on an ecological niche model (ENM) built with the Maxent algorithm and environmental layers from the AquaMaps dataset. The ENM was based on 37 records of breeding areas in North and South America. The model yielded a very broad potential distribution, covering most of the Atlantic coast from Brazil to the US. When filtered for islands along the Brazilian coast, the model indicates higher levels of environmental suitability near the states of São Paulo, Rio de Janeiro, Espírito Santo and Bahia. However, P. lherminieri prefers islands in environments with warm saline water. Thus, based on the influence of the currents that act on the Brazilian coast, we infer that undiscovered colonies are most likely to occur on islands off the coasts of Bahia, Espírito Santo and the extreme north of Rio de Janeiro; these should be intensively surveyed, while islands south of Cabo Frio can be excluded. The existence of new populations would have profound effects on the conservation status of this enigmatic and rarely seen seabird.

  7. Potential distribution of a nonuniformly charged ellipsoid

    International Nuclear Information System (INIS)

    Kiwamoto, Y.; Aoki, J.; Soga, Y.

    2004-01-01

    A convenient formula is obtained for fast calculation of the three-dimensional potential distribution associated with a spatially varying charge-density distribution by reconstructing it as a superposed set of nested spheroidal shells. It is useful for experimental analyses of near-equilibrium states of non-neutral plasmas and for quick evaluation of the gravity field associated with stellar mass distributions.
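
The superposition idea can be illustrated with the simpler spherical case, where each uniformly charged shell contributes q/r outside its own radius and a constant q/R inside (Gaussian units; the paper's formula handles spheroidal shells):

```python
def potential_nested_shells(r, shells):
    """Potential at radius r from nested, uniformly charged spherical
    shells given as [(radius, charge), ...]. Each shell contributes
    q / max(r, R): a point-charge term outside itself, constant inside."""
    return sum(q / max(r, R) for R, q in shells)

# Three nested shells with mixed-sign charge (illustrative values).
shells = [(1.0, 2.0), (2.0, -1.0), (3.0, 0.5)]
v_outside = potential_nested_shells(5.0, shells)  # total charge 1.5 acts as a point
v_center = potential_nested_shells(1e-9, shells)  # each shell contributes q/R
```

Outside all shells the potential reduces to (total charge)/r, while near the center each shell contributes its constant interior value; summing over many thin shells approximates a continuous density profile.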

  8. Comparison of global 3-D aviation emissions datasets

    Directory of Open Access Journals (Sweden)

    S. C. Olsen

    2013-01-01

    Full Text Available Aviation emissions are unique among transportation emissions, e.g., from road transportation and shipping, in that they occur at higher altitudes as well as at the surface. Aviation emissions of carbon dioxide, soot, and water vapor have direct radiative impacts on the Earth's climate system, while emissions of nitrogen oxides (NOx), sulfur oxides, carbon monoxide (CO), and hydrocarbons (HC) impact air quality and climate through their effects on ozone, methane, and clouds. The most accurate estimates of the impact of aviation on air quality and climate utilize three-dimensional chemistry-climate models and gridded four-dimensional (space and time) aviation emissions datasets. We compare five available aviation emissions datasets currently and historically used to evaluate the impact of aviation on climate and air quality: NASA-Boeing 1992, NASA-Boeing 1999, QUANTIFY 2000, Aero2k 2002, and AEDT 2006, as well as aviation fuel usage estimates from the International Energy Agency. Roughly 90% of all aviation emissions are in the Northern Hemisphere, and nearly 60% of all fuelburn and NOx emissions occur at cruise altitudes in the Northern Hemisphere. While these datasets were created by independent methods and are thus not strictly suitable for analyzing trends, they suggest that commercial aviation fuelburn and NOx emissions increased over the last two decades while HC emissions likely decreased and CO emissions did not change significantly. The bottom-up estimates compared here are consistently lower than International Energy Agency fuelburn statistics, although the gap is significantly smaller in the more recent datasets. Overall the emissions distributions are quite similar for fuelburn and NOx, with regional peaks over the populated land masses of North America, Europe, and East Asia. For CO and HC there are relatively larger differences. There are, however, some distinct differences in the altitude distribution

  9. Homogenised Australian climate datasets used for climate change monitoring

    International Nuclear Information System (INIS)

    Trewin, Blair; Jones, David; Collins, Dean; Jovanovic, Branislava; Braganza, Karl

    2007-01-01

    Full text: The Australian Bureau of Meteorology has developed a number of datasets for use in climate change monitoring. These datasets typically cover 50-200 stations distributed as evenly as possible over the Australian continent, and have been subject to detailed quality control and homogenisation. The time period over which data are available for each element is largely determined by the availability of data in digital form. Whilst nearly all Australian monthly and daily precipitation data have been digitised, a significant quantity of pre-1957 data (for temperature and evaporation) or pre-1987 data (for some other elements) remains to be digitised, and is not currently available for use in the climate change monitoring datasets. In the case of temperature and evaporation, the start date of the datasets is also determined by major changes in instruments or observing practices for which no adjustment is feasible at the present time. The datasets currently available cover: Monthly and daily precipitation (most stations commence 1915 or earlier, with many extending back to the late 19th century, and a few to the mid-19th century); Annual temperature (commences 1910); Daily temperature (commences 1910, with limited station coverage pre-1957); Twice-daily dewpoint/relative humidity (commences 1957); Monthly pan evaporation (commences 1970); Cloud amount (commences 1957) (Jovanovic et al. 2007). As well as the station-based datasets listed above, an additional dataset being developed for use in climate change monitoring (and other applications) covers tropical cyclones in the Australian region. This is described in more detail in Trewin (2007). The datasets already developed are used in analyses of observed climate change, which are available through the Australian Bureau of Meteorology website (http://www.bom.gov.au/silo/products/cli_chg/). They are also used as a basis for routine climate monitoring, and in the datasets used for the development of seasonal

  10. A high-resolution European dataset for hydrologic modeling

    Science.gov (United States)

    Ntegeka, Victor; Salamon, Peter; Gomes, Goncalo; Sint, Hadewij; Lorini, Valerio; Thielen, Jutta

    2013-04-01

    There is an increasing demand for large scale hydrological models not only in the field of modeling the impact of climate change on water resources but also for disaster risk assessments and flood or drought early warning systems. These large scale models need to be calibrated and verified against large amounts of observations in order to judge their capabilities to predict the future. However, the creation of large scale datasets is challenging because it requires collection, harmonization, and quality checking of large amounts of observations. For this reason, only a limited number of such datasets exist. In this work, we present a pan European, high-resolution gridded dataset of meteorological observations (EFAS-Meteo) which was designed with the aim to drive a large scale hydrological model. Similar European and global gridded datasets already exist, such as the HadGHCND (Caesar et al., 2006), the JRC MARS-STAT database (van der Goot and Orlandi, 2003) and the E-OBS gridded dataset (Haylock et al., 2008). However, none of those provide similarly high spatial resolution and/or a complete set of variables to force a hydrologic model. EFAS-Meteo contains daily maps of precipitation, surface temperature (mean, minimum and maximum), wind speed and vapour pressure at a spatial grid resolution of 5 x 5 km for the time period 1 January 1990 - 31 December 2011. It furthermore contains radiation, calculated using a staggered approach depending on the availability of sunshine duration, cloud cover and minimum and maximum temperature, as well as evapotranspiration (potential, bare soil and open water evapotranspiration). The potential evapotranspiration was calculated using the Penman-Monteith equation with the above-mentioned meteorological variables. The dataset was created as part of the development of the European Flood Awareness System (EFAS) and has been continuously updated in recent years.
The dataset variables are used as
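
The Penman-Monteith calculation of potential evapotranspiration can be sketched with the FAO-56 reference form (a simplified sketch with illustrative inputs; the exact EFAS-Meteo implementation may differ):

```python
import math

def fao56_reference_et(t_mean, rn, u2, ea, pressure_kpa=101.3, g=0.0):
    """FAO-56 Penman-Monteith reference evapotranspiration (mm day^-1).
    t_mean: air temperature (deg C), rn: net radiation (MJ m^-2 day^-1),
    u2: wind speed at 2 m (m s^-1), ea: actual vapour pressure (kPa),
    g: soil heat flux (MJ m^-2 day^-1, ~0 at daily time steps)."""
    es = 0.6108 * math.exp(17.27 * t_mean / (t_mean + 237.3))  # saturation vapour pressure
    delta = 4098.0 * es / (t_mean + 237.3) ** 2                # slope of the es curve
    gamma = 0.000665 * pressure_kpa                            # psychrometric constant
    num = (0.408 * delta * (rn - g)
           + gamma * 900.0 / (t_mean + 273.0) * u2 * (es - ea))
    return num / (delta + gamma * (1.0 + 0.34 * u2))

# A plausible mid-latitude summer day (illustrative values).
et0 = fao56_reference_et(t_mean=20.0, rn=14.0, u2=2.0, ea=1.4)
```

For these inputs the formula gives a reference evapotranspiration of roughly 4-5 mm per day, a typical magnitude for a warm, moderately windy day.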

  11. Comparison of analyses of the QTLMAS XII common dataset

    DEFF Research Database (Denmark)

    Lund, Mogens Sandø; Sahana, Goutam; de Koning, Dirk-Jan

    2009-01-01

    A dataset was simulated and distributed to participants of the QTLMAS XII workshop who were invited to develop genomic selection models. Each contributing group was asked to describe the model development and validation as well as to submit genomic predictions for three generations of individuals...

  12. Geochemical Fingerprinting of Coltan Ores by Machine Learning on Uneven Datasets

    International Nuclear Information System (INIS)

    Savu-Krohn, Christian; Rantitsch, Gerd; Auer, Peter; Melcher, Frank; Graupner, Torsten

    2011-01-01

    Two modern machine learning techniques, Linear Programming Boosting (LPBoost) and Support Vector Machines (SVMs), are introduced and applied to a geochemical dataset of niobium–tantalum (“coltan”) ores from Central Africa to demonstrate how such information may be used to distinguish ore provenance, i.e., place of origin. The compositional data used include uni- and multivariate outliers and elemental distributions are not described by parametric frequency distribution functions. The “soft margin” techniques of LPBoost and SVMs can be applied to such data. Optimization of their learning parameters results in an average accuracy of up to c. 92%, if spot measurements are assessed to estimate the provenance of ore samples originating from two geographically defined source areas. A parameterized performance measure, together with common methods for its optimization, was evaluated to account for the presence of uneven datasets. Optimization of the classification function threshold improves the performance, as class importance is shifted towards one of those classes. For this dataset, the average performance of the SVMs is significantly better compared to that of LPBoost.
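
The threshold-optimization step described above can be illustrated on toy classifier scores (a minimal sketch of shifting the decision cut-off to balance per-class performance on uneven data, not the authors' LPBoost/SVM pipeline):

```python
def best_threshold(scores, labels):
    """Pick the decision threshold on classifier scores that maximizes
    balanced accuracy (mean of per-class recalls), mitigating the bias
    a fixed cut-off has when one class dominates the dataset."""
    def balanced_acc(t):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= t)
        tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < t)
        pos = labels.count(1) or 1
        neg = labels.count(0) or 1
        return 0.5 * (tp / pos + tn / neg)
    return max(sorted(set(scores)), key=balanced_acc)

# Toy uneven data: six negatives with low scores, two positives with high scores.
scores = [0.1, 0.2, 0.25, 0.3, 0.35, 0.4, 0.8, 0.9]
labels = [0, 0, 0, 0, 0, 0, 1, 1]
t = best_threshold(scores, labels)
```

On this toy example the optimal threshold sits just below the positive-class scores, classifying both classes perfectly, whereas a majority-driven default would favor the dominant negative class.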

  13. The effects of spatial population dataset choice on estimates of population at risk of disease

    Directory of Open Access Journals (Sweden)

    Gething Peter W

    2011-02-01

    Full Text Available Abstract Background The spatial modeling of infectious disease distributions and dynamics is increasingly being undertaken for health services planning and disease control monitoring, implementation, and evaluation. Where risks are heterogeneous in space or dependent on person-to-person transmission, spatial data on human population distributions are required to estimate infectious disease risks, burdens, and dynamics. Several different modeled human population distribution datasets are available and widely used, but the disparities among them and the implications for enumerating disease burdens and populations at risk have not been considered systematically. Here, we quantify some of these effects using global estimates of populations at risk (PAR) of P. falciparum malaria as an example. Methods The recent construction of a global map of P. falciparum malaria endemicity enabled the testing of different gridded population datasets for providing estimates of PAR by endemicity class. The estimated population numbers within each class were calculated for each country using four different global gridded human population datasets: GRUMP (~1 km spatial resolution), LandScan (~1 km), UNEP Global Population Databases (~5 km), and GPW3 (~5 km). More detailed assessments of PAR variation and accuracy were conducted for three African countries where census data were available at a higher administrative-unit level than used by any of the four gridded population datasets. Results The estimates of PAR based on the datasets varied by more than 10 million people for some countries, even accounting for the fact that estimates of population totals made by different agencies are used to correct national totals in these datasets and can vary by more than 5% for many low-income countries. In many cases, these variations in PAR estimates comprised more than 10% of the total national population.
The detailed country-level assessments suggested that none of the datasets was
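
The core PAR calculation, summing a gridded population product within each endemicity class, can be sketched as follows (toy 2x3 grids with hypothetical values; the study used ~1-5 km global rasters):

```python
def population_at_risk(population_grid, endemicity_grid):
    """Sum a gridded population dataset within each endemicity class,
    giving population at risk (PAR) per class. Both grids are aligned
    2-D lists of equal shape."""
    par = {}
    for pop_row, end_row in zip(population_grid, endemicity_grid):
        for pop, cls in zip(pop_row, end_row):
            par[cls] = par.get(cls, 0) + pop
    return par

# Two hypothetical population products disagree over the same endemicity map.
endemicity = [["low", "low", "high"], ["low", "high", "high"]]
grump_like = [[100, 200, 50], [300, 80, 70]]
landscan_like = [[120, 180, 90], [250, 110, 60]]
par_a = population_at_risk(grump_like, endemicity)
par_b = population_at_risk(landscan_like, endemicity)
```

Even with identical national totals, the two products place people in different cells and so yield different PAR per endemicity class, which is exactly the sensitivity the study quantifies.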

  14. An assessment of differences in gridded precipitation datasets in complex terrain

    Science.gov (United States)

    Henn, Brian; Newman, Andrew J.; Livneh, Ben; Daly, Christopher; Lundquist, Jessica D.

    2018-01-01

    Hydrologic modeling and other geophysical applications are sensitive to precipitation forcing data quality, and there are known challenges in spatially distributing gauge-based precipitation over complex terrain. We conduct a comparison of six high-resolution, daily and monthly gridded precipitation datasets over the Western United States. We compare the long-term average spatial patterns, and interannual variability of water-year total precipitation, as well as multi-year trends in precipitation across the datasets. We find that the greatest absolute differences among datasets occur in high-elevation areas and in the maritime mountain ranges of the Western United States, while the greatest percent differences among datasets relative to annual total precipitation occur in arid and rain-shadowed areas. Differences between datasets in some high-elevation areas exceed 200 mm yr^-1 on average, and relative differences range from 5 to 60% across the Western United States. In areas of high topographic relief, true uncertainties and biases are likely higher than the differences among the datasets; we present evidence of this based on streamflow observations. Precipitation trends in the datasets differ in magnitude and sign at smaller scales, and are sensitive to how temporal inhomogeneities in the underlying precipitation gauge data are handled.

  15. Two-dimensional potential and charge distributions of positive surface streamer

    International Nuclear Information System (INIS)

    Tanaka, Daiki; Matsuoka, Shigeyasu; Kumada, Akiko; Hidaka, Kunihiko

    2009-01-01

    Information on the potential and the field profile along a surface discharge is required for quantitatively discussing and clarifying the propagation mechanism. The sensing technique with a Pockels crystal has been developed for directly measuring the potential and electric field distribution on a dielectric material. In this paper, the Pockels sensing system consists of a pulse laser and a CCD camera for measuring the instantaneous two-dimensional potential distribution on a 25.4 mm square area with a 50 μm sampling pitch. The temporal resolution is 3.2 ns, which is determined by the pulse width of the laser emission. The transient change in the potential distribution of a positive surface streamer propagating in atmospheric air is measured with this system. The electric field and the charge distributions are also calculated from the measured potential profile. The propagating direction component of the electric field near the tip of the propagating streamer reaches 3 kV mm^-1. When the streamer stops, the potential distribution along a streamer forms an almost linear profile with the distance from the electrode, and its gradient is about 0.5 kV mm^-1.

  16. Distributed computing strategies for processing of FT-ICR MS imaging datasets for continuous mode data visualization

    Energy Technology Data Exchange (ETDEWEB)

    Smith, Donald F.; Schulz, Carl; Konijnenburg, Marco; Kilic, Mehmet; Heeren, Ronald M.

    2015-03-01

    High-resolution Fourier transform ion cyclotron resonance (FT-ICR) mass spectrometry imaging enables the spatial mapping and identification of biomolecules from complex surfaces. The need for long time-domain transients, and thus large raw file sizes, results in a large amount of raw data (“big data”) that must be processed efficiently and rapidly. This can be compounded by large-area imaging and/or high spatial resolution imaging. For FT-ICR, data processing and data reduction must not compromise the high mass resolution afforded by the mass spectrometer. The continuous mode “Mosaic Datacube” approach allows high mass resolution visualization (0.001 Da) of mass spectrometry imaging data, but requires additional processing as compared to feature-based processing. We describe the use of distributed computing for processing of FT-ICR MS imaging datasets with generation of continuous mode Mosaic Datacubes for high mass resolution visualization. An eight-fold improvement in processing time is demonstrated using a Dutch nationally available cloud service.

  17. The Path from Large Earth Science Datasets to Information

    Science.gov (United States)

    Vicente, G. A.

    2013-12-01

    The NASA Goddard Earth Sciences Data (GES) and Information Services Center (DISC) is one of the major Science Mission Directorate (SMD) data centers for the archiving and distribution of Earth Science remote sensing data, products and services. This virtual portal provides convenient access to Atmospheric Composition and Dynamics, Hydrology, Precipitation, Ozone, and model derived datasets (generated by GSFC's Global Modeling and Assimilation Office), the North American Land Data Assimilation System (NLDAS) and the Global Land Data Assimilation System (GLDAS) data products (both generated by GSFC's Hydrological Sciences Branch). This presentation demonstrates various tools and computational technologies developed in the GES DISC to manage the huge volume of data and products acquired from various missions and programs over the years. It explores approaches to archive, document, distribute, access and analyze Earth Science data and information, and addresses the technical and scientific issues, governance and user support problems faced by scientists in need of multi-disciplinary datasets. It also discusses data and product metrics, user distribution profiles and lessons learned through interactions with the science communities around the world. Finally, it demonstrates some of the most used data and product visualization and analysis tools developed and maintained by the GES DISC.

  18. Cross-Cultural Concept Mapping of Standardized Datasets

    DEFF Research Database (Denmark)

    Kano Glückstad, Fumiko

    2012-01-01

    This work compares four feature-based similarity measures derived from cognitive sciences. The purpose of the comparative analysis is to verify the potentially most effective model that can be applied for mapping independent ontologies in a culturally influenced domain [1]. Here, datasets based...

  19. Application of global datasets for hydrological modelling of a remote, snowmelt driven catchment in the Canadian Sub-Arctic

    Science.gov (United States)

    Casson, David; Werner, Micha; Weerts, Albrecht; Schellekens, Jaap; Solomatine, Dimitri

    2017-04-01

    Hydrological modelling in the Canadian Sub-Arctic is hindered by the limited spatial and temporal coverage of local meteorological data. Local watershed modelling often relies on data from a sparse network of meteorological stations with a rough density of 3 active stations per 100,000 km2. Global datasets hold great promise for application due to more comprehensive spatial and extended temporal coverage. A key objective of this study is to demonstrate the application of global datasets and data assimilation techniques for hydrological modelling of a data sparse, Sub-Arctic watershed. Application of available datasets and modelling techniques is currently limited in practice due to a lack of local capacity and understanding of available tools. Due to the importance of snow processes in the region, this study also aims to evaluate the performance of global SWE products for snowpack modelling. The Snare Watershed is a 13,300 km2 snowmelt driven sub-basin of the Mackenzie River Basin, Northwest Territories, Canada. The Snare watershed is data sparse in terms of meteorological data, but is well gauged with consistent discharge records since the late 1970s. End of winter snowpack surveys have been conducted every year from 1978 to the present. The application of global re-analysis datasets from the EU FP7 eartH2Observe project is investigated in this study. Precipitation data are taken from Multi-Source Weighted-Ensemble Precipitation (MSWEP) and temperature data from the WATCH Forcing Data applied to European Reanalysis (ERA)-Interim data (WFDEI). GlobSnow-2 is a global Snow Water Equivalent (SWE) measurement product funded by the European Space Agency (ESA) and is also evaluated over the local watershed. Downscaled precipitation, temperature and potential evaporation datasets are used as forcing data in a distributed version of the HBV model implemented in the WFLOW framework. Results demonstrate the successful application of global datasets in local watershed modelling, but

  20. EPA Nanorelease Dataset

    Data.gov (United States)

    U.S. Environmental Protection Agency — EPA Nanorelease Dataset. This dataset is associated with the following publication: Wohlleben, W., C. Kingston, J. Carter, E. Sahle-Demessie, S. Vazquez-Campos, B....

  1. Comparison of analyses of the QTLMAS XII common dataset

    DEFF Research Database (Denmark)

    Crooks, Lucy; Sahana, Goutam; de Koning, Dirk-Jan

    2009-01-01

    As part of the QTLMAS XII workshop, a simulated dataset was distributed and participants were invited to submit analyses of the data based on genome-wide association, fine mapping and genomic selection. We have evaluated the findings from the groups that reported fine mapping and genome-wide asso...

  2. Climatic Analysis of Oceanic Water Vapor Transports Based on Satellite E-P Datasets

    Science.gov (United States)

    Smith, Eric A.; Sohn, Byung-Ju; Mehta, Vikram

    2004-01-01

    Understanding the climatically varying properties of water vapor transports from a robust observational perspective is an essential step in calibrating climate models. This is tantamount to measuring year-to-year changes of monthly- or seasonally-averaged, divergent water vapor transport distributions. This cannot be done effectively with conventional radiosonde data over ocean regions where sounding data are generally sparse. This talk describes how a methodology designed to derive atmospheric water vapor transports over the world oceans from satellite-retrieved precipitation (P) and evaporation (E) datasets circumvents the problem of inadequate sampling. Ultimately, the method is intended to take advantage of the relatively complete and consistent coverage, as well as continuity in sampling, associated with E and P datasets obtained from satellite measurements. Independent P and E retrievals from Special Sensor Microwave Imager (SSM/I) measurements, along with P retrievals from Tropical Rainfall Measuring Mission (TRMM) measurements, are used to obtain transports by solving a potential function for the divergence of water vapor transport as balanced by large scale E - P conditions.
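
Solving for the potential function of the divergent transport amounts to a Poisson problem, laplacian(chi) = E - P, with the divergent water vapor transport given by the gradient of chi. A minimal Jacobi-iteration sketch (illustrative grid size and zero boundary conditions, not the authors' spherical-domain solver):

```python
def solve_potential(source, iters=2000):
    """Jacobi iteration for laplacian(chi) = source on a unit-spaced
    grid with chi = 0 on the boundary. Here source plays the role of
    the E - P balance; grad(chi) gives the divergent transport."""
    n, m = len(source), len(source[0])
    chi = [[0.0] * m for _ in range(n)]
    for _ in range(iters):
        new = [[0.0] * m for _ in range(n)]
        for i in range(1, n - 1):
            for j in range(1, m - 1):
                new[i][j] = 0.25 * (chi[i - 1][j] + chi[i + 1][j]
                                    + chi[i][j - 1] + chi[i][j + 1]
                                    - source[i][j])
        chi = new
    return chi

# Toy 5x5 E-P field: net evaporation concentrated in the centre cell.
src = [[0.0] * 5 for _ in range(5)]
src[2][2] = 1.0
chi = solve_potential(src)
```

At convergence the discrete Laplacian of chi reproduces the E - P source, so the gradient field "exports" moisture away from the net-evaporation cell, which is the transport pattern the method recovers from satellite E and P alone.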

  3. Improving AfriPop dataset with settlement extents extracted from RapidEye for the border region comprising South-Africa, Swaziland and Mozambique

    Directory of Open Access Journals (Sweden)

    Julie Deleu

    2015-11-01

    Full Text Available For modelling the spatial distribution of malaria incidence, accurate and detailed information on population size and distribution is of great importance. Several standard global spatial datasets of population distribution have been developed and are widely used. However, most of them are not up-to-date, and the low spatial resolution of the input census data limits contemporary, national-scale analyses. The AfriPop project, launched in July 2009, was initiated with the aim of producing detailed, contemporary and easily updatable population distribution datasets for the whole of Africa. High-resolution satellite sensors can help to further improve this dataset through the generation of settlement layers in greater spatial detail. In the present study, the settlement extents included in the MALAREO land use classification were used to generate an enhanced and updated version of the AfriPop dataset for the study area covering southern Mozambique, eastern Swaziland and the malarious part of KwaZulu-Natal in South Africa. Results show that it is possible to easily produce a detailed and updated population distribution dataset applying the AfriPop modelling approach with the use of high-resolution settlement layers and population growth rates. The 2007 and 2011 population datasets are freely available as a product of the MALAREO project and can be downloaded from the project website.
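
The refinement step, spreading an administrative-unit census total over grid cells in proportion to detected settlement, can be sketched as follows (hypothetical numbers; the AfriPop model also weights by land cover class and growth rates):

```python
def redistribute_population(admin_total, settlement_fraction):
    """Dasymetric sketch: an administrative-unit population total is
    spread over grid cells in proportion to the settlement area detected
    in each cell; cells without settlement receive no population."""
    total_settled = sum(settlement_fraction) or 1.0
    return [admin_total * f / total_settled for f in settlement_fraction]

# Toy unit of 4 cells with satellite-derived settlement fractions.
cells = redistribute_population(1000, [0.0, 0.2, 0.3, 0.5])
```

The unit total is preserved exactly, but people are concentrated where settlement is observed, which is what makes the high-resolution settlement layer improve the population surface.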

  4. Would the ‘real’ observed dataset stand up? A critical examination of eight observed gridded climate datasets for China

    International Nuclear Information System (INIS)

    Sun, Qiaohong; Miao, Chiyuan; Duan, Qingyun; Kong, Dongxian; Ye, Aizhong; Di, Zhenhua; Gong, Wei

    2014-01-01

    This research compared and evaluated the spatio-temporal similarities and differences of eight widely used gridded datasets. The datasets include daily precipitation over East Asia (EA), the Climatic Research Unit (CRU) product, the Global Precipitation Climatology Centre (GPCC) product, the University of Delaware (UDEL) product, Precipitation Reconstruction over Land (PREC/L), the Asian Precipitation Highly Resolved Observational (APHRO) product, the Institute of Atmospheric Physics (IAP) dataset from the Chinese Academy of Sciences, and the National Meteorological Information Center dataset from the China Meteorological Administration (CN05). The meteorological variables focus on surface air temperature (SAT) or precipitation (PR) in China. All datasets presented general agreement on the whole spatio-temporal scale, but some differences appeared for specific periods and regions. On a temporal scale, EA shows the highest amount of PR, while APHRO shows the lowest. CRU and UDEL show higher SAT than IAP or CN05. On a spatial scale, the most significant differences occur in western China for PR and SAT. For PR, the difference between EA and CRU is the largest. When compared with CN05, CRU shows higher SAT in the central and southern Northwest river drainage basin, UDEL exhibits higher SAT over the Southwest river drainage system, and IAP has lower SAT in the Tibetan Plateau. The differences in annual mean PR and SAT primarily come from summer and winter, respectively. Finally, potential factors impacting agreement among gridded climate datasets are discussed, including raw data sources, quality control (QC) schemes, orographic correction, and interpolation techniques. The implications and challenges of these results for climate research are also briefly addressed. (paper)

  5. Interpolation of diffusion weighted imaging datasets

    DEFF Research Database (Denmark)

    Dyrby, Tim B; Lundell, Henrik; Burke, Mark W

    2014-01-01

    Diffusion weighted imaging (DWI) is used to study white-matter fibre organisation, orientation and structural connectivity by means of fibre reconstruction algorithms and tractography. For clinical settings, limited scan time compromises the possibilities to achieve the high image resolution needed for finer anatomical details and the signal-to-noise ratio needed for reliable fibre reconstruction. We assessed the potential benefits of interpolating DWI datasets to a higher image resolution before fibre reconstruction using a diffusion tensor model. Simulations of straight and curved crossing tracts smaller than or equal ... interpolation methods fail to disentangle fine anatomical details if PVE is too pronounced in the original data. For validation we used ex-vivo DWI datasets acquired at various image resolutions as well as Nissl-stained sections. Increasing the image resolution by a factor of eight yielded finer geometrical...

  6. Decoys Selection in Benchmarking Datasets: Overview and Perspectives

    Science.gov (United States)

    Réau, Manon; Langenfeld, Florent; Zagury, Jean-François; Lagarde, Nathalie; Montes, Matthieu

    2018-01-01

    Virtual Screening (VS) is designed to prospectively help identify potential hits, i.e., compounds capable of interacting with a given target and potentially modulating its activity, out of large compound collections. Among the variety of methodologies, it is crucial to select the protocol that is the most adapted to the query/target system under study and that yields the most reliable output. To this aim, the performance of VS methods is commonly evaluated and compared by computing their ability to retrieve active compounds in benchmarking datasets. The benchmarking datasets contain a subset of known active compounds together with a subset of decoys, i.e., assumed non-active molecules. The composition of both the active and the decoy compounds subsets is critical to limit the biases in the evaluation of the VS methods. In this review, we focus on the selection of decoy compounds, which has considerably changed over the years, from randomly selected compounds to highly customized or experimentally validated negative compounds. We first outline the evolution of decoys selection in benchmarking databases as well as current benchmarking databases that tend to minimize the introduction of biases, and secondly, we propose recommendations for the selection and the design of benchmarking datasets. PMID:29416509
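
Property-matched decoy selection of the kind the review describes can be sketched with a single property (a minimal sketch matching on molecular weight only; values, tolerance, and counts are illustrative, and real protocols match several physicochemical properties at once):

```python
def match_decoys(actives_mw, candidates_mw, per_active=2, tol=25.0):
    """For each active, pick up to per_active unused candidate molecules
    whose molecular weight lies within tol Da of the active's, so decoys
    resemble actives on the matched property but are assumed inactive."""
    used, decoys = set(), {}
    for a in actives_mw:
        picks = sorted((c for c in candidates_mw
                        if c not in used and abs(c - a) <= tol),
                       key=lambda c: abs(c - a))[:per_active]
        used.update(picks)
        decoys[a] = picks
    return decoys

actives = [300.0, 450.0]
candidates = [290.0, 310.0, 330.0, 440.0, 455.0, 600.0]
decoys = match_decoys(actives, candidates)
```

Candidates far outside the actives' property range (here 600 Da) are never selected; leaving them in would let a VS method "cheat" by separating actives from decoys on bulk properties rather than on binding-relevant features, which is exactly the bias matched decoys are meant to remove.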

  7. Extraction of drainage networks from large terrain datasets using high throughput computing

    Science.gov (United States)

    Gong, Jianya; Xie, Jibo

    2009-02-01

    Advanced digital photogrammetry and remote sensing technology produces large terrain datasets (LTD). How to process and use these LTD has become a big challenge for GIS users. Extracting drainage networks, which are basic for hydrological applications, from LTD is one of the typical applications of digital terrain analysis (DTA) in geographical information applications. Existing serial drainage algorithms cannot deal with large data volumes in a timely fashion, and few GIS platforms can process LTD beyond the GB size. High throughput computing (HTC), a distributed parallel computing mode, is proposed to improve the efficiency of drainage networks extraction from LTD. Drainage network extraction using HTC involves two key issues: (1) how to decompose the large DEM datasets into independent computing units and (2) how to merge the separate outputs into a final result. A new decomposition method is presented in which the large datasets are partitioned into independent computing units using natural watershed boundaries instead of using regular 1-dimensional (strip-wise) and 2-dimensional (block-wise) decomposition. Because the distribution of drainage networks is strongly related to watershed boundaries, the new decomposition method is more effective and natural. The method to extract natural watershed boundaries was improved by using multi-scale DEMs instead of single-scale DEMs. A HTC environment is employed to test the proposed methods with real datasets.
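
The two key issues named above (decomposing the DEM into independent computing units, then merging per-unit outputs) follow a generic decompose-compute-merge pattern. The sketch below uses hypothetical function names and toy data rather than the authors' actual watershed-based algorithm:

```python
# Minimal sketch of the decompose-compute-merge pattern for high
# throughput computing: each unit (e.g. one watershed) is processed
# independently, then the partial results are merged.
from multiprocessing import Pool

def extract_drainage(unit):
    # Stand-in for per-watershed drainage extraction; here it just
    # tags each cell id in the unit as a "segment" of the network.
    return [("segment", cell) for cell in unit]

def merge(results):
    # Concatenate the per-unit outputs into one drainage network.
    network = []
    for r in results:
        network.extend(r)
    return network

if __name__ == "__main__":
    units = [[1, 2], [3], [4, 5, 6]]  # toy watershed-based partition
    with Pool(2) as pool:
        results = pool.map(extract_drainage, units)
    print(len(merge(results)))  # 6 segments
```

Because watershed boundaries make the units hydrologically independent, no inter-unit communication is needed, which is what makes the problem a good fit for high throughput computing.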

  8. Proteomics dataset

    DEFF Research Database (Denmark)

    Bennike, Tue Bjerg; Carlsen, Thomas Gelsing; Ellingsen, Torkell

    2017-01-01

    The datasets presented in this article are related to the research articles entitled “Neutrophil Extracellular Traps in Ulcerative Colitis: A Proteome Analysis of Intestinal Biopsies” (Bennike et al., 2015 [1]) and “Proteome Analysis of Rheumatoid Arthritis Gut Mucosa” (Bennike et al., 2017 [2]). The data have been deposited to the ProteomeXchange Consortium via the PRIDE partner repository with the dataset identifiers PXD001608 for ulcerative colitis and control samples, and PXD003082 for rheumatoid arthritis samples.

  9. Utilizing LiDAR Datasets From Experimental Watersheds to Advance Ecohydrological Understanding in Seasonally Snow-Covered Forests

    Science.gov (United States)

    Harpold, A. A.; Broxton, P. D.; Guo, Q.; Barlage, M. J.; Gochis, D. J.

    2014-12-01

    The Western U.S. is strongly reliant on snowmelt from forested areas for ecosystem services and downstream populations. The ability to manage water resources from snow-covered forests faces major challenges from drought, disturbance, and regional changes in climate. An exciting avenue for improving ecohydrological process understanding is Light Detection and Ranging (LiDAR) because the technology simultaneously observes topography, forest properties, and snow/ice at high-resolution (100 km2). The availability and quality of LiDAR datasets are increasing rapidly; however, they remain under-utilized for process-based ecohydrology investigations. This presentation will illustrate how LiDAR datasets from the Critical Zone Observatory (CZO) network have been applied to advance ecohydrological understanding through direct empirical analysis, as well as model parameterization and verification. Direct analysis of the datasets has proved fruitful for pre- and post-disturbance snow distribution estimates and for interpreting in-situ snow depth measurements across sites. In addition, we illustrate the potential value of LiDAR to parameterize and verify physical models with two examples. First, we use LiDAR to parameterize a land surface model, Noah multi-parameterization (Noah-MP), to investigate the sensitivity of modeled water and energy fluxes to high-resolution forest information. Second, we present a Snow Physics and Laser Mapping (SnowPALM) model that is parameterized with LiDAR information at its native 1-m scale. Both modeling studies demonstrate the value of LiDAR for representing processes with greater fidelity. More importantly, the increased model fidelity led to different estimates of water and energy fluxes at larger, watershed scales. Creating a network of experimental watersheds with LiDAR datasets offers the potential to test theories and models in previously unexplored ways.

  10. Resolution testing and limitations of geodetic and tsunami datasets for finite fault inversions along subduction zones

    Science.gov (United States)

    Williamson, A.; Newman, A. V.

    2017-12-01

    Finite fault inversions utilizing multiple datasets have become commonplace for large earthquakes when data are available. The mixture of geodetic datasets such as Global Navigational Satellite Systems (GNSS) and InSAR, seismic waveforms, and, when applicable, tsunami waveforms from Deep-Ocean Assessment and Reporting of Tsunami (DART) gauges provides slightly different observations that, when incorporated together, lead to a more robust model of the fault slip distribution. The merging of different datasets is of particular importance along subduction zones, where direct observations of seafloor deformation over the rupture area are extremely limited. Instead, instrumentation measures related ground motion from tens to hundreds of kilometers away. The distance from the event and the dataset type can lead to a variable degree of resolution, affecting the ability to accurately model the spatial distribution of slip. This study analyzes the spatial resolution attained individually from geodetic and tsunami datasets as well as in a combined dataset. We constrain the importance of distance between estimated parameters and observed data and how that varies between land-based and open-ocean datasets. Analysis focuses on accurately scaled subduction zone synthetic models as well as analysis of the relationship between slip and data in recent large subduction zone earthquakes. This study shows that seafloor-deformation-sensitive datasets, like open-ocean tsunami waveforms or seafloor geodetic instrumentation, can provide unique offshore resolution for understanding most large and particularly tsunamigenic megathrust earthquake activity. In most environments, we simply lack the capability to resolve static displacements using land-based geodetic observations.

  11. Effects of water chemistry and potential distribution on electrochemical corrosion potential measurements in 553 K pure water

    International Nuclear Information System (INIS)

    Ishida, Kazushige; Wada, Yoichi; Tachibana, Masahiko; Ota, Nobuyuki; Aizawa, Motohiro

    2013-01-01

    The effects of water chemistry distribution on the potential of a reference electrode and of the potential distribution on the measured potential should be known qualitatively to obtain accurate electrochemical corrosion potential (ECP) data in BWRs. First, the effects of oxygen on a platinum reference electrode were studied in 553 K pure water containing dissolved hydrogen (DH) concentrations of 26 to 10⁵ μg kg⁻¹ (ppb). The platinum electrode worked in the same way as the theoretical hydrogen electrode under the condition that the molar ratio of DH to dissolved oxygen (DO) was more than 10 and that DO was less than 100 ppb. Second, the effects of potential distribution on the measured potential were studied by using an ECP measurement part without platinum deposition on the surfaces connected to another ECP measurement part with platinum deposition on the surfaces, in 553 K pure water containing 100 - 130 ppb of DH, or 100 - 130 ppb of DH plus 400 ppb of hydrogen peroxide. Measured potentials for each ECP measurement part were in good agreement with literature data for each surface condition. The lead wire connecting point did not affect the measured potential. Potential should be measured at the nearest point to the reference electrode, in which case it is affected by neither the potential distribution nor the connection point of the lead wire in pure water. (author)
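
The molar-ratio condition reported above (DH/DO molar ratio above 10, with DO below 100 ppb) can be checked from ppb-level mass concentrations. The helper name and sample values below are illustrative:

```python
# Convert ppb (μg/kg) mass concentrations to molar concentrations and
# test the condition under which the Pt electrode behaved as a
# theoretical hydrogen electrode (molar DH/DO > 10 and DO < 100 ppb).
M_H2 = 2.016   # molar mass of H2, g/mol
M_O2 = 31.998  # molar mass of O2, g/mol

def pt_electrode_ok(dh_ppb, do_ppb):
    if do_ppb == 0:
        return True  # no oxygen: hydrogen-electrode behaviour holds
    molar_ratio = (dh_ppb / M_H2) / (do_ppb / M_O2)
    return molar_ratio > 10 and do_ppb < 100

print(pt_electrode_ok(100, 50))  # molar ratio ≈ 31.7 -> True
```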

  12. Prediction potential of candidate biomarker sets identified and validated on gene expression data from multiple datasets

    Directory of Open Access Journals (Sweden)

    Karacali Bilge

    2007-10-01

    Full Text Available Abstract Background Independently derived expression profiles of the same biological condition often have few genes in common. In this study, we created populations of expression profiles from publicly available microarray datasets of cancer (breast, lymphoma and renal) samples linked to clinical information with an iterative machine learning algorithm. ROC curves were used to assess the prediction error of each profile for classification. We compared the prediction error of profiles correlated with molecular phenotype against profiles correlated with relapse-free status. Prediction errors of profiles identified with supervised univariate feature selection algorithms were compared to profiles selected randomly from (a) all genes on the microarray platform and (b) a list of known disease-related genes (a priori selection). We also determined the relevance of expression profiles on test arrays from independent datasets, measured on either the same or different microarray platforms. Results Highly discriminative expression profiles were produced on both simulated gene expression data and expression data from breast cancer and lymphoma datasets on the basis of ER and BCL-6 expression, respectively. Use of relapse-free status to identify profiles for prognosis prediction resulted in poorly discriminative decision rules. Supervised feature selection resulted in more accurate classifications than random or a priori selection; however, the difference in prediction error decreased as the number of features increased. These results held when decision rules were applied across datasets to samples profiled on the same microarray platform. Conclusion Our results show that many gene sets predict molecular phenotypes accurately. Given this, expression profiles identified using different training datasets should be expected to show little agreement. In addition, we demonstrate the difficulty in predicting relapse directly from microarray data using supervised machine learning.
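
A minimal sketch of supervised univariate feature selection of the kind compared above: rank each gene by a t-like statistic between two classes and keep the top k. The data and gene names are hypothetical:

```python
# Sketch: univariate (per-gene) supervised feature selection using a
# Welch-style t statistic between class 1 and class 0 samples.
import math

def t_stat(xs, ys):
    mx = sum(xs) / len(xs); my = sum(ys) / len(ys)
    vx = sum((x - mx) ** 2 for x in xs) / (len(xs) - 1)
    vy = sum((y - my) ** 2 for y in ys) / (len(ys) - 1)
    return (mx - my) / math.sqrt(vx / len(xs) + vy / len(ys))

def select_features(expr, labels, k):
    """expr: {gene: [expression per sample]}, labels: 0/1 per sample.
    Returns the k genes with the largest absolute t statistic."""
    scores = {}
    for gene, vals in expr.items():
        a = [v for v, l in zip(vals, labels) if l == 1]
        b = [v for v, l in zip(vals, labels) if l == 0]
        scores[gene] = abs(t_stat(a, b))
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

Random or a priori selection, against which the paper compares, would simply draw k genes without looking at the labels.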

  13. Distribution of electric potential in hydrocarbon flames

    Energy Technology Data Exchange (ETDEWEB)

    Fialkov, B.S.; Shcherbakov, N.D.; Plitsyn, V.T.

    1978-01-01

    A study was made of the distribution of electrical potential and temperatures in laminar methane and propane--butane flames when the excess air coefficient in the mixture is changed from 0 to 1.2. 7 references, 3 figures.

  14. The measurement of potential distribution of plasma in MM-4 fusion device

    International Nuclear Information System (INIS)

    Tian Zhongyu; Ming Linzhou; Feng Xiaozhen; Feng Chuntang; Yi Youjun; Wang Jihai; Liu Yihua

    1988-11-01

    Some experimental results on the potential distribution in the MM-4 fusion device, obtained by measuring the floating potential of a probe, are presented. The results showed that the distribution of the axial potential is asymmetrical, while the radial potential is symmetrical. There are double ion potential wells in the plasma. The depth of the deepest potential well becomes deeper as the strength of the magnetic field and the injection current increase. The location of the deepest well moves towards the device center as the injection energy increases. This is different from other reported results. The mechanism causing this distribution is also discussed.

  15. Near term climate projections for invasive species distributions

    Science.gov (United States)

    Jarnevich, C.S.; Stohlgren, T.J.

    2009-01-01

    Climate change and invasive species pose important conservation issues separately, and should be examined together. We used existing long term climate datasets for the US to project potential climate change into the future at a finer spatial and temporal resolution than the climate change scenarios generally available. These fine scale projections, along with new species distribution modeling techniques to forecast the potential extent of invasive species, can provide useful information to aid conservation and invasive species management efforts. We created habitat suitability maps for Pueraria montana (kudzu) under current climatic conditions and potential average conditions up to 30 years in the future. We examined how the potential distribution of this species will be affected by changing climate, and the management implications associated with these changes. Our models indicated that P. montana may increase its distribution particularly in the Northeast with climate change and may decrease in other areas. © 2008 Springer Science+Business Media B.V.

  16. A statistical comparison of cirrus particle size distributions measured using the 2-D stereo probe during the TC4, SPARTICUS, and MACPEX flight campaigns with historical cirrus datasets

    Science.gov (United States)

    Schwartz, M. Christian

    2017-08-01

    This paper addresses two straightforward questions. First, how similar are the statistics of cirrus particle size distribution (PSD) datasets collected using the Two-Dimensional Stereo (2D-S) probe to cirrus PSD datasets collected using older Particle Measuring Systems (PMS) 2-D Cloud (2DC) and 2-D Precipitation (2DP) probes? Second, how similar are the datasets when shatter-correcting post-processing is applied to the 2DC datasets? To answer these questions, a database of measured and parameterized cirrus PSDs - constructed from measurements taken during the Small Particles in Cirrus (SPARTICUS); Mid-latitude Airborne Cirrus Properties Experiment (MACPEX); and Tropical Composition, Cloud, and Climate Coupling (TC4) flight campaigns - is used. Bulk cloud quantities are computed from the 2D-S database in three ways: first, directly from the 2D-S data; second, by applying the 2D-S data to ice PSD parameterizations developed using sets of cirrus measurements collected using the older PMS probes; and third, by applying the 2D-S data to a similar parameterization developed using the 2D-S data themselves. This is done so that measurements of the same cloud volumes by parameterized versions of the 2DC and 2D-S can be compared with one another. It is thereby seen - given the same cloud field and given the same assumptions concerning ice crystal cross-sectional area, density, and radar cross section - that the parameterized 2D-S and the parameterized 2DC predict similar distributions of inferred shortwave extinction coefficient, ice water content, and 94 GHz radar reflectivity. However, the parameterization of the 2DC based on uncorrected data predicts a statistically significantly higher number of total ice crystals and a larger ratio of small ice crystals to large ice crystals than does the parameterized 2D-S. The 2DC parameterization based on shatter-corrected data also predicts statistically different numbers of ice crystals than does the parameterized 2D-S, but the
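
Bulk cloud quantities such as ice water content are obtained by integrating an assumed single-particle property over the measured PSD. A minimal sketch using a mass-dimension power law m = aD^b (the coefficients below follow the commonly used Brown-Francis cgs form and are an assumption here, not the paper's parameterization):

```python
# Sketch: integrate mass over a binned PSD to get ice water content.
# diams_um: bin-centre maximum dimensions (μm), assumed uniform grid;
# conc_per_L_um: number concentration density (L^-1 μm^-1).

def ice_water_content(diams_um, conc_per_L_um, a=0.00294, b=1.9):
    """Midpoint-rule integral of m(D) * N(D) dD, returned in g m^-3.
    a, b are assumed mass-dimension coefficients (D in cm, m in g)."""
    dd = diams_um[1] - diams_um[0]  # bin width (uniform grid assumed)
    iwc_per_L = 0.0
    for d, n in zip(diams_um, conc_per_L_um):
        mass_g = a * (d * 1e-4) ** b   # convert μm -> cm
        iwc_per_L += mass_g * n * dd   # g per litre
    return iwc_per_L * 1e3             # g per cubic metre
```

Extinction and radar reflectivity are computed the same way, with cross-sectional area or backscatter cross section in place of mass.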

  18. Benchmarking of Typical Meteorological Year datasets dedicated to Concentrated-PV systems

    Science.gov (United States)

    Realpe, Ana Maria; Vernay, Christophe; Pitaval, Sébastien; Blanc, Philippe; Wald, Lucien; Lenoir, Camille

    2016-04-01

    Accurate analysis of meteorological and pyranometric data for long-term analysis is the basis of decision-making for banks and investors regarding solar energy conversion systems. This has led to the development of methodologies for the generation of Typical Meteorological Year (TMY) datasets. The most used method for solar energy conversion systems was proposed in 1978 by the Sandia Laboratory (Hall et al., 1978), considering a specific weighted combination of different meteorological variables, notably global and diffuse horizontal and direct normal irradiances, air temperature, wind speed, and relative humidity. In 2012, a new approach was proposed in the framework of the European FP7 project ENDORSE. It introduced the concept of a "driver", defined by the user as an explicit function of the relevant pyranometric and meteorological variables, to improve the representativeness of TMY datasets with respect to the specific solar energy conversion system of interest. The present study aims at comparing and benchmarking different TMY datasets considering a specific Concentrated-PV (CPV) system as the solar energy conversion system of interest. Using long-term (15+ years) time series of high-quality meteorological and pyranometric ground measurements, TMY datasets were generated by three methods: the Sandia method, a simplified driver with DNI as the only representative variable, and a more sophisticated driver. The latter takes into account the sensitivities of the CPV system with respect to the spectral distribution of the solar irradiance and wind speed. Different TMY datasets from the three methods have been generated considering different numbers of years in the historical dataset, ranging from 5 to 15 years. The comparisons and benchmarking of these TMY datasets are conducted considering the long-term time series of simulated CPV electric production as a reference. The results of this benchmarking clearly show that the Sandia method is not
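
The core step of the Sandia-style TMY selection can be sketched as a weighted Finkelstein-Schafer (FS) comparison of each candidate month's empirical CDF against the long-term CDF. The implementation, weights, and data below are illustrative:

```python
# Sketch: pick the most "typical" candidate month by minimizing a
# weighted sum of Finkelstein-Schafer statistics over variables.

def ecdf(sample, x):
    """Empirical CDF of `sample` evaluated at x."""
    return sum(1 for v in sample if v <= x) / len(sample)

def fs_statistic(month_vals, longterm_vals):
    """Mean absolute difference between the candidate month's CDF
    and the long-term CDF, evaluated at the long-term values."""
    return sum(abs(ecdf(month_vals, v) - ecdf(longterm_vals, v))
               for v in longterm_vals) / len(longterm_vals)

def best_month(candidates, longterm, weights_by_var):
    """candidates: {year: {var: [daily values]}}, longterm: {var: [...]}.
    Returns the year whose month best matches the long-term record."""
    def score(year):
        return sum(w * fs_statistic(candidates[year][var], longterm[var])
                   for var, w in weights_by_var.items())
    return min(candidates, key=score)
```

A driver-based method replaces the fixed weighted sum with a user-defined function of the variables tailored to the conversion system of interest.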

  19. POTENTIAL DISTRIBUTION OF ORANGE JASMINE (Murraya paniculata) IN MEXICO

    Directory of Open Access Journals (Sweden)

    José López-Collado

    2013-04-01

    Full Text Available Orange jasmine (OJ) is a common ornamental plant used as a green hedge in public and private gardens in Mexico. It also hosts Huanglongbing, a worldwide citrus disease, and its vector, Diaphorina citri. For risk analysis and management purposes, it is important to know its geographic distribution. The potential distribution of OJ in Mexico was calculated using a deductive approach. Based on temperature and precipitation requirements, a relative suitability index was computed by combining the normalized values of both variables. The distribution was overlapped with captures of D. citri to check their spatial similarity. The results showed that the potential for occurrence is high in the Pacific and Gulf of México coastal states, including the Yucatán peninsula, while the lowest values appeared in the north-western states. The OJ distribution overlapped with Huanglongbing occurrence and coincided with captures of D. citri for most of the suitable area, but D. citri captures extended beyond the optimal OJ distribution values in the northern regions of México.
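
A deductive suitability index of the kind described above can be sketched by normalizing a temperature response and a precipitation response to [0, 1] and combining them. The triangular response shape and all thresholds below are assumptions for illustration, not the paper's values:

```python
# Sketch: climate-envelope suitability from two normalized responses.

def triangular(x, lo, opt, hi):
    """0 outside [lo, hi], 1 at the optimum, linear in between."""
    if x <= lo or x >= hi:
        return 0.0
    if x < opt:
        return (x - lo) / (opt - lo)
    return (hi - x) / (hi - opt)

def suitability(temp_c, precip_mm):
    """Joint suitability index in [0, 1] for one grid cell.
    Thresholds are hypothetical requirements, not published values."""
    s_t = triangular(temp_c, 10.0, 25.0, 38.0)      # mean temperature
    s_p = triangular(precip_mm, 400.0, 1500.0, 3000.0)  # annual rainfall
    return s_t * s_p
```

Applied cell by cell to climate rasters, this yields the relative suitability surface that is then compared against vector captures.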

  20. Megastudies, crowdsourcing, and large datasets in psycholinguistics: An overview of recent developments.

    Science.gov (United States)

    Keuleers, Emmanuel; Balota, David A

    2015-01-01

    This paper introduces and summarizes the special issue on megastudies, crowdsourcing, and large datasets in psycholinguistics. We provide a brief historical overview and show how the papers in this issue have extended the field by compiling new databases and making important theoretical contributions. In addition, we discuss several studies that use text corpora to build distributional semantic models to tackle various interesting problems in psycholinguistics. Finally, as is the case across the papers, we highlight some methodological issues that are brought forth via the analyses of such datasets.

  1. An integrated dataset for in silico drug discovery

    Directory of Open Access Journals (Sweden)

    Cockell Simon J

    2010-12-01

    Full Text Available Drug development is expensive and prone to failure. It is potentially much less risky and expensive to reuse a drug developed for one condition for treating a second disease, than it is to develop an entirely new compound. Systematic approaches to drug repositioning are needed to increase throughput and find candidates more reliably. Here we address this need with an integrated systems biology dataset, developed using the Ondex data integration platform, for the in silico discovery of new drug repositioning candidates. We demonstrate that the information in this dataset allows known repositioning examples to be discovered. We also propose a means of automating the search for new treatment indications of existing compounds.

  2. RARD: The Related-Article Recommendation Dataset

    OpenAIRE

    Beel, Joeran; Carevic, Zeljko; Schaible, Johann; Neusch, Gabor

    2017-01-01

    Recommender-system datasets are used for recommender-system evaluations, training machine-learning algorithms, and exploring user behavior. While there are many datasets for recommender systems in the domains of movies, books, and music, there are rather few datasets from research-paper recommender systems. In this paper, we introduce RARD, the Related-Article Recommendation Dataset, from the digital library Sowiport and the recommendation-as-a-service provider Mr. DLib. The dataset contains ...

  3. Parametric distribution approach for flow availability in small hydro potential analysis

    Science.gov (United States)

    Abdullah, Samizee; Basri, Mohd Juhari Mat; Jamaluddin, Zahrul Zamri; Azrulhisham, Engku Ahmad; Othman, Jamel

    2016-10-01

    Small hydro systems are one of the important sources of renewable energy and have been recognized worldwide as a clean energy source. Small hydropower generation, which uses the potential energy in flowing water to produce electricity, is often questioned due to the inconsistent and intermittent nature of the power generated. Potential analysis of a small hydro system, which is mainly dependent on the availability of water, requires knowledge of the water flow or stream flow distribution. This paper presents the possibility of applying the Pearson system to approximate the stream flow availability distribution in small hydro systems. By considering the stochastic nature of stream flow, the Pearson parametric distribution approximation was computed based on the defining characteristic of the Pearson system: a direct relation between the distribution and its first four statistical moments. The advantage of applying the various statistical moments in small hydro potential analysis is the ability to analyze the varying shapes of the stream flow distribution.
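
The first four statistical moments on which the Pearson system relies can be computed directly from a stream-flow sample. This is only the moment-estimation step; selecting the Pearson type from the moments is omitted:

```python
# Sketch: mean, variance, skewness and kurtosis of a flow sample,
# the inputs to a Pearson-system distribution fit.
import math

def four_moments(flows):
    n = len(flows)
    mean = sum(flows) / n
    m2 = sum((x - mean) ** 2 for x in flows) / n  # variance
    m3 = sum((x - mean) ** 3 for x in flows) / n
    m4 = sum((x - mean) ** 4 for x in flows) / n
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2  # (non-excess) kurtosis
    return mean, m2, skew, kurt
```

In the Pearson system, the skewness and kurtosis pair determines which member of the family (e.g. type I, III, VI) best matches the observed flow-duration behaviour.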

  4. Isfahan MISP Dataset.

    Science.gov (United States)

    Kashefpur, Masoud; Kafieh, Rahele; Jorjandi, Sahar; Golmohammadi, Hadis; Khodabande, Zahra; Abbasi, Mohammadreza; Teifuri, Nilufar; Fakharzadeh, Ali Akbar; Kashefpoor, Maryam; Rabbani, Hossein

    2017-01-01

    An online depository was introduced to share clinical ground truth with the public and provide open access for researchers to evaluate their computer-aided algorithms. PHP was used for web programming and MySQL for database management. The website was entitled "biosigdata.com." It was a fast, secure, and easy-to-use online database for medical signals and images. Freely registered users could download the datasets and could also share their own supplementary materials while maintaining their privacy (citation and fee). Commenting was also available for all datasets, and an automatic sitemap and semi-automatic SEO indexing have been set up for the site. A comprehensive list of available websites for medical datasets is also presented as a Supplementary (http://journalonweb.com/tempaccess/4800.584.JMSS_55_16I3253.pdf).

  5. Sensitivity of a numerical wave model on wind re-analysis datasets

    Science.gov (United States)

    Lavidas, George; Venugopal, Vengatesan; Friedrich, Daniel

    2017-03-01

    Wind is the dominant process for wave generation. Detailed evaluation of metocean conditions strengthens our understanding of issues concerning potential offshore applications. However, the scarcity of buoys and the high cost of monitoring systems pose a barrier to properly defining offshore conditions. Through the use of numerical wave models, metocean conditions can be hindcast and forecast, providing reliable characterisations. This study reports the sensitivity of wind inputs on a numerical wave model for the Scottish region. Two re-analysis wind datasets with different spatio-temporal characteristics are used, the ERA-Interim Re-Analysis and the CFSR-NCEP Re-Analysis dataset. Different wind products alter results, affecting the accuracy obtained. The scope of this study is to assess different available wind databases and provide information concerning the most appropriate wind dataset for the specific region, based on temporal, spatial and geographic terms for wave modelling and offshore applications. Both wind input datasets delivered results from the numerical wave model with good correlation. Wave results from the 1-h dataset have higher peaks and lower biases, at the expense of a high scatter index. On the other hand, the 6-h dataset has lower scatter but higher biases. The study shows how the wind dataset affects the numerical wave modelling performance, and that depending on location and study needs, different wind inputs should be considered.
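
The bias and scatter index used to compare the two wind-forced runs can be computed as follows. This is a sketch with commonly used definitions; conventions vary slightly across studies:

```python
# Sketch: validation metrics for modelled vs observed wave heights.
import math

def bias(model, obs):
    """Mean model-minus-observation difference."""
    return sum(m - o for m, o in zip(model, obs)) / len(obs)

def scatter_index(model, obs):
    """RMSE normalised by the mean of the observations."""
    rmse = math.sqrt(sum((m - o) ** 2
                         for m, o in zip(model, obs)) / len(obs))
    return rmse / (sum(obs) / len(obs))
```

A dataset can thus have near-zero bias yet a large scatter index, which is exactly the trade-off reported between the 1-h and 6-h wind inputs.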

  6. The NASA Subsonic Jet Particle Image Velocimetry (PIV) Dataset

    Science.gov (United States)

    Bridges, James; Wernet, Mark P.

    2011-01-01

    Many tasks in fluids engineering require prediction of turbulence in jet flows. The present report documents the single-point statistics of velocity, mean and variance, of cold and hot jet flows. The jet velocities ranged from 0.5 to 1.4 times the ambient speed of sound, and temperatures ranged from unheated to a static temperature ratio of 2.7. Further, the report assesses the accuracy of the data, e.g., establishing uncertainties for the data. This paper covers the following five tasks: (1) Document acquisition and processing procedures used to create the particle image velocimetry (PIV) datasets. (2) Compare PIV data with hotwire and laser Doppler velocimetry (LDV) data published in the open literature. (3) Compare different datasets acquired at the same flow conditions in multiple tests to establish uncertainties. (4) Create a consensus dataset for a range of hot jet flows, including uncertainty bands. (5) Analyze this consensus dataset for self-consistency and compare jet characteristics to those of the open literature. The final objective was fulfilled by using the potential core length and the spread rate of the half-velocity radius to collapse the mean and turbulent velocity fields over the first 20 jet diameters.

  7. Semi-supervised tracking of extreme weather events in global spatio-temporal climate datasets

    Science.gov (United States)

    Kim, S. K.; Prabhat, M.; Williams, D. N.

    2017-12-01

    Deep neural networks have been successfully applied to detect extreme weather events in large-scale climate datasets, attaining performance that surpasses all previous hand-crafted methods. Recent work has shown that a multichannel spatio-temporal encoder-decoder CNN architecture is able to localize events with semi-supervised bounding boxes. Motivated by this work, we propose a new learning method based on Variational Auto-Encoders (VAE) and Long Short-Term Memory (LSTM) networks to track extreme weather events in spatio-temporal datasets. We treat spatio-temporal object tracking as learning the probabilistic distribution of continuous latent features of an auto-encoder using stochastic variational inference. For this, we assume that our datasets are i.i.d. and that the latent features can be modeled by a Gaussian distribution. In the proposed method, we first train a VAE to generate an approximate posterior given multichannel climate input containing an extreme climate event at a fixed time. Then, we predict the bounding box, location, and class of extreme climate events using convolutional layers, given an input concatenating three features: the embedding, the sampled mean, and the standard deviation. Lastly, we train the LSTM on the concatenated input to learn the temporal structure of the dataset by recurrently feeding the output back into the next time step's VAE input. Our contribution is two-fold. First, we show the first semi-supervised end-to-end architecture based on a VAE to track extreme weather events, which can be applied to massive unlabeled climate datasets. Second, the temporal movement of events is incorporated into bounding box prediction using the LSTM, which can improve localization accuracy. To our knowledge, this technique has not been explored in either the climate or the machine learning community.
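
Under the diagonal-Gaussian assumption stated above, the KL-divergence term of the VAE objective has a closed form. The sketch below is illustrative and is not the authors' implementation:

```python
# Sketch: closed-form KL term of a VAE with diagonal-Gaussian
# posterior N(mu, exp(log_var)) against a standard-normal prior.
import math

def kl_diag_gaussian(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over dims.
    Per dimension: 0.5 * (sigma^2 + mu^2 - 1 - log sigma^2)."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))
```

In training, this term is added to the reconstruction loss; it vanishes exactly when the posterior matches the standard-normal prior.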

  8. TrackingNet: A Large-Scale Dataset and Benchmark for Object Tracking in the Wild

    KAUST Repository

    Müller, Matthias; Bibi, Adel Aamer; Giancola, Silvio; Al-Subaihi, Salman; Ghanem, Bernard

    2018-01-01

    Despite the numerous developments in object tracking, further development of current tracking algorithms is limited by small and mostly saturated datasets. As a matter of fact, data-hungry trackers based on deep-learning currently rely on object detection datasets due to the scarcity of dedicated large-scale tracking datasets. In this work, we present TrackingNet, the first large-scale dataset and benchmark for object tracking in the wild. We provide more than 30K videos with more than 14 million dense bounding box annotations. Our dataset covers a wide selection of object classes in broad and diverse context. By releasing such a large-scale dataset, we expect deep trackers to further improve and generalize. In addition, we introduce a new benchmark composed of 500 novel videos, modeled with a distribution similar to our training dataset. By sequestering the annotation of the test set and providing an online evaluation server, we provide a fair benchmark for future development of object trackers. Deep trackers fine-tuned on a fraction of our dataset improve their performance by up to 1.6% on OTB100 and up to 1.7% on TrackingNet Test. We provide an extensive benchmark on TrackingNet by evaluating more than 20 trackers. Our results suggest that object tracking in the wild is far from being solved.

  9. TrackingNet: A Large-Scale Dataset and Benchmark for Object Tracking in the Wild

    KAUST Repository

    Müller, Matthias

    2018-03-28

    Despite the numerous developments in object tracking, further development of current tracking algorithms is limited by small and mostly saturated datasets. As a matter of fact, data-hungry trackers based on deep-learning currently rely on object detection datasets due to the scarcity of dedicated large-scale tracking datasets. In this work, we present TrackingNet, the first large-scale dataset and benchmark for object tracking in the wild. We provide more than 30K videos with more than 14 million dense bounding box annotations. Our dataset covers a wide selection of object classes in broad and diverse context. By releasing such a large-scale dataset, we expect deep trackers to further improve and generalize. In addition, we introduce a new benchmark composed of 500 novel videos, modeled with a distribution similar to our training dataset. By sequestering the annotation of the test set and providing an online evaluation server, we provide a fair benchmark for future development of object trackers. Deep trackers fine-tuned on a fraction of our dataset improve their performance by up to 1.6% on OTB100 and up to 1.7% on TrackingNet Test. We provide an extensive benchmark on TrackingNet by evaluating more than 20 trackers. Our results suggest that object tracking in the wild is far from being solved.

  10. SPICE: exploration and analysis of post-cytometric complex multivariate datasets.

    Science.gov (United States)

    Roederer, Mario; Nozzi, Joshua L; Nason, Martha C

    2011-02-01

    Polychromatic flow cytometry results in complex, multivariate datasets. To date, tools for the aggregate analysis of these datasets across multiple specimens grouped by different categorical variables, such as demographic information, have not been optimized. Often, the exploration of such datasets is accomplished by visualization of patterns with pie charts or bar charts, without easy access to statistical comparisons of measurements that comprise multiple components. Here we report on algorithms and a graphical interface we developed for these purposes. In particular, we discuss thresholding necessary for accurate representation of data in pie charts, the implications for display and comparison of normalized versus unnormalized data, and the effects of averaging when samples with significant background noise are present. Finally, we define a statistic for the nonparametric comparison of complex distributions to test for difference between groups of samples based on multi-component measurements. While originally developed to support the analysis of T cell functional profiles, these techniques are amenable to a broad range of datatypes. Published 2011 Wiley-Liss, Inc.

  11. Nonparametric Estimation of Distributions in Random Effects Models

    KAUST Repository

    Hart, Jeffrey D.

    2011-01-01

    We propose using minimum distance to obtain nonparametric estimates of the distributions of components in random effects models. A main setting considered is equivalent to having a large number of small datasets whose locations, and perhaps scales, vary randomly, but which otherwise have a common distribution. Interest focuses on estimating the distribution that is common to all datasets, knowledge of which is crucial in multiple testing problems where a location/scale invariant test is applied to every small dataset. A detailed algorithm for computing minimum distance estimates is proposed, and the usefulness of our methodology is illustrated by a simulation study and an analysis of microarray data. Supplemental materials for the article, including R-code and a dataset, are available online. © 2011 American Statistical Association.

  12. Open University Learning Analytics dataset.

    Science.gov (United States)

    Kuzilek, Jakub; Hlosta, Martin; Zdrahal, Zdenek

    2017-11-28

    Learning Analytics focuses on the collection and analysis of learners' data to improve their learning experience by providing informed guidance and to optimise learning materials. To support the research in this area we have developed a dataset, containing data from courses presented at the Open University (OU). What makes the dataset unique is the fact that it contains demographic data together with aggregated clickstream data of students' interactions in the Virtual Learning Environment (VLE). This enables the analysis of student behaviour, represented by their actions. The dataset contains the information about 22 courses, 32,593 students, their assessment results, and logs of their interactions with the VLE represented by daily summaries of student clicks (10,655,280 entries). The dataset is freely available at https://analyse.kmi.open.ac.uk/open_dataset under a CC-BY 4.0 license.

  13. Dataset of transcriptional landscape of B cell early activation

    Directory of Open Access Journals (Sweden)

    Alexander S. Garruss

    2015-09-01

    Full Text Available Signaling via B cell receptors (BCR) and Toll-like receptors (TLRs) results in activation of B cells with distinct physiological outcomes, but the transcriptional regulatory mechanisms that drive activation and distinguish these pathways remain unknown. At early time points after BCR and TLR ligand exposure (0.5 and 2 h), RNA-seq was performed, allowing observations on rapid transcriptional changes. At 2 h, ChIP-seq was performed to allow observations on important regulatory mechanisms potentially driving transcriptional change. The dataset includes RNA-seq; ChIP-seq of control (input), RNA Pol II, H3K4me3 and H3K27me3; and a separate RNA-seq for miRNA expression, which can be found at Gene Expression Omnibus Dataset GSE61608. Here, we provide details on the experimental and analysis methods used to obtain and analyze this dataset and to examine the transcriptional landscape of B cell early activation.

  14. Potential geographic distribution of hantavirus reservoirs in Brazil.

    Directory of Open Access Journals (Sweden)

    Stefan Vilges de Oliveira

    Full Text Available Hantavirus cardiopulmonary syndrome is an emerging zoonosis in Brazil. Human infections occur via inhalation of aerosolized viral particles from excreta of infected wild rodents. Necromys lasiurus and Oligoryzomys nigripes appear to be the main reservoirs of hantavirus in the Atlantic Forest and Cerrado biomes. We estimated and compared ecological niches of the two rodent species, and analyzed environmental factors influencing their occurrence, to understand the geography of hantavirus transmission. N. lasiurus showed a wide potential distribution in Brazil, in the Cerrado, Caatinga, and Atlantic Forest biomes. Highest climate suitability for O. nigripes was observed along the Brazilian Atlantic coast. Maximum temperature in the warmest months and annual precipitation were the variables that most influence the distributions of N. lasiurus and O. nigripes, respectively. Models based on occurrences of infected rodents estimated a broader area of risk for hantavirus transmission in southeastern and southern Brazil, coinciding with the distribution of human cases of hantavirus cardiopulmonary syndrome. We found no demonstrable environmental differences among occurrence sites for the rodents and for human cases of hantavirus. However, areas of northern and northeastern Brazil are also apparently suitable for the two species, without broad coincidence with human cases. Modeling of niches and distributions of rodent reservoirs indicates potential for transmission of hantavirus across virtually all of Brazil outside the Amazon Basin.

  15. Geoseq: a tool for dissecting deep-sequencing datasets

    Directory of Open Access Journals (Sweden)

    Homann Robert

    2010-10-01

    Full Text Available Abstract Background Datasets generated on deep-sequencing platforms have been deposited in various public repositories such as the Gene Expression Omnibus (GEO) and Sequence Read Archive (SRA) hosted by the NCBI, or the DNA Data Bank of Japan (DDBJ). Despite being rich data sources, they have not been used much due to the difficulty in locating and analyzing datasets of interest. Results Geoseq http://geoseq.mssm.edu provides a new method of analyzing short reads from deep sequencing experiments. Instead of mapping the reads to reference genomes or sequences, Geoseq maps a reference sequence against the sequencing data. It is web-based, and holds pre-computed data from public libraries. The analysis reduces the input sequence to tiles and measures the coverage of each tile in a sequence library through the use of suffix arrays. The user can upload custom target sequences or use gene/miRNA names for the search and get back results as plots and spreadsheet files. Geoseq organizes the public sequencing data using a controlled vocabulary, allowing identification of relevant libraries by organism, tissue and type of experiment. Conclusions Analysis of small sets of sequences against deep-sequencing datasets, as well as identification of public datasets of interest, is simplified by Geoseq. We applied Geoseq to (a) identify differential isoform expression in mRNA-seq datasets, (b) identify miRNAs (microRNAs) in libraries and identify mature and star sequences in miRNAs, and (c) identify potentially mis-annotated miRNAs. The ease of using Geoseq for these analyses suggests its utility and uniqueness as an analysis tool.
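
    The tile-and-coverage idea can be illustrated in a few lines; note that Geoseq uses suffix arrays, whereas this hypothetical sketch substitutes a simple k-mer dictionary index for clarity:

```python
from collections import defaultdict

def build_index(reads, k):
    """Count every k-mer occurring in the read library
    (a stand-in for the suffix array Geoseq actually uses)."""
    index = defaultdict(int)
    for read in reads:
        for i in range(len(read) - k + 1):
            index[read[i:i + k]] += 1
    return index

def tile_coverage(reference, reads, k=4):
    """Reduce the reference sequence to overlapping k-mer tiles and
    report how often each tile occurs in the sequencing library."""
    index = build_index(reads, k)
    tiles = [reference[i:i + k] for i in range(len(reference) - k + 1)]
    return {t: index[t] for t in tiles}

cov = tile_coverage("ACGTAC", ["ACGTACGT", "TTACGTAA"], k=4)
# cov maps each reference tile to its occurrence count in the library
```

    A suffix array replaces the dictionary when the library is too large to re-scan per query, but the per-tile counts are the same.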

  16. FASTQSim: platform-independent data characterization and in silico read generation for NGS datasets.

    Science.gov (United States)

    Shcherbina, Anna

    2014-08-15

    High-throughput next generation sequencing technologies have enabled rapid characterization of clinical and environmental samples. Consequently, the largest bottleneck to actionable data has become sample processing and bioinformatics analysis, creating a need for accurate and rapid algorithms to process genetic data. Perfectly characterized in silico datasets are a useful tool for evaluating the performance of such algorithms. Background contaminating organisms are observed in sequenced mixtures of organisms. In silico samples provide exact truth. To create the best value for evaluating algorithms, in silico data should mimic actual sequencer data as closely as possible. FASTQSim is a tool that provides the dual functionality of NGS dataset characterization and metagenomic data generation. FASTQSim is sequencing platform-independent, and computes distributions of read length, quality scores, indel rates, single point mutation rates, indel size, and similar statistics for any sequencing platform. To create training or testing datasets, FASTQSim has the ability to convert target sequences into in silico reads with specific error profiles obtained in the characterization step. FASTQSim enables users to assess the quality of NGS datasets. The tool provides information about read length, read quality, repetitive and non-repetitive indel profiles, and single base pair substitutions. FASTQSim allows the user to simulate individual read datasets that can be used as standardized test scenarios for planning sequencing projects or for benchmarking metagenomic software. In this regard, in silico datasets generated with the FASTQsim tool hold several advantages over natural datasets: they are sequencing platform independent, extremely well characterized, and less expensive to generate. Such datasets are valuable in a number of applications, including the training of assemblers for multiple platforms, benchmarking bioinformatics algorithm performance, and creating challenge
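
    As a toy illustration of the read-generation step (single-base substitutions only; FASTQSim's actual error profiles also cover indels and quality scores, and this is not its code), one might write:

```python
import random

BASES = "ACGT"

def simulate_reads(target, read_len, n_reads, sub_rate, seed=0):
    """Draw reads uniformly from the target sequence and apply
    single-base substitutions at a characterized per-base rate."""
    rng = random.Random(seed)
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(target) - read_len + 1)
        read = list(target[start:start + read_len])
        for i, base in enumerate(read):
            if rng.random() < sub_rate:
                read[i] = rng.choice([b for b in BASES if b != base])
        reads.append("".join(read))
    return reads

# Hypothetical target and parameters:
reads = simulate_reads("ACGT" * 25, read_len=20, n_reads=5, sub_rate=0.05)
```

    Because the true origin and every introduced error are known, such reads provide the "exact truth" the abstract refers to.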

  17. Summer Steelhead Distribution [ds341

    Data.gov (United States)

    California Natural Resource Agency — Summer Steelhead Distribution October 2009 Version This dataset depicts observation-based stream-level geographic distribution of anadromous summer-run steelhead...

  18. Creating a Regional MODIS Satellite-Driven Net Primary Production Dataset for European Forests

    Directory of Open Access Journals (Sweden)

    Mathias Neumann

    2016-06-01

    Full Text Available Net primary production (NPP) is an important ecological metric for studying forest ecosystems and their carbon sequestration, for assessing the potential supply of food or timber and for quantifying the impacts of climate change on ecosystems. The global MODIS NPP dataset using the MOD17 algorithm provides valuable information for monitoring NPP at 1-km resolution. Since coarse-resolution global climate data are used, the global dataset may contain uncertainties for Europe. We used a 1-km daily gridded European climate data set with the MOD17 algorithm to create the regional NPP dataset MODIS EURO. For evaluation of this new dataset, we compare MODIS EURO with terrestrially driven NPP from analyzing and harmonizing forest inventory data (NFI) from 196,434 plots in 12 European countries, as well as with the global MODIS NPP dataset, for the years 2000 to 2012. Comparing these three NPP datasets, we found that the global MODIS NPP dataset differs from NFI NPP by 26%, while MODIS EURO differs by only 7%. MODIS EURO also agrees with NFI NPP across scales (from continental and regional to country level) and gradients (elevation, location, tree age, dominant species, etc.). The agreement is particularly good for elevation, dominant species or tree height. This suggests that using improved climate data allows the MOD17 algorithm to provide realistic NPP estimates for Europe. Local discrepancies between MODIS EURO and NFI NPP can be related to differences in stand density due to forest management and the national carbon estimation methods. With this study, we provide a consistent, temporally continuous and spatially explicit productivity dataset for the years 2000 to 2012 at a 1-km resolution, which can be used to assess climate change impacts on ecosystems or the potential biomass supply of European forests for an increasing bio-based economy. MODIS EURO data are made freely available at ftp://palantir.boku.ac.at/Public/MODIS_EURO.

  19. Redox potential distribution of an organic-rich contaminated site obtained by the inversion of self-potential data

    Science.gov (United States)

    Abbas, M.; Jardani, A.; Soueid Ahmed, A.; Revil, A.; Brigaud, L.; Bégassat, Ph.; Dupont, J. P.

    2017-11-01

    Mapping the redox potential of shallow aquifers impacted by hydrocarbon contaminant plumes is important for the characterization and remediation of such contaminated sites. The redox potential of groundwater is indicative of the biodegradation of hydrocarbons and is important in delineating the shapes of contaminant plumes. The self-potential method was used to reconstruct the redox potential of groundwater associated with an organic-rich contaminant plume in northern France. The self-potential technique is a passive technique consisting in recording the electrical potential distribution at the surface of the Earth. A self-potential map is essentially the sum of two contributions, one associated with groundwater flow referred to as the electrokinetic component, and one associated with redox potential anomalies referred to as the electroredox component (thermoelectric and diffusion potentials are generally negligible). A groundwater flow model was first used to remove the electrokinetic component from the observed self-potential data. Then, a residual self-potential map was obtained. The source current density generating the residual self-potential signals is assumed to be associated with the position of the water table, an interface characterized by a change in both the electrical conductivity and the redox potential. The source current density was obtained through an inverse problem by minimizing a cost function including a data misfit contribution and a regularizer. This inversion algorithm allows the determination of the vertical and horizontal components of the source current density taking into account the electrical conductivity distribution of the saturated and non-saturated zones obtained independently by electrical resistivity tomography. The redox potential distribution was finally determined from the inverted residual source current density. A redox map was successfully built and the estimated redox potential values correlated well with in

  20. Uranium and Iron XRF distribution and Fe speciation results

    Data.gov (United States)

    U.S. Environmental Protection Agency — Dataset 1: XRF image of U and Fe distribution Dataset 2: Fe linear combination fitting data. This dataset is associated with the following publication: Koster van...

  1. Sexual differentiation in the distribution potential of northern jaguars (Panthera onca)

    Science.gov (United States)

    Boydston, Erin E.; Lopez Gonzalez, Carlos A.

    2005-01-01

    We estimated the potential geographic distribution of jaguars in the southwestern United States and northwestern Mexico by modeling the jaguar ecological niche from occurrence records. We modeled separately the distribution of males and females, assuming records of females probably represented established home ranges while male records likely included dispersal movements. The predicted distribution for males was larger than that for females. Eastern Sonora appeared capable of supporting male and female jaguars, with potential range expansion into southeastern Arizona. New Mexico and Chihuahua contained environmental characteristics primarily limited to the male niche and thus may be areas into which males occasionally disperse.

  2. Potential, Distribution, Ethno-Botany and Tapping Procedures of ...

    African Journals Online (AJOL)

    Potential, Distribution, Ethno-Botany and Tapping Procedures of Gum Producing Acacia Species in the Somali Region, Southeastern Ethiopia. ... Therefore, promotion of gum extraction in the Somali Region both for economic benefit of the community and sustainable management of the fragile ecosystem is recommended.

  3. Mridangam stroke dataset

    OpenAIRE

    CompMusic

    2014-01-01

    The audio examples were recorded from a professional Carnatic percussionist under semi-anechoic studio conditions by Akshay Anantapadmanabhan using SM-58 microphones and an H4n ZOOM recorder. The audio was sampled at 44.1 kHz and stored as 16-bit wav files. The dataset can be used for training models for each Mridangam stroke. A detailed description of the Mridangam and its strokes can be found in the paper below. A part of the dataset was used in the following paper. Akshay Anantapadman...

  4. Potential Impacts of Climate Change on World Food Supply: Datasets from a Major Crop Modeling Study

    Data.gov (United States)

    National Aeronautics and Space Administration — Datasets from a Major Crop Modeling Study contain projected country and regional changes in grain crop yields due to global climate change. Equilibrium and transient...

  5. Nuclear momentum distribution and potential energy surface in hexagonal ice

    Science.gov (United States)

    Lin, Lin; Morrone, Joseph; Car, Roberto; Parrinello, Michele

    2011-03-01

    The proton momentum distribution in ice Ih has recently been measured by deep inelastic neutron scattering and calculated from open-path-integral Car-Parrinello simulation. Here we report a detailed investigation of the relation between the momentum distribution and the potential energy surface based on both experimental and simulation results. The potential experienced by the proton is largely harmonic and characterized by three principal frequencies, which can be associated with weighted averages of phonon frequencies via lattice dynamics calculations. This approach also allows us to examine the importance of quantum effects on the dynamics of the oxygen nuclei close to the melting temperature. Finally, we quantify the anharmonicity that is present in the potential acting on the protons. This work is supported by NSF and by DOE.

  6. 2008 TIGER/Line Nationwide Dataset

    Data.gov (United States)

    California Natural Resource Agency — This dataset contains a nationwide build of the 2008 TIGER/Line datasets from the US Census Bureau downloaded in April 2009. The TIGER/Line Shapefiles are an extract...

  7. Local multiplicity adjustment for the spatial scan statistic using the Gumbel distribution.

    Science.gov (United States)

    Gangnon, Ronald E

    2012-03-01

    The spatial scan statistic is an important and widely used tool for cluster detection. It is based on the simultaneous evaluation of the statistical significance of the maximum likelihood ratio test statistic over a large collection of potential clusters. In most cluster detection problems, there is variation in the extent of local multiplicity across the study region. For example, using a fixed maximum geographic radius for clusters, urban areas typically have many overlapping potential clusters, whereas rural areas have relatively few. The spatial scan statistic does not account for local multiplicity variation. We describe a previously proposed local multiplicity adjustment based on a nested Bonferroni correction and propose a novel adjustment based on a Gumbel distribution approximation to the distribution of a local scan statistic. We compare the performance of all three statistics in terms of power and a novel unbiased cluster detection criterion. These methods are then applied to the well-known New York leukemia dataset and a Wisconsin breast cancer incidence dataset. © 2011, The International Biometric Society.
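
    The Gumbel adjustment rests on approximating the null distribution of a local maximum statistic. A generic moment-matching sketch (not the authors' exact procedure, and with a toy null in place of a real scan statistic):

```python
import math
import random

def fit_gumbel(maxima):
    """Moment-match a Gumbel distribution to simulated null maxima:
    scale beta = s * sqrt(6)/pi, location mu = mean - gamma * beta."""
    n = len(maxima)
    mean = sum(maxima) / n
    var = sum((x - mean) ** 2 for x in maxima) / (n - 1)
    beta = math.sqrt(6.0 * var) / math.pi
    mu = mean - 0.5772156649 * beta     # Euler-Mascheroni constant
    return mu, beta

def gumbel_pvalue(observed, mu, beta):
    """Upper-tail p-value: P(M > x) = 1 - exp(-exp(-(x - mu)/beta))."""
    return 1.0 - math.exp(-math.exp(-(observed - mu) / beta))

# Toy null: the local maximum over 50 independent unit-normal statistics
rng = random.Random(1)
maxima = [max(rng.gauss(0, 1) for _ in range(50)) for _ in range(2000)]
mu, beta = fit_gumbel(maxima)
p = gumbel_pvalue(4.0, mu, beta)   # an extreme observed local maximum
```

    Fitting the Gumbel per location lets the adjustment reflect how many overlapping potential clusters each area contributes.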

  8. Distributed Parallel Architecture for "Big Data"

    Directory of Open Access Journals (Sweden)

    Catalin BOJA

    2012-01-01

    Full Text Available This paper is an extension of the "Distributed Parallel Architecture for Storing and Processing Large Datasets" paper presented at the WSEAS SEPADS’12 conference in Cambridge. In its original version the paper went over the benefits of using a distributed parallel architecture to store and process large datasets. This paper analyzes the problem of storing, processing and retrieving meaningful insight from petabytes of data. It provides a survey of current distributed and parallel data processing technologies and, based on them, proposes an architecture that can be used to solve the analyzed problem. In this version, more emphasis is put on distributed file systems and the ETL processes involved in a distributed environment.

  9. Exploring massive, genome scale datasets with the genometricorr package

    KAUST Repository

    Favorov, Alexander; Mularoni, Loris; Cope, Leslie M.; Medvedeva, Yulia; Mironov, Andrey A.; Makeev, Vsevolod J.; Wheelan, Sarah J.

    2012-01-01

    We have created a statistically grounded tool for determining the correlation of genomewide data with other datasets or known biological features, intended to guide biological exploration of high-dimensional datasets, rather than providing immediate answers. The software enables several biologically motivated approaches to these data and here we describe the rationale and implementation for each approach. Our models and statistics are implemented in an R package that efficiently calculates the spatial correlation between two sets of genomic intervals (data and/or annotated features), for use as a metric of functional interaction. The software handles any type of pointwise or interval data and instead of running analyses with predefined metrics, it computes the significance and direction of several types of spatial association; this is intended to suggest potentially relevant relationships between the datasets. Availability and implementation: The package, GenometriCorr, can be freely downloaded at http://genometricorr.sourceforge.net/. Installation guidelines and examples are available from the sourceforge repository. The package is pending submission to Bioconductor. © 2012 Favorov et al.
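
    The kind of spatial-association statistic the package computes can be caricatured by a nearest-distance permutation test on a single chromosome (a hypothetical stand-in for GenometriCorr's actual statistics):

```python
import random

def nearest_dist(points, anchors):
    """Mean distance from each query point (e.g. an interval midpoint)
    to its nearest anchor point."""
    return sum(min(abs(p - a) for a in anchors) for p in points) / len(points)

def permutation_pvalue(points, anchors, genome_len, n_perm=500, seed=0):
    """Are the query points closer to the anchors than uniform random
    placement would predict? A small p suggests spatial association."""
    rng = random.Random(seed)
    observed = nearest_dist(points, anchors)
    hits = 0
    for _ in range(n_perm):
        shuffled = [rng.uniform(0, genome_len) for _ in points]
        if nearest_dist(shuffled, anchors) <= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Hypothetical coordinates: query features tightly clustered near anchors
anchors = [1000, 5000, 9000]
points = [990, 1010, 4980, 5030, 9005]
p = permutation_pvalue(points, anchors, genome_len=10_000)
```

    The package's statistics also capture the direction and scale of the association, which a single distance summary cannot.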

  10. Exploring massive, genome scale datasets with the genometricorr package

    KAUST Repository

    Favorov, Alexander

    2012-05-31

    We have created a statistically grounded tool for determining the correlation of genomewide data with other datasets or known biological features, intended to guide biological exploration of high-dimensional datasets, rather than providing immediate answers. The software enables several biologically motivated approaches to these data and here we describe the rationale and implementation for each approach. Our models and statistics are implemented in an R package that efficiently calculates the spatial correlation between two sets of genomic intervals (data and/or annotated features), for use as a metric of functional interaction. The software handles any type of pointwise or interval data and instead of running analyses with predefined metrics, it computes the significance and direction of several types of spatial association; this is intended to suggest potentially relevant relationships between the datasets. Availability and implementation: The package, GenometriCorr, can be freely downloaded at http://genometricorr.sourceforge.net/. Installation guidelines and examples are available from the sourceforge repository. The package is pending submission to Bioconductor. © 2012 Favorov et al.

  11. Design of an audio advertisement dataset

    Science.gov (United States)

    Fu, Yutao; Liu, Jihong; Zhang, Qi; Geng, Yuting

    2015-12-01

    As more and more advertisements swarm into radio broadcasts, it is necessary to establish an audio advertising dataset that can be used to analyze and classify advertisements. A method for establishing a complete audio advertising dataset is presented in this paper. The dataset is divided into four different kinds of advertisements. Each advertisement sample is given in *.wav file format and annotated with a txt file that contains its file name, sampling frequency, channel number, broadcasting time and its class. The soundness of the classification of the advertisements in this dataset is demonstrated by clustering the different advertisements based on Principal Component Analysis (PCA). The experimental results show that this audio advertisement dataset offers a reliable set of samples for related audio advertisement experimental studies.
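
    The PCA step used to verify the class structure can be sketched with plain numpy (the acoustic feature matrix here is synthetic; real feature extraction from the *.wav files is omitted):

```python
import numpy as np

def pca(X, n_components=2):
    """Project feature vectors onto the top principal components
    via SVD of the mean-centered data matrix."""
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

rng = np.random.default_rng(0)
# Hypothetical 10-dim acoustic features for two advertisement classes
class_a = rng.normal(0.0, 0.1, size=(20, 10))
class_b = rng.normal(1.0, 0.1, size=(20, 10))
X = np.vstack([class_a, class_b])
scores = pca(X, n_components=2)
# Well-separated classes fall apart along the first principal component
```

    If the advertisement classes are acoustically coherent, their samples form distinct clusters in the low-dimensional projection, which is the check the abstract describes.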

  12. A new dataset validation system for the Planetary Science Archive

    Science.gov (United States)

    Manaud, N.; Zender, J.; Heather, D.; Martinez, S.

    2007-08-01

    The Planetary Science Archive is the official archive for the Mars Express mission. It received its first data at the end of 2004. These data are delivered by the PI teams to the PSA team as datasets formatted in conformance with the Planetary Data System (PDS). The PI teams are responsible for analyzing and calibrating the instrument data as well as for the production of reduced and calibrated data. They are also responsible for the scientific validation of these data. ESA is responsible for the long-term data archiving and distribution to the scientific community and must ensure, in this regard, that all archived products meet quality standards. To do so, an archive peer review is used to control the quality of the Mars Express science data archiving process. However, a full validation of its content is missing. An independent review board recently recommended that the completeness of the archive as well as the consistency of the delivered data should be validated following well-defined procedures. A new validation software tool is being developed to complete the overall data quality control system functionality. This new tool aims to improve the quality of data and services provided to the scientific community through the PSA, and shall allow anomalies to be tracked and the completeness of datasets to be controlled. It shall ensure that the PSA end-users: (1) can rely on the results of their queries, (2) will get data products that are suitable for scientific analysis, and (3) can find all science data acquired during a mission. We define dataset validation as the verification and assessment process that checks the dataset content against pre-defined top-level criteria, which represent the general characteristics of good-quality datasets. The dataset content that is checked includes the data and all types of information that are essential in the process of deriving scientific results and those interfacing with the PSA database. The validation software tool is a multi-mission tool that

  13. A conceptual prototype for the next-generation national elevation dataset

    Science.gov (United States)

    Stoker, Jason M.; Heidemann, Hans Karl; Evans, Gayla A.; Greenlee, Susan K.

    2013-01-01

    In 2012 the U.S. Geological Survey's (USGS) National Geospatial Program (NGP) funded a study to develop a conceptual prototype for a new National Elevation Dataset (NED) design with expanded capabilities to generate and deliver a suite of bare earth and above ground feature information over the United States. This report details the research on identifying operational requirements based on prior research, evaluation of what is needed for the USGS to meet these requirements, and development of a possible conceptual framework that could potentially deliver the kinds of information that are needed to support NGP's partners and constituents. This report provides an initial proof-of-concept demonstration using an existing dataset, and recommendations for the future, to inform NGP's ongoing and future elevation program planning and management decisions. The demonstration shows that this type of functional process can robustly create derivatives from lidar point cloud data; however, more research needs to be done to see how well it extends to multiple datasets.

  14. Exploring massive, genome scale datasets with the GenometriCorr package.

    Directory of Open Access Journals (Sweden)

    Alexander Favorov

    2012-05-01

    Full Text Available We have created a statistically grounded tool for determining the correlation of genomewide data with other datasets or known biological features, intended to guide biological exploration of high-dimensional datasets, rather than providing immediate answers. The software enables several biologically motivated approaches to these data and here we describe the rationale and implementation for each approach. Our models and statistics are implemented in an R package that efficiently calculates the spatial correlation between two sets of genomic intervals (data and/or annotated features), for use as a metric of functional interaction. The software handles any type of pointwise or interval data and instead of running analyses with predefined metrics, it computes the significance and direction of several types of spatial association; this is intended to suggest potentially relevant relationships between the datasets. The package, GenometriCorr, can be freely downloaded at http://genometricorr.sourceforge.net/. Installation guidelines and examples are available from the sourceforge repository. The package is pending submission to Bioconductor.

  15. Climate change and the potential distribution of an invasive shrub, Lantana camara L.

    Directory of Open Access Journals (Sweden)

    Subhashni Taylor

    Full Text Available The threat posed by invasive species, in particular weeds, to biodiversity may be exacerbated by climate change. Lantana camara L. (lantana) is a woody shrub that is highly invasive in many countries of the world. It has a profound economic and environmental impact worldwide, including Australia. Knowledge of the likely potential distribution of this invasive species under current and future climate will be useful in planning better strategies to manage the invasion. A process-oriented niche model of L. camara was developed using CLIMEX to estimate its potential distribution under current and future climate scenarios. The model was calibrated using data from several knowledge domains, including phenological observations and geographic distribution records. The potential distribution of lantana under historical climate exceeded the current distribution in some areas of the world, notably Africa and Asia. Under future scenarios, the climatically suitable areas for L. camara globally were projected to contract. However, some areas were identified in North Africa, Europe and Australia that may become climatically suitable under future climates. In South Africa and China, its potential distribution could expand further inland. These results can inform strategic planning by biosecurity agencies, identifying areas to target for eradication or containment. Distribution maps of risk of potential invasion can be useful tools in public awareness campaigns, especially in countries that have been identified as becoming climatically suitable for L. camara under the future climate scenarios.

  16. The GTZAN dataset

    DEFF Research Database (Denmark)

    Sturm, Bob L.

    2013-01-01

    The GTZAN dataset appears in at least 100 published works, and is the most-used public dataset for evaluation in machine listening research for music genre recognition (MGR). Our recent work, however, shows GTZAN has several faults (repetitions, mislabelings, and distortions), which challenge...... of GTZAN, and provide a catalog of its faults. We review how GTZAN has been used in MGR research, and find few indications that its faults have been known and considered. Finally, we rigorously study the effects of its faults on evaluating five different MGR systems. The lesson is not to banish GTZAN...

  17. Assessment of the Economic Potential of Distributed Wind in Colorado, Minnesota, and New York

    Energy Technology Data Exchange (ETDEWEB)

    McCabe, Kevin [National Renewable Energy Laboratory (NREL), Golden, CO (United States); Sigrin, Benjamin O. [National Renewable Energy Laboratory (NREL), Golden, CO (United States); Lantz, Eric J. [National Renewable Energy Laboratory (NREL), Golden, CO (United States); Mooney, Meghan E. [National Renewable Energy Laboratory (NREL), Golden, CO (United States)

    2018-01-03

    This work seeks to identify current and future spatial distributions of economic potential for behind-the-meter distributed wind, serving primarily rural or suburban homes, farms, and manufacturing facilities in Colorado, Minnesota, and New York. These states were identified by technical experts based on their current favorability for distributed wind deployment. We use NREL's Distributed Wind Market Demand Model (dWind) (Lantz et al. 2017; Sigrin et al. 2016) to identify and rank counties in each of the states by their overall and per capita potential. From this baseline assessment, we also explore how and where improvements in cost, performance, and other market sensitivities affect distributed wind potential.

  18. Evaluating the use of different precipitation datasets in simulating a flood event

    Science.gov (United States)

    Akyurek, Z.; Ozkaya, A.

    2016-12-01

    Floods caused by convective storms in mountainous regions are sensitive to the temporal and spatial variability of rainfall. Space-time estimates of rainfall from weather radar, satellites and numerical weather prediction models can be a remedy to represent the pattern of rainfall, albeit with some inaccuracy. However, there is a strong need to evaluate the performance and limitations of these estimates in hydrology. This study provides a comparison of gauge, radar, satellite (Hydro-Estimator (HE)) and numerical weather prediction model (Weather Research and Forecasting (WRF)) precipitation datasets during an extreme flood event (22.11.2014) lasting 40 hours in Samsun, Turkey. Hourly rainfall data from 13 ground observation stations were used in the analyses. This event, with a peak discharge of 541 m3/sec, created flooding downstream of the Terme Basin. Comparisons were performed in two parts. First, the analyses were performed in an areal and point-based manner. Second, a semi-distributed hydrological model was used to assess the accuracy of the rainfall datasets in simulating river flows for the flood event. Kalman filtering was used for bias correction of the radar rainfall data against gauge measurements. Radar, gauge, corrected radar, HE and WRF rainfall data were used as model inputs. Generally, the HE product underestimates the cumulative rainfall amounts at all stations; the radar data also underestimate in the cumulative sense but keep consistency in the results. On the other hand, almost all stations in the WRF mean statistics computations have better results than the HE product but worse than the radar dataset. Point comparisons indicated that the trend of the rainfall is captured well by the radar rainfall estimation, but radar underestimates the maximum values. Relative to the cumulative gauge value, radar underestimated the cumulative rainfall amount by 32%. Contrary to other datasets, the bias of WRF is positive
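The bias-correction step can be sketched as a scalar Kalman filter tracking a slowly varying multiplicative gauge/radar bias. This is an illustrative simplification, not the study's implementation; the noise variances and data below are invented.

```python
def kalman_bias(gauge, radar, q=0.01, r=0.1):
    """Track a multiplicative radar bias b with a scalar Kalman filter,
    using gauge/radar ratios as noisy observations of the bias.
    q: process variance (how fast the bias drifts between hours),
    r: observation variance. Returns the bias estimate per time step."""
    b, p = 1.0, 1.0            # initial bias estimate and its variance
    out = []
    for g, x in zip(gauge, radar):
        p = p + q              # predict: bias assumed to random-walk
        if x > 0:              # update only when radar reported rain
            y = g / x          # noisy observation of the bias
            k = p / (p + r)    # Kalman gain
            b = b + k * (y - b)
            p = (1 - k) * p
        out.append(b)
    return out

# corrected radar value at each hour: radar[t] * out[t]
```

With a constant true bias of 2, the estimate converges toward 2 over successive updates.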

  19. Climate Change Impacts on the Potential Distribution of Eogystia hippophaecolus in China.

    Science.gov (United States)

    Li, Xue; Ge, Xuezhen; Chen, Linghong; Zhang, Linjing; Wang, Tao; Zong, Shixiang

    2018-05-28

    Seabuckthorn carpenter moth, Eogystia hippophaecolus (Hua, Chou, Fang, & Chen, 1990), is the most important boring pest of sea buckthorn (Hippophae rhamnoides L.) in northwestern China. It is responsible for the death of large areas of H. rhamnoides forest, seriously affecting the ecological environment and economic development of northwestern China. To clarify the potential distribution of E. hippophaecolus in China, the present study used the CLIMEX 4.0.0 model to project the potential distribution of the pest using historical climate data (1981-2010) and simulated future climate data (2011-2100) for China. Under historical climate conditions, E. hippophaecolus is projected to be distributed mainly between 27° N - 51° N and 74° E - 134° E, with favorable and highly favorable habitats accounting for 35.2% of the total potential distribution. Under future climate conditions, E. hippophaecolus would be distributed mainly between 27° N - 53° N and 74° E - 134° E, with the possibility of moving in a northwest direction. Under these conditions, the proportion of the total area providing a favorable or highly favorable habitat may decrease to about 33%. These results will help to identify the impact of climate change on the potential distribution of E. hippophaecolus, thereby providing a theoretical basis for monitoring and early forecasting of pest outbreaks. This article is protected by copyright. All rights reserved.

  20. Multiresolution persistent homology for excessively large biomolecular datasets

    Energy Technology Data Exchange (ETDEWEB)

    Xia, Kelin; Zhao, Zhixiong [Department of Mathematics, Michigan State University, East Lansing, Michigan 48824 (United States); Wei, Guo-Wei, E-mail: wei@math.msu.edu [Department of Mathematics, Michigan State University, East Lansing, Michigan 48824 (United States); Department of Electrical and Computer Engineering, Michigan State University, East Lansing, Michigan 48824 (United States); Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, Michigan 48824 (United States)

    2015-10-07

    Although persistent homology has emerged as a promising tool for the topological simplification of complex data, it is computationally intractable for large datasets. We introduce multiresolution persistent homology to handle excessively large datasets. We match the resolution with the scale of interest so as to represent large-scale datasets with appropriate resolution. We utilize the flexibility-rigidity index to assess the topological connectivity of the dataset and define a rigidity density for the filtration analysis. By appropriately tuning the resolution of the rigidity density, we are able to focus the topological lens on the scale of interest. The proposed multiresolution topological analysis is validated by a hexagonal fractal image which has three distinct scales. We further demonstrate the proposed method for extracting topological fingerprints from DNA molecules. In particular, the topological persistence of a virus capsid with 273 780 atoms is successfully analyzed, which would otherwise be inaccessible to the normal point cloud method and unreliable by coarse-grained multiscale persistent homology. The proposed method has also been successfully applied to protein domain classification, which is, to our knowledge, the first time that persistent homology has been used for practical protein domain analysis. The proposed multiresolution topological method has potential applications in arbitrary datasets, such as social networks, biological networks, and graphs.
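The resolution-tuning idea can be illustrated in one dimension: a rigidity density built as a sum of Gaussian kernels centred on atom positions, whose width sets which scale survives the filtration. This is a simplified sketch; the actual flexibility-rigidity index uses 3-D correlation kernels, and all positions below are invented.

```python
import math

def rigidity_density(grid, atoms, eta):
    """Simplified 1-D rigidity density: a sum of Gaussian kernels at the
    atom positions. The kernel width eta acts as the resolution knob:
    large eta blurs fine structure, small eta resolves individual atoms."""
    return [sum(math.exp(-((x - a) / eta) ** 2) for a in atoms) for x in grid]

atoms = [1.0, 1.2, 5.0]                            # two close atoms, one far
grid = [i * 0.1 for i in range(61)]                # sample points on [0, 6]
coarse = rigidity_density(grid, atoms, eta=2.0)    # one broad feature
fine = rigidity_density(grid, atoms, eta=0.1)      # resolves all three atoms
```

Filtering the density at different eta values then yields topological features at the matching scale.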

  1. Eccentricity samples: Implications on the potential and the velocity distribution

    Directory of Open Access Journals (Sweden)

    Cubarsi R.

    2017-01-01

    Full Text Available Planar and vertical epicycle frequencies and the local angular velocity are related to the derivatives, up to second order, of the local potential, and can be used to test the shape of the potential from stellar disc samples. These samples show a more complex velocity distribution than halo stars and should provide a more realistic test. We assume an axisymmetric potential allowing a mixture of independent ellipsoidal velocity distributions, of separable or Staeckel form, in cylindrical or spherical coordinates. We prove that the values of the local constants are not consistent with a potential that is additively separable in cylindrical coordinates, nor with a spherically symmetric potential. The simplest potential that fits the local constants is used to show that the harmonic and non-harmonic terms of the potential are equally important. The same analysis is used to estimate the local constants. Two families of nested subsamples, selected for decreasing planar and vertical eccentricities, are used to bear out the relation between the mean squared planar and vertical eccentricities and the velocity dispersions of the subsamples. According to the first-order epicycle model, the radial and vertical velocity components provide accurate information on the planar and vertical epicycle frequencies. However, it is impossible to account for the asymmetric drift, which introduces a systematic bias in the estimation of the third constant. Under a more general model, when the asymmetric drift is taken into account, the rotation velocity dispersions, together with their asymmetric drift, provide the correct fit for the local angular velocity. The consistency of the results shows that this new method based on the distribution of eccentricities is worth using for kinematic stellar samples. [Project of the Serbian Ministry of Education, Science and Technological Development, Grant no. 176011: Dynamics and Kinematics of Celestial Bodies and Systems]
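The connection between the local constants and the potential derivatives invoked above is the standard first-order epicycle relations for an axisymmetric potential Φ(R, z), evaluated in the disc plane:

```latex
\Omega^2(R) = \frac{1}{R}\left.\frac{\partial \Phi}{\partial R}\right|_{z=0},
\qquad
\kappa^2 = R\,\frac{d\Omega^2}{dR} + 4\Omega^2,
\qquad
\nu^2 = \left.\frac{\partial^2 \Phi}{\partial z^2}\right|_{z=0}
```

Here Ω is the local angular velocity and κ and ν are the planar and vertical epicycle frequencies, so measuring the three constants constrains the potential derivatives up to second order.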

  2. Spatial distribution of potential and positive Aedes aegypti breeding sites

    Directory of Open Access Journals (Sweden)

    Daniel Elías Cuartas

    2017-03-01

    Conclusions: The spatial relationship between positive and potential A. aegypti breeding sites both indoors and outdoors is dynamic and highly sensitive to the characteristics of each territory. Knowing how positive and potential breeding sites are distributed contributes to the prioritization of resources and actions in vector control programs.

  3. A signature-based method for indexing cell cycle phase distribution from microarray profiles

    Directory of Open Access Journals (Sweden)

    Mizuno Hideaki

    2009-03-01

    Full Text Available Abstract Background The cell cycle machinery interprets oncogenic signals and reflects the biology of cancers. To date, various methods for cell cycle phase estimation such as mitotic index, S phase fraction, and immunohistochemistry have provided valuable information on cancers (e.g. proliferation rate. However, those methods rely on one or few measurements and the scope of the information is limited. There is a need for more systematic cell cycle analysis methods. Results We developed a signature-based method for indexing cell cycle phase distribution from microarray profiles under consideration of cycling and non-cycling cells. A cell cycle signature masterset, composed of genes which express preferentially in cycling cells and in a cell cycle-regulated manner, was created to index the proportion of cycling cells in the sample. Cell cycle signature subsets, composed of genes whose expressions peak at specific stages of the cell cycle, were also created to index the proportion of cells in the corresponding stages. The method was validated using cell cycle datasets and quiescence-induced cell datasets. Analyses of a mouse tumor model dataset and human breast cancer datasets revealed variations in the proportion of cycling cells. When the influence of non-cycling cells was taken into account, "buried" cell cycle phase distributions were depicted that were oncogenic-event specific in the mouse tumor model dataset and were associated with patients' prognosis in the human breast cancer datasets. Conclusion The signature-based cell cycle analysis method presented in this report, would potentially be of value for cancer characterization and diagnostics.
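The indexing idea, scoring each sample by the expression of a signature gene set, can be sketched as a mean z-score across the set's genes. This is a minimal sketch, not the paper's method; the gene names and expression values below are illustrative.

```python
import statistics

def signature_index(expr, signature):
    """Score each sample by the mean z-scored expression of the signature
    genes present in the profile. expr: {gene: [value per sample]}.
    Genes absent from the profile are skipped."""
    n = len(next(iter(expr.values())))
    scores = [0.0] * n
    used = 0
    for gene in signature:
        values = expr.get(gene)
        if values is None:
            continue
        mu = statistics.mean(values)
        sd = statistics.pstdev(values) or 1.0   # guard against constant genes
        for i, v in enumerate(values):
            scores[i] += (v - mu) / sd
        used += 1
    return [s / used for s in scores] if used else scores
```

Applying the masterset this way indexes the proportion of cycling cells, and phase-specific subsets index the corresponding stages.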

  4. Bulk Data Movement for Climate Dataset: Efficient Data Transfer Management with Dynamic Transfer Adjustment

    International Nuclear Information System (INIS)

    Sim, Alexander; Balman, Mehmet; Williams, Dean; Shoshani, Arie; Natarajan, Vijaya

    2010-01-01

    Many scientific applications and experiments, such as high energy and nuclear physics, astrophysics, climate observation and modeling, combustion, nano-scale material sciences, and computational biology, generate extreme volumes of data with a large number of files. These data sources are distributed among national and international data repositories, and are shared by large numbers of geographically distributed scientists. A large portion of data is frequently accessed, and a large volume of data is moved from one place to another for analysis and storage. One challenging issue in such efforts is the limited network capacity for moving large datasets to explore and manage. The Bulk Data Mover (BDM), a data transfer management tool in the Earth System Grid (ESG) community, has been managing the massive dataset transfers efficiently with the pre-configured transfer properties in the environment where the network bandwidth is limited. Dynamic transfer adjustment was studied to enhance the BDM to handle significant end-to-end performance changes in the dynamic network environment as well as to control the data transfers for the desired transfer performance. We describe the results from the BDM transfer management for the climate datasets. We also describe the transfer estimation model and results from the dynamic transfer adjustment.

  5. Probabilistic and machine learning-based retrieval approaches for biomedical dataset retrieval

    Science.gov (United States)

    Karisani, Payam; Qin, Zhaohui S; Agichtein, Eugene

    2018-01-01

    Abstract The bioCADDIE dataset retrieval challenge brought together different approaches to retrieval of biomedical datasets relevant to a user’s query, expressed as a text description of a needed dataset. We describe experiments in applying a data-driven, machine learning-based approach to biomedical dataset retrieval as part of this challenge. We report on a series of experiments carried out to evaluate the performance of both probabilistic and machine learning-driven techniques from information retrieval, as applied to this challenge. Our experiments with probabilistic information retrieval methods, such as query term weight optimization, automatic query expansion and simulated user relevance feedback, demonstrate that automatically boosting the weights of important keywords in a verbose query is more effective than other methods. We also show that although there is a rich space of potential representations and features available in this domain, machine learning-based re-ranking models are not able to improve on probabilistic information retrieval techniques with the currently available training data. The models and algorithms presented in this paper can serve as a viable implementation of a search engine to provide access to biomedical datasets. The retrieval performance is expected to be further improved by using additional training data that is created by expert annotation, or gathered through usage logs, clicks and other processes during natural operation of the system. Database URL: https://github.com/emory-irlab/biocaddie
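The keyword-boosting finding can be sketched with a toy TF-IDF scorer in which important query terms carry larger weights. The weights and corpus below are invented, and the challenge systems used far richer retrieval models.

```python
import math
from collections import Counter

def score(query_weights, doc_tokens, corpus):
    """TF-IDF score of one document against a weighted query.
    Raising a keyword's weight mimics boosting important terms
    in a verbose dataset-description query."""
    n = len(corpus)
    tf = Counter(doc_tokens)
    s = 0.0
    for term, w in query_weights.items():
        df = sum(1 for d in corpus if term in d)
        if df == 0 or term not in tf:
            continue
        idf = math.log((n + 1) / (df + 1)) + 1
        s += w * tf[term] * idf
    return s

corpus = [["rna", "seq", "human"], ["microarray", "mouse"], ["rna", "mouse"]]
# boosting "rna" over the generic term promotes the RNA-seq document
weights = {"rna": 2.0, "human": 1.0}
ranked = sorted(range(3), key=lambda i: -score(weights, corpus[i], corpus))
```

Documents matching the boosted keyword rise to the top of the ranking.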

  6. Creating a Regional MODIS Satellite-Driven Net Primary Production Dataset for European Forests

    OpenAIRE

    Neumann, Mathias; Moreno, Adam; Thurnher, Christopher; Mues, Volker; Härkönen, Sanna; Mura, Matteo; Bouriaud, Olivier; Lang, Mait; Cardellini, Giuseppe; Thivolle-Cazat, Alain; Bronisz, Karol; Merganic, Jan; Alberdi, Iciar; Astrup, Rasmus; Mohren, Frits

    2016-01-01

    Net primary production (NPP) is an important ecological metric for studying forest ecosystems and their carbon sequestration, for assessing the potential supply of food or timber and quantifying the impacts of climate change on ecosystems. The global MODIS NPP dataset using the MOD17 algorithm provides valuable information for monitoring NPP at 1-km resolution. Since coarse-resolution global climate data are used, the global dataset may contain uncertainties for Europe. We used a 1-km daily g...

  7. High resolution population distribution maps for Southeast Asia in 2010 and 2015.

    Science.gov (United States)

    Gaughan, Andrea E; Stevens, Forrest R; Linard, Catherine; Jia, Peng; Tatem, Andrew J

    2013-01-01

    Spatially accurate, contemporary data on human population distributions are vitally important to many applied and theoretical researchers. The Southeast Asia region has undergone rapid urbanization and population growth over the past decade, yet existing spatial population distribution datasets covering the region are based principally on population count data from censuses circa 2000, with often insufficient spatial resolution or input data to map settlements precisely. Here we outline approaches to construct a database of GIS-linked circa 2010 census data and methods used to construct fine-scale (∼100 meters spatial resolution) population distribution datasets for each country in the Southeast Asia region. Landsat-derived settlement maps and land cover information were combined with ancillary datasets on infrastructure to model population distributions for 2010 and 2015. These products were compared with those from two other methods used to construct commonly used global population datasets. Results indicate mapping accuracies are consistently higher when incorporating land cover and settlement information into the AsiaPop modelling process. Using existing data, it is possible to produce detailed, contemporary and easily updatable population distribution datasets for Southeast Asia. The 2010 and 2015 datasets produced are freely available as a product of the AsiaPop Project and can be downloaded from: www.asiapop.org.
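The core modelling step, redistributing an administrative unit's census count across grid cells in proportion to weights derived from land cover and settlement layers, can be sketched as follows. The weights are invented; AsiaPop's actual weighting layers are far richer.

```python
def redistribute(census_total, cells):
    """Dasymetric-mapping sketch: spread a census total across grid cells
    in proportion to a per-cell weight. cells: list of (cell_id, weight)."""
    total_w = sum(w for _, w in cells)
    return {cell: census_total * w / total_w for cell, w in cells}

# illustrative weights: 3 for built-up land, 1 for cropland, 0 for water
pop = redistribute(10000, [("built", 3), ("crop", 1), ("water", 0)])
```

The weight layer is where settlement maps and ancillary infrastructure data enter the model, which is why mapping accuracy improves when they are included.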

  8. High resolution population distribution maps for Southeast Asia in 2010 and 2015.

    Directory of Open Access Journals (Sweden)

    Andrea E Gaughan

    Full Text Available Spatially accurate, contemporary data on human population distributions are vitally important to many applied and theoretical researchers. The Southeast Asia region has undergone rapid urbanization and population growth over the past decade, yet existing spatial population distribution datasets covering the region are based principally on population count data from censuses circa 2000, with often insufficient spatial resolution or input data to map settlements precisely. Here we outline approaches to construct a database of GIS-linked circa 2010 census data and methods used to construct fine-scale (∼100 meters spatial resolution) population distribution datasets for each country in the Southeast Asia region. Landsat-derived settlement maps and land cover information were combined with ancillary datasets on infrastructure to model population distributions for 2010 and 2015. These products were compared with those from two other methods used to construct commonly used global population datasets. Results indicate mapping accuracies are consistently higher when incorporating land cover and settlement information into the AsiaPop modelling process. Using existing data, it is possible to produce detailed, contemporary and easily updatable population distribution datasets for Southeast Asia. The 2010 and 2015 datasets produced are freely available as a product of the AsiaPop Project and can be downloaded from: www.asiapop.org.

  9. Editorial: Datasets for Learning Analytics

    NARCIS (Netherlands)

    Dietze, Stefan; George, Siemens; Davide, Taibi; Drachsler, Hendrik

    2018-01-01

    The European LinkedUp and LACE (Learning Analytics Community Exchange) projects have been responsible for setting up a series of data challenges at the LAK conferences 2013 and 2014 around the LAK dataset. The LAK dataset consists of a rich collection of full-text publications in the domain of

  10. Maxent modelling for predicting the potential distribution of Thai Palms

    DEFF Research Database (Denmark)

    Tovaranonte, Jantrararuk; Barfod, Anders S.; Overgaard, Anne Blach

    2011-01-01

    on presence data. The aim was to identify potential hot spot areas, assess the determinants of palm distribution ranges, and provide a firmer knowledge base for future conservation actions. We focused on a relatively small number of climatic, environmental and spatial variables in order to avoid...... overprediction of species distribution ranges. The models with the best predictive power were found by calculating the area under the curve (AUC) of receiver-operating characteristic (ROC). Here, we provide examples of contrasting predicted species distribution ranges as well as a map of modeled palm diversity...
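The AUC criterion used to select the best models has a simple rank interpretation, the probability that a randomly chosen presence site outscores a randomly chosen background site, which can be computed directly. A minimal sketch, independent of the Maxent implementation, with invented scores:

```python
def auc(scores_pos, scores_neg):
    """Area under the ROC curve via the Mann-Whitney rank identity:
    the fraction of (presence, background) pairs in which the presence
    site scores higher, counting ties as one half."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))
```

An AUC of 0.5 means the model discriminates no better than chance; 1.0 means perfect ranking of presences above background.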

  11. Climate Forcing Datasets for Agricultural Modeling: Merged Products for Gap-Filling and Historical Climate Series Estimation

    Science.gov (United States)

    Ruane, Alex C.; Goldberg, Richard; Chryssanthacopoulos, James

    2014-01-01

    The AgMERRA and AgCFSR climate forcing datasets provide daily, high-resolution, continuous, meteorological series over the 1980-2010 period designed for applications examining the agricultural impacts of climate variability and climate change. These datasets combine daily resolution data from retrospective analyses (the Modern-Era Retrospective Analysis for Research and Applications, MERRA, and the Climate Forecast System Reanalysis, CFSR) with in situ and remotely-sensed observational datasets for temperature, precipitation, and solar radiation, leading to substantial reductions in bias in comparison to a network of 2324 agricultural-region stations from the Hadley Integrated Surface Dataset (HadISD). Results compare favorably against the original reanalyses as well as the leading climate forcing datasets (Princeton, WFD, WFD-EI, and GRASP), and AgMERRA distinguishes itself with substantially improved representation of daily precipitation distributions and extreme events owing to its use of the MERRA-Land dataset. These datasets also peg relative humidity to the maximum temperature time of day, allowing for more accurate representation of the diurnal cycle of near-surface moisture in agricultural models. AgMERRA and AgCFSR enable a number of ongoing investigations in the Agricultural Model Intercomparison and Improvement Project (AgMIP) and related research networks, and may be used to fill gaps in historical observations as well as a basis for the generation of future climate scenarios.

  12. Leptospirosis in Mexico: Epidemiology and Potential Distribution of Human Cases

    Science.gov (United States)

    Sánchez-Montes, Sokani; Espinosa-Martínez, Deborah V.; Ríos-Muñoz, César A.; Berzunza-Cruz, Miriam; Becker, Ingeborg

    2015-01-01

    Background Leptospirosis is widespread in Mexico, yet the potential distribution and risk of the disease remain unknown. Methodology/Principal Findings We analysed morbidity and mortality according to age and gender based on three sources of data reported by the Ministry of Health and the National Institute of Geography and Statistics of Mexico, for the decade 2000–2010. A total of 1,547 cases were reported in 27 states, the majority of which were registered during the rainy season, and the most affected age group was 25–44 years old. Although leptospirosis has been reported as an occupational disease of males, analysis of morbidity in Mexico showed no male preference. A total of 198 deaths were registered in 21 states, mainly in urban settings. Mortality was higher in males (61.1%) as compared to females (38.9%), and the case fatality ratio was also increased in males. The overall case fatality ratio in Mexico was elevated (12.8%), as compared to other countries. We additionally determined the potential disease distribution by examining the spatial epidemiology combined with spatial modeling using ecological niche modeling techniques. We identified regions where leptospirosis could be present and created a potential distribution map using bioclimatic variables derived from temperature and precipitation. Our data show that the distribution of the cases was more related to temperature (75%) than to precipitation variables. Ecological niche modeling showed predictive areas that were widely distributed in central and southern Mexico, excluding areas characterized by extreme climates. Conclusions/Significance In conclusion, an epidemiological surveillance of leptospirosis is recommended in Mexico, since 55.7% of the country has environmental conditions fulfilling the criteria that favor the presence of the disease. PMID:26207827

  13. Generalised extreme value distributions provide a natural hypothesis for the shape of seed mass distributions.

    Directory of Open Access Journals (Sweden)

    Will Edwards

    Full Text Available Among co-occurring species, values for functionally important plant traits span orders of magnitude, are uni-modal, and generally positively skewed. Such data are usually log-transformed "for normality" but no convincing mechanistic explanation for a log-normal expectation exists. Here we propose a hypothesis for the distribution of seed masses based on generalised extreme value distributions (GEVs), a class of probability distributions used in climatology to characterise the impact of event magnitudes and frequencies; events that impose strong directional selection on biological traits. In tests involving datasets from 34 locations across the globe, GEVs described log10 seed mass distributions as well or better than conventional normalising statistics in 79% of cases, and revealed a systematic tendency for an overabundance of small seed sizes associated with low latitudes. GEVs characterise disturbance events experienced in a location to which individual species' life histories could respond, providing a natural, biological explanation for trait expression that is lacking from all previous hypotheses attempting to describe trait distributions in multispecies assemblages. We suggest that GEVs could provide a mechanistic explanation for plant trait distributions and potentially link biology and climatology under a single paradigm.
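For reference, the GEV cumulative distribution function for shape parameter ξ ≠ 0 can be evaluated directly from its closed form. This is a minimal sketch of the distribution itself; the per-location parameter fitting done in the study is not reproduced here.

```python
import math

def gev_cdf(x, mu, sigma, xi):
    """GEV cumulative distribution function for shape xi != 0:
    F(x) = exp(-t(x)) with t(x) = (1 + xi*(x - mu)/sigma)^(-1/xi).
    For xi > 0 the support is bounded below at mu - sigma/xi and the
    right tail is heavy, matching positively skewed trait data."""
    t = 1.0 + xi * (x - mu) / sigma
    if t <= 0:                      # outside the distribution's support
        return 0.0 if xi > 0 else 1.0
    return math.exp(-t ** (-1.0 / xi))
```

At x = mu the CDF equals exp(-1) for any sigma and xi, a convenient sanity check when fitting.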

  14. Sexual differentiation in the distribution potential of northern jaguars (Panthera onca)

    Science.gov (United States)

    Erin E. Boydston; Carlos A. Lopez Gonzalez

    2005-01-01

    We estimated the potential geographic distribution of jaguars in the southwestern United States and northwestern Mexico by modeling the jaguar ecological niche from occurrence records. We modeled separately the distributions of males and females, assuming records of females probably represented established home ranges while male records likely included dispersal...

  15. Integration of geophysical datasets by a conjoint probability tomography approach: application to Italian active volcanic areas

    Directory of Open Access Journals (Sweden)

    D. Patella

    2008-06-01

    Full Text Available We expand the theory of probability tomography to the integration of different geophysical datasets. The aim of the new method is to improve the information quality using a conjoint occurrence probability function addressed to highlight the existence of common sources of anomalies. The new method is tested on gravity, magnetic and self-potential datasets collected in the volcanic area of Mt. Vesuvius (Naples), and on gravity and dipole geoelectrical datasets collected in the volcanic area of Mt. Etna (Sicily). The application demonstrates that, from a probabilistic point of view, the integrated analysis can delineate the signature of some important volcanic targets better than the analysis of the tomographic image of each dataset considered separately.

  16. An Annotated Dataset of 14 Meat Images

    DEFF Research Database (Denmark)

    Stegmann, Mikkel Bille

    2002-01-01

    This note describes a dataset consisting of 14 annotated images of meat. Points of correspondence are placed on each image. As such, the dataset can be readily used for building statistical models of shape. Further, format specifications and terms of use are given.

  17. Comparison of recent SnIa datasets

    International Nuclear Information System (INIS)

    Sanchez, J.C. Bueno; Perivolaropoulos, L.; Nesseris, S.

    2009-01-01

    We rank the six latest Type Ia supernova (SnIa) datasets (Constitution (C), Union (U), ESSENCE (Davis) (E), Gold06 (G), SNLS 1yr (S) and SDSS-II (D)) in the context of the Chevalier-Polarski-Linder (CPL) parametrization w(a) = w0 + w1(1 − a), according to their Figure of Merit (FoM), their consistency with the cosmological constant (ΛCDM), their consistency with standard rulers (Cosmic Microwave Background (CMB) and Baryon Acoustic Oscillations (BAO)) and their mutual consistency. We find a significant improvement of the FoM (defined as the inverse area of the 95.4% parameter contour) with the number of SnIa in these datasets ((C) highest FoM, (U), (G), (D), (E), (S) lowest FoM). Standard rulers (CMB+BAO) have a better FoM by about a factor of 3, compared to the highest-FoM SnIa dataset (C). We also find that the ranking sequence based on consistency with ΛCDM is identical to the corresponding ranking based on consistency with standard rulers ((S) most consistent, (D), (C), (E), (U), (G) least consistent). The ranking sequence of the datasets changes, however, when we consider consistency with an expansion history corresponding to evolving dark energy with (w0, w1) = (−1.4, 2), crossing the phantom divide line w = −1 (it is practically reversed to (G), (U), (E), (S), (D), (C)). The SALT2 and MLCS2k2 fitters are also compared and some peculiar features of the SDSS-II dataset when standardized with the MLCS2k2 fitter are pointed out. Finally, we construct a statistic to estimate the internal consistency of a collection of SnIa datasets. We find that even though there is good consistency among most samples taken from the above datasets, this consistency decreases significantly when the Gold06 (G) dataset is included in the sample
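The CPL parametrization and the Figure of Merit used for the ranking can be written compactly as:

```latex
w(a) = w_0 + w_1\,(1 - a),
\qquad
\mathrm{FoM} = \left[\,\text{area of the } 95.4\%\ \text{contour in the } (w_0, w_1)\ \text{plane}\,\right]^{-1}
```

A tighter constraint on the dark energy equation of state shrinks the contour and raises the FoM, which is why the larger compilations rank higher.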

  18. SIMADL: Simulated Activities of Daily Living Dataset

    Directory of Open Access Journals (Sweden)

    Talal Alshammari

    2018-04-01

    Full Text Available With the realisation of the Internet of Things (IoT) paradigm, the analysis of the Activities of Daily Living (ADLs), in a smart home environment, is becoming an active research domain. The existence of representative datasets is a key requirement to advance the research in smart home design. Such datasets are an integral part of the visualisation of new smart home concepts as well as the validation and evaluation of emerging machine learning models. Machine learning techniques that can learn ADLs from sensor readings are used to classify, predict and detect anomalous patterns. Such techniques require data that represent relevant smart home scenarios, for training, testing and validation. However, the development of such machine learning techniques is limited by the lack of real smart home datasets, due to the excessive cost of building real smart homes. This paper provides two datasets for classification and anomaly detection. The datasets are generated using OpenSHS (Open Smart Home Simulator), which is a simulation software for dataset generation. OpenSHS records the daily activities of a participant within a virtual environment. Seven participants simulated their ADLs for different contexts, e.g., weekdays, weekends, mornings and evenings. Eighty-four files in total were generated, representing approximately 63 days' worth of activities. Forty-two files of classification of ADLs were simulated in the classification dataset and the other forty-two files are for anomaly detection problems in which anomalous patterns were simulated and injected into the anomaly detection dataset.

  19. The NOAA Dataset Identifier Project

    Science.gov (United States)

    de la Beaujardiere, J.; Mccullough, H.; Casey, K. S.

    2013-12-01

    The US National Oceanic and Atmospheric Administration (NOAA) initiated a project in 2013 to assign persistent identifiers to datasets archived at NOAA and to create informational landing pages about those datasets. The goals of this project are to enable the citation of datasets used in products and results in order to help provide credit to data producers, to support traceability and reproducibility, and to enable tracking of data usage and impact. A secondary goal is to encourage the submission of datasets for long-term preservation, because only archived datasets will be eligible for a NOAA-issued identifier. A team was formed with representatives from the National Geophysical, Oceanographic, and Climatic Data Centers (NGDC, NODC, NCDC) to resolve questions including which identifier scheme to use (answer: Digital Object Identifier - DOI), whether or not to embed semantics in identifiers (no), the level of granularity at which to assign identifiers (as coarsely as reasonable), how to handle ongoing time-series data (do not break into chunks), creation mechanism for the landing page (stylesheet from formal metadata record preferred), and others. Decisions made and implementation experience gained will inform the writing of a Data Citation Procedural Directive to be issued by the Environmental Data Management Committee in 2014. Several identifiers have been issued as of July 2013, with more on the way. NOAA is now reporting the number as a metric to federal Open Government initiatives. This paper will provide further details and status of the project.

  20. Control Measure Dataset

    Data.gov (United States)

    U.S. Environmental Protection Agency — The EPA Control Measure Dataset is a collection of documents describing air pollution control available to regulated facilities for the control and abatement of air...

  1. The Kinetics Human Action Video Dataset

    OpenAIRE

    Kay, Will; Carreira, Joao; Simonyan, Karen; Zhang, Brian; Hillier, Chloe; Vijayanarasimhan, Sudheendra; Viola, Fabio; Green, Tim; Back, Trevor; Natsev, Paul; Suleyman, Mustafa; Zisserman, Andrew

    2017-01-01

    We describe the DeepMind Kinetics human action video dataset. The dataset contains 400 human action classes, with at least 400 video clips for each action. Each clip lasts around 10s and is taken from a different YouTube video. The actions are human focussed and cover a broad range of classes including human-object interactions such as playing instruments, as well as human-human interactions such as shaking hands. We describe the statistics of the dataset, how it was collected, and give some ...

  2. Potential impacts of climatic change upon geographical distributions of birds

    DEFF Research Database (Denmark)

    Huntley, Brian; Collingham, Yvonne C.; Green, Rhys E.

    2006-01-01

    Potential climatic changes of the near future have important characteristics that differentiate them from the largest magnitude and most rapid of climatic changes of the Quaternary. These potential climatic changes are thus a cause for considerable concern in terms of their possible impacts upon ... biodiversity. Birds, in common with other terrestrial organisms, are expected to exhibit one of two general responses to climatic change: they may adapt to the changed conditions without shifting location, or they may show a spatial response, adjusting their geographical distribution in response ... likely to decrease. Species with restricted distributions and specialized species of particular biomes are likely to suffer the greatest impacts. Migrant species are likely to suffer especially large impacts as climatic change alters both their breeding and wintering areas, as well as critical stopover ...

  3. Distributed terascale volume visualization using distributed shared virtual memory

    KAUST Repository

    Beyer, Johanna; Hadwiger, Markus; Schneider, Jens; Jeong, Wonki; Pfister, Hanspeter

    2011-01-01

    Table 1 illustrates the impact of different distribution unit sizes, different screen resolutions, and numbers of GPU nodes. We use two and four GPUs (NVIDIA Quadro 5000 with 2.5 GB memory) and a mouse cortex EM dataset (see Figure 2) of resolution

  4. Creation of the Naturalistic Engagement in Secondary Tasks (NEST) distracted driving dataset.

    Science.gov (United States)

    Owens, Justin M; Angell, Linda; Hankey, Jonathan M; Foley, James; Ebe, Kazutoshi

    2015-09-01

    Distracted driving has become a topic of critical importance to driving safety research over the past several decades. Naturalistic driving data offer a unique opportunity to study how drivers engage with secondary tasks in real-world driving; however, the complexities involved with identifying and coding relevant epochs of naturalistic data have limited its accessibility to the general research community. This project was developed to help address this problem by creating an accessible dataset of driver behavior and situational factors observed during distraction-related safety-critical events and baseline driving epochs, using the Strategic Highway Research Program 2 (SHRP2) naturalistic dataset. The new NEST (Naturalistic Engagement in Secondary Tasks) dataset was created using crashes and near-crashes from the SHRP2 dataset that were identified as including secondary task engagement as a potential contributing factor. Data coding included frame-by-frame video analysis of secondary task and hands-on-wheel activity, as well as summary event information. In addition, information about each secondary task engagement within the trip prior to the crash/near-crash was coded at a higher level. Data were also coded for four baseline epochs and trips per safety-critical event. 1,180 events and baseline epochs were coded, and a dataset was constructed. The project team is currently working to determine the most useful way to allow broad public access to the dataset. We anticipate that the NEST dataset will be extraordinarily useful in allowing qualified researchers access to timely, real-world data concerning how drivers interact with secondary tasks during safety-critical events and baseline driving. The coded dataset developed for this project will allow future researchers to have access to detailed data on driver secondary task engagement in the real world. It will be useful for standalone research, as well as for integration with additional SHRP2 data to enable the

  5. Characterization of electrical conductivity of carbon fiber reinforced plastic using surface potential distribution

    Science.gov (United States)

    Kikunaga, Kazuya; Terasaki, Nao

    2018-04-01

    A new method of evaluating electrical conductivity in a structural material such as carbon fiber reinforced plastic (CFRP) using surface potential is proposed. After the CFRP was charged by corona discharge, the surface potential distribution was measured by scanning a vibrating linear array sensor along the object surface with a high spatial resolution over a short duration. A correlation between the weave pattern of the CFRP and the surface potential distribution was observed. This result indicates that it is possible to evaluate the electrical conductivity of a material comprising conducting and insulating regions.

  6. Distribution of genes associated with yield potential and water ...

    Indian Academy of Sciences (India)

    Supplementary data: Distribution of genes associated with yield potential and water-saving in. Chinese Zone II wheat detected by developed functional markers. Zhenxian Gao, Zhanliang Shi, Aimin Zhang and Jinkao Guo. J. Genet. 94, 35–42. Table 1. Functional markers for high-yield or water-saving genes in wheat and ...

  7. Distributed Data Management and Distributed File Systems

    CERN Document Server

    Girone, Maria

    2015-01-01

    The LHC program has been successful in part due to the globally distributed computing resources used for collecting, serving, processing, and analyzing the large LHC datasets. The introduction of distributed computing early in the LHC program spawned the development of new technologies and techniques to synchronize information and data between physically separated computing centers. Two of the most challenging services are the distributed file systems and the distributed data management systems. In this paper I will discuss how we have evolved from local site services to more globally independent services in the areas of distributed file systems and data management, and how these capabilities may continue to evolve into the future. I will address the design choices, the motivations, and the future evolution of the computing systems used for High Energy Physics.

  8. CLARA-A1: a cloud, albedo, and radiation dataset from 28 yr of global AVHRR data

    Directory of Open Access Journals (Sweden)

    K.-G. Karlsson

    2013-05-01

    Full Text Available A new satellite-derived climate dataset – denoted CLARA-A1 ("The CM SAF cLoud, Albedo and RAdiation dataset from AVHRR data") – is described. The dataset covers the 28 yr period from 1982 until 2009 and consists of cloud, surface albedo, and radiation budget products derived from the AVHRR (Advanced Very High Resolution Radiometer) sensor carried by polar-orbiting operational meteorological satellites. Its content, anticipated accuracies, limitations, and potential applications are described. The dataset is produced by the EUMETSAT Climate Monitoring Satellite Application Facility (CM SAF) project. The dataset has its strengths in its long duration, its foundation upon a homogenized AVHRR radiance data record, and in some unique features, e.g. the availability of 28 yr of summer surface albedo and cloudiness parameters over the polar regions. Quality characteristics are also well investigated, and particularly useful results can be found over the tropics, mid to high latitudes and over nearly all oceanic areas. Being the first CM SAF dataset of its kind, an intensive evaluation of the quality of the datasets was performed, and major findings with regard to merits and shortcomings of the datasets are reported. However, the CM SAF's long-term commitment to perform two additional reprocessing events within the time frame 2013–2018 will allow proper handling of limitations as well as upgrading the dataset with new features (e.g. uncertainty estimates) and extension of the temporal coverage.

  9. Fluxnet Synthesis Dataset Collaboration Infrastructure

    Energy Technology Data Exchange (ETDEWEB)

    Agarwal, Deborah A. [Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States); Humphrey, Marty [Univ. of Virginia, Charlottesville, VA (United States); van Ingen, Catharine [Microsoft. San Francisco, CA (United States); Beekwilder, Norm [Univ. of Virginia, Charlottesville, VA (United States); Goode, Monte [Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States); Jackson, Keith [Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States); Rodriguez, Matt [Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States); Weber, Robin [Univ. of California, Berkeley, CA (United States)

    2008-02-06

    The Fluxnet synthesis dataset originally compiled for the La Thuile workshop contained approximately 600 site years. Since the workshop, several additional site years have been added and the dataset now contains over 920 site years from over 240 sites. A data refresh update is expected to increase those numbers in the next few months. The ancillary data describing the sites continues to evolve as well. There are on the order of 120 site contacts, and 60 proposals involving around 120 researchers have been approved to use the data. The size and complexity of the dataset and collaboration has led to a new approach to providing access to the data and collaboration support. The support team attended the workshop and worked closely with the attendees and the Fluxnet project office to define the requirements for the support infrastructure. As a result of this effort, a new website (http://www.fluxdata.org) has been created to provide access to the Fluxnet synthesis dataset. This new web site is based on a scientific data server which enables browsing of the data on-line, data download, and version tracking. We leverage database and data analysis tools such as OLAP data cubes and web reports to enable browser and Excel pivot table access to the data.

  10. Distribution of potentially hazardous phases in the subsurface at Yucca Mountain, Nevada

    International Nuclear Information System (INIS)

    Guthrie, G.D. Jr.; Bish, D.L.; Chipera, S.J.; Raymond, R. Jr.

    1995-05-01

    Drilling, trenching, excavation of the Exploratory Studies Facility, and other surface and underground-disturbing activities have the potential to release minerals into the environment from tuffs at Yucca Mountain, Nevada. Some of these minerals may be potential respiratory health hazards. Therefore, an understanding of the distribution of the minerals that may potentially be liberated during site characterization and operation of the potential repository is crucial to ensuring worker and public safety. Analysis of previously reported mineralogy of Yucca Mountain tuffs, using data and criteria from the International Agency for Research on Cancer (IARC), suggests that the following minerals are of potential concern: quartz, cristobalite, tridymite, opal-CT, erionite, mordenite, and palygorskite. The authors have re-evaluated the three-dimensional mineral distribution at Yucca Mountain above the static water level, both in bulk-rock samples and in fractures, using quantitative X-ray powder diffraction analysis. Erionite, mordenite, and palygorskite occur primarily in fractures; the crystalline-silica minerals quartz, cristobalite, and tridymite are major bulk-rock phases. Erionite occurs in the altered zone just above the lower Topopah Spring Member vitrophyre, and an occurrence below the vitrophyre but above the Calico Hills has recently been identified. In this latter occurrence, erionite is present in the matrix at levels up to 35 wt%. Mordenite and palygorskite occur throughout the vadose zone nearly to the surface. Opal-CT is limited to zeolitic horizons.

  11. Introducing a Web API for Dataset Submission into a NASA Earth Science Data Center

    Science.gov (United States)

    Moroni, D. F.; Quach, N.; Francis-Curley, W.

    2016-12-01

    As the landscape of data becomes increasingly more diverse in the domain of Earth Science, the challenges of managing and preserving data become more onerous and complex, particularly for data centers on fixed budgets and limited staff. Many solutions already exist to ease the cost burden for the downstream component of the data lifecycle, yet most archive centers are still racing to keep up with the influx of new data that still needs to find a quasi-permanent resting place. For instance, having well-defined metadata that is consistent across the entire data landscape provides for well-managed and preserved datasets throughout the latter end of the data lifecycle. Translators between different metadata dialects are already in operational use, and facilitate keeping older datasets relevant in today's world of rapidly evolving metadata standards. However, very little is done to address the first phase of the lifecycle, which deals with the entry of both data and the corresponding metadata into a system that is traditionally opaque and closed off to external data producers, thus resulting in a significant bottleneck to the dataset submission process. The ATRAC system was the NOAA NCEI's answer to this previously obfuscated barrier to scientists wishing to find a home for their climate data records, providing a web-based entry point to submit timely and accurate metadata and information about a very specific dataset. A couple of NASA's Distributed Active Archive Centers (DAACs) have implemented their own versions of a web-based dataset and metadata submission form including the ASDC and the ORNL DAAC. The Physical Oceanography DAAC is the most recent in the list of NASA-operated DAACs who have begun to offer their own web-based dataset and metadata submission services to data producers. What makes the PO.DAAC dataset and metadata submission service stand out from these pre-existing services is the option of utilizing both a web browser GUI and a RESTful API to

  12. Simulation of Smart Home Activity Datasets

    Directory of Open Access Journals (Sweden)

    Jonathan Synnott

    2015-06-01

    Full Text Available A globally ageing population is resulting in an increased prevalence of chronic conditions which affect older adults. Such conditions require long-term care and management to maximize quality of life, placing an increasing strain on healthcare resources. Intelligent environments such as smart homes facilitate long-term monitoring of activities in the home through the use of sensor technology. Access to sensor datasets is necessary for the development of novel activity monitoring and recognition approaches. Access to such datasets is limited due to issues such as sensor cost, availability and deployment time. The use of simulated environments and sensors may address these issues and facilitate the generation of comprehensive datasets. This paper provides a review of existing approaches for the generation of simulated smart home activity datasets, including model-based approaches and interactive approaches which implement virtual sensors, environments and avatars. The paper also provides recommendations for future work in intelligent environment simulation.

  13. Simulation of Smart Home Activity Datasets.

    Science.gov (United States)

    Synnott, Jonathan; Nugent, Chris; Jeffers, Paul

    2015-06-16

    A globally ageing population is resulting in an increased prevalence of chronic conditions which affect older adults. Such conditions require long-term care and management to maximize quality of life, placing an increasing strain on healthcare resources. Intelligent environments such as smart homes facilitate long-term monitoring of activities in the home through the use of sensor technology. Access to sensor datasets is necessary for the development of novel activity monitoring and recognition approaches. Access to such datasets is limited due to issues such as sensor cost, availability and deployment time. The use of simulated environments and sensors may address these issues and facilitate the generation of comprehensive datasets. This paper provides a review of existing approaches for the generation of simulated smart home activity datasets, including model-based approaches and interactive approaches which implement virtual sensors, environments and avatars. The paper also provides recommendations for future work in intelligent environment simulation.

  14. LenoxKaplan_Role of natural gas in meeting electric sector emissions reduction strategy_dataset

    Data.gov (United States)

    U.S. Environmental Protection Agency — This dataset is for an analysis that used the MARKAL linear optimization model to compare the carbon emissions profiles and system-wide global warming potential of...

  15. Vikodak--A Modular Framework for Inferring Functional Potential of Microbial Communities from 16S Metagenomic Datasets.

    Directory of Open Access Journals (Sweden)

    Sunil Nagpal

    Full Text Available The overall metabolic/functional potential of any given environmental niche is a function of the sum total of genes/proteins/enzymes that are encoded and expressed by the various interacting microbes residing in that niche. Consequently, prior (collated) information pertaining to the genes and enzymes encoded by the resident microbes can aid in indirectly (re)constructing/inferring the metabolic/functional potential of a given microbial community (given its taxonomic abundance profile). In this study, we present Vikodak, a multi-modular package that is based on the above assumption and automates inferring and/or comparing the functional characteristics of an environment using taxonomic abundance generated from one or more environmental sample datasets. With the underlying assumptions of co-metabolism and independent contributions of different microbes in a community, a concerted effort has been made to accommodate microbial co-existence patterns in the various modules incorporated in Vikodak. Validation experiments on over 1400 metagenomic samples have confirmed the utility of Vikodak in (a) deciphering enzyme abundance profiles of any KEGG metabolic pathway, (b) functional resolution of distinct metagenomic environments, (c) inferring patterns of functional interaction between resident microbes, and (d) automating statistical comparison of functional features of studied microbiomes. Novel features incorporated in Vikodak also facilitate automatic removal of false positives and spurious functional predictions. With novel provisions for comprehensive functional analysis, inclusion of microbial co-existence pattern based algorithms, automated inter-environment comparisons, in-depth analysis of individual metabolic pathways, and greater flexibility at the user end, Vikodak is expected to be an important value addition to the family of existing tools for 16S-based function prediction. A web implementation of Vikodak can be publicly accessed at: http
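The paper's core assumption (community function as the abundance-weighted sum of each taxon's gene content, with independent contributions) can be sketched in a few lines of Python; the taxa, pathways, and matrix entries below are made up for illustration and are not Vikodak's actual reference data:

```python
import numpy as np

# Hypothetical taxon-by-function matrix: rows = taxa, columns = functions
# (e.g. KEGG pathways), entries = gene copies contributing to each function.
taxa = ["taxonA", "taxonB", "taxonC"]
functions = ["pathway1", "pathway2"]
gene_content = np.array([
    [2.0, 0.0],
    [1.0, 3.0],
    [0.0, 1.0],
])

# Relative taxonomic abundance profile for one environmental sample.
abundance = np.array([0.5, 0.3, 0.2])

# Assuming independent contributions of each taxon, the inferred community
# functional profile is simply the abundance-weighted sum over taxa.
functional_profile = abundance @ gene_content
print(dict(zip(functions, functional_profile)))
```

Comparing two environments then reduces to comparing their `functional_profile` vectors, which is the kind of inter-environment comparison the package automates.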

  16. Distributed Generation Market Demand Model (dGen): Documentation

    Energy Technology Data Exchange (ETDEWEB)

    Sigrin, Benjamin [National Renewable Energy Lab. (NREL), Golden, CO (United States); Gleason, Michael [National Renewable Energy Lab. (NREL), Golden, CO (United States); Preus, Robert [National Renewable Energy Lab. (NREL), Golden, CO (United States); Baring-Gould, Ian [National Renewable Energy Lab. (NREL), Golden, CO (United States); Margolis, Robert [National Renewable Energy Lab. (NREL), Golden, CO (United States)

    2016-02-01

    The Distributed Generation Market Demand model (dGen) is a geospatially rich, bottom-up, market-penetration model that simulates the potential adoption of distributed energy resources (DERs) for residential, commercial, and industrial entities in the continental United States through 2050. The National Renewable Energy Laboratory (NREL) developed dGen to analyze the key factors that will affect future market demand for distributed solar, wind, storage, and other DER technologies in the United States. The new model builds off, extends, and replaces NREL's SolarDS model (Denholm et al. 2009a), which simulates the market penetration of distributed PV only. Unlike the SolarDS model, dGen can model various DER technologies under one platform--it currently can simulate the adoption of distributed solar (the dSolar module) and distributed wind (the dWind module) and link with the ReEDS capacity expansion model (Appendix C). The underlying algorithms and datasets in dGen, which improve the representation of customer decision making as well as the spatial resolution of analyses (Figure ES-1), also are improvements over SolarDS.

  17. Solar Integration National Dataset Toolkit | Grid Modernization | NREL

    Science.gov (United States)

    NREL is working on a Solar Integration National Dataset (SIND) Toolkit to enable researchers to perform U.S. regional solar generation integration studies. It will provide modeled, coherent subhourly solar power data.

  18. X-ray computed tomography datasets for forensic analysis of vertebrate fossils

    Science.gov (United States)

    Rowe, Timothy B.; Luo, Zhe-Xi; Ketcham, Richard A.; Maisano, Jessica A.; Colbert, Matthew W.

    2016-01-01

    We describe X-ray computed tomography (CT) datasets from three specimens recovered from Early Cretaceous lakebeds of China that illustrate the forensic interpretation of CT imagery for paleontology. Fossil vertebrates from thinly bedded sediments often shatter upon discovery and are commonly repaired as amalgamated mosaics grouted to a solid backing slab of rock or plaster. Such methods are prone to inadvertent error and willful forgery, and once required potentially destructive methods to identify mistakes in reconstruction. CT is an efficient, nondestructive alternative that can disclose many clues about how a specimen was handled and repaired. These annotated datasets illustrate the power of CT in documenting specimen integrity and are intended as a reference in applying CT more broadly to evaluating the authenticity of comparable fossils. PMID:27272251

  19. PROVIDING GEOGRAPHIC DATASETS AS LINKED DATA IN SDI

    Directory of Open Access Journals (Sweden)

    E. Hietanen

    2016-06-01

    Full Text Available In this study, a prototype service to provide data from a Web Feature Service (WFS) as linked data is implemented. First, persistent and unique Uniform Resource Identifiers (URIs) are created for all spatial objects in the dataset. The objects are available from those URIs in the Resource Description Framework (RDF) data format. Next, a Web Ontology Language (OWL) ontology is created to describe the dataset's information content using the Open Geospatial Consortium's (OGC) GeoSPARQL vocabulary. The existing data model is modified in order to take into account the linked data principles. The implemented service produces an HTTP response dynamically. The data for the response is first fetched from the existing WFS. Then the Geographic Markup Language (GML) output of the WFS is transformed on-the-fly to the RDF format. Content negotiation is used to serve the data in different RDF serialization formats. This solution facilitates the use of a dataset in different applications without replicating the whole dataset. In addition, individual spatial objects in the dataset can be referred to with URIs. Furthermore, the needed information content of the objects can be easily extracted from the RDF serializations available from those URIs. A solution for linking data objects to the dataset URI is also introduced by using the Vocabulary of Interlinked Datasets (VoID). The dataset is divided into subsets and each subset is given its own persistent and unique URI. This enables the whole dataset to be explored with a web browser and all individual objects to be indexed by search engines.
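The URI-minting and GML-to-RDF transformation described above can be sketched as follows. The base URI, dataset name, and Turtle layout are hypothetical, and a production service would use a proper RDF library rather than string assembly:

```python
def object_uri(base, dataset, local_id):
    """Mint a persistent, unique URI for one spatial object (toy scheme)."""
    return f"{base}/{dataset}/{local_id}"

def to_turtle(base, dataset, features):
    """Serialize WFS-style features as minimal GeoSPARQL-flavoured Turtle.
    A sketch of the GML -> RDF idea, not the paper's implementation."""
    lines = ["@prefix geo: <http://www.opengis.net/ont/geosparql#> ."]
    for f in features:
        uri = object_uri(base, dataset, f["id"])
        lines.append(f"<{uri}> a geo:Feature ;")
        lines.append(f'    geo:hasGeometry [ geo:asWKT "{f["wkt"]}"^^geo:wktLiteral ] .')
    return "\n".join(lines)

# One feature as it might come back from a WFS, already parsed.
features = [{"id": "42", "wkt": "POINT(24.94 60.17)"}]
ttl = to_turtle("http://data.example.org", "roads", features)
print(ttl)
```

Content negotiation would then pick between this Turtle output and other RDF serializations of the same triples depending on the request's Accept header.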

  20. Wind Integration National Dataset Toolkit | Grid Modernization | NREL

    Science.gov (United States)

    The Wind Integration National Dataset (WIND) Toolkit is an update and expansion of the Eastern Wind Integration Data Set and Western Wind Integration Data Set. It supports the next generation of wind integration studies.

  1. The Problem with Big Data: Operating on Smaller Datasets to Bridge the Implementation Gap.

    Science.gov (United States)

    Mann, Richard P; Mushtaq, Faisal; White, Alan D; Mata-Cervantes, Gabriel; Pike, Tom; Coker, Dalton; Murdoch, Stuart; Hiles, Tim; Smith, Clare; Berridge, David; Hinchliffe, Suzanne; Hall, Geoff; Smye, Stephen; Wilkie, Richard M; Lodge, J Peter A; Mon-Williams, Mark

    2016-01-01

    Big datasets have the potential to revolutionize public health. However, there is a mismatch between the political and scientific optimism surrounding big data and the public's perception of its benefit. We suggest a systematic and concerted emphasis on developing models derived from smaller datasets to illustrate to the public how big data can produce tangible benefits in the long term. In order to highlight the immediate value of a small data approach, we produced a proof-of-concept model predicting hospital length of stay. The results demonstrate that existing small datasets can be used to create models that generate a reasonable prediction, facilitating health-care delivery. We propose that greater attention (and funding) needs to be directed toward the utilization of existing information resources in parallel with current efforts to create and exploit "big data."
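A small-data proof of concept in the spirit described above can be put together in a few lines. The features, synthetic data, and least-squares model here are illustrative stand-ins for the paper's actual length-of-stay model:

```python
import numpy as np

# Synthetic admissions data: in this toy ground truth, older patients and
# emergency admissions stay longer. The real model's features differ.
rng = np.random.default_rng(0)
n = 200
age = rng.uniform(20, 90, n)
emergency = rng.integers(0, 2, n).astype(float)
los = 1.0 + 0.05 * age + 2.0 * emergency + rng.normal(0, 0.5, n)  # days

# Fit an ordinary least-squares model: los ~ intercept + age + emergency.
X = np.column_stack([np.ones(n), age, emergency])
coef, *_ = np.linalg.lstsq(X, los, rcond=None)
pred = X @ coef
r2 = 1 - np.sum((los - pred) ** 2) / np.sum((los - los.mean()) ** 2)
print("coefficients:", coef.round(2), "R^2:", round(r2, 3))
```

Even this crude sketch recovers the generating coefficients well, which is the paper's point: a modest dataset and a simple model can already yield a "reasonable prediction" useful for planning.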

  2. [Potential distribution and geographic characteristics of wild populations of Vanilla planifolia (Orchidaceae) Oaxaca, Mexico].

    Science.gov (United States)

    Hernandez-Ruiz, Jesús; Herrera-Cabrera, B Edgar; Delgado-Alvarado, Adriana; Salazar-Rojas, Víctor M; Bustamante-Gonzalez, Ángel; Campos-Contreras, Jorge E; Ramírez-Juarez, Javier

    2016-03-01

    Wild specimens of Vanilla planifolia represent a vital part of this resource's primary gene pool, and some plants have only been reported in Oaxaca, Mexico. For this reason, we studied its geographical distribution within the state, to locate and describe the ecological characteristics of the areas where they have been found, in order to identify potential areas of establishment. The method comprised four stages: 1) the creation of a database with herbarium records; 2) the construction of the potential distribution based on historical herbarium records for the species, using the maximum entropy model (MaxEnt) and 22 bioclimatic variables as predictors; 3) an in situ systematic search for individuals, based on herbarium records and areas of potential distribution in 24 municipalities, to determine the current situation and distribution of the habitat; 4) the description of the environmental factors of the potential ecological niches generated by MaxEnt. A review of herbarium collections revealed a total of 18 records of V. planifolia between 1939 and 1998. The systematic search located 28 plants distributed over 12 sites in 95 364 km². The most important variables that determined the model of vanilla potential distribution were: precipitation in the rainy season (61.9 %), soil moisture regime (23.4 %) and precipitation during the four months of highest rainfall (8.1 %). The species' potential habitat was found to be distributed in four zones: wet tropics of the Gulf of Mexico, humid temperate, humid tropical, and humid temperate in the Pacific. Precipitation oscillated within annual ranges of 2 500 to 4 000 mm, with summer rains, and winter precipitation amounting to 5 to 10 % of the total. The moisture regime and predominating climate were udic type I (330 to 365 days of moisture) and hot humid (Am/A(C)m). The plants were located at altitudes of 200 to 1 190 masl, on rough hillsides that generally make up the foothills of mountain systems with altitudes of 1 300 to 2 500 masl.
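The habitat ranges reported in the abstract can be turned into a toy suitability filter. This rule-based sketch only restates those ranges (annual rain 2 500–4 000 mm, winter rain 5–10 % of total, altitude 200–1 190 m) and is not the MaxEnt model itself:

```python
def suitable(rain_annual_mm, winter_rain_frac, altitude_m):
    """Crude habitat filter for wild V. planifolia based on the ranges
    reported in the abstract; a real analysis would use the fitted
    MaxEnt response curves, not hard thresholds."""
    return (2500 <= rain_annual_mm <= 4000
            and 0.05 <= winter_rain_frac <= 0.10
            and 200 <= altitude_m <= 1190)

print(suitable(3000, 0.07, 600))   # inside all reported ranges
print(suitable(1800, 0.07, 600))   # too dry for the reported envelope
```

MaxEnt replaces such hard thresholds with continuous suitability scores fitted to the presence records and the 22 bioclimatic predictors.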

  3. SatelliteDL: a Toolkit for Analysis of Heterogeneous Satellite Datasets

    Science.gov (United States)

    Galloy, M. D.; Fillmore, D.

    2014-12-01

    SatelliteDL is an IDL toolkit for the analysis of satellite Earth observations from a diverse set of platforms and sensors. The core function of the toolkit is the spatial and temporal alignment of satellite swath and geostationary data. The design features an abstraction layer that allows for easy inclusion of new datasets in a modular way. Our overarching objective is to create utilities that automate the mundane aspects of satellite data analysis, are extensible and maintainable, and do not place limitations on the analysis itself. IDL has a powerful suite of statistical and visualization tools that can be used in conjunction with SatelliteDL. Toward this end we have constructed SatelliteDL to include (1) HTML and LaTeX API document generation, (2) a unit test framework, (3) automatic message and error logs, (4) HTML and LaTeX plot and table generation, and (5) several real world examples with bundled datasets available for download. For ease of use, datasets, variables and optional workflows may be specified in a flexible format configuration file. Configuration statements may specify, for example, a region and date range, and the creation of images, plots and statistical summary tables for a long list of variables. SatelliteDL enforces data provenance; all data should be traceable and reproducible. The output NetCDF file metadata holds a complete history of the original datasets and their transformations, and a method exists to reconstruct a configuration file from this information. Release 0.1.0 distributes with ingest methods for GOES, MODIS, VIIRS and CERES radiance data (L1) as well as select 2D atmosphere products (L2) such as aerosol and cloud (MODIS and VIIRS) and radiant flux (CERES). Future releases will provide ingest methods for ocean and land surface products, gridded and time averaged datasets (L3 Daily, Monthly and Yearly), and support for 3D products such as temperature and water vapor profiles. Emphasis will be on NPP Sensor, Environmental and

  4. Validating a continental-scale groundwater diffuse pollution model using regional datasets.

    Science.gov (United States)

    Ouedraogo, Issoufou; Defourny, Pierre; Vanclooster, Marnik

    2017-12-11

    In this study, we assess the validity of an African-scale groundwater pollution model for nitrates. In a previous study, we identified a statistical continental-scale groundwater pollution model for nitrate. The model was identified using a pan-African meta-analysis of available nitrate groundwater pollution studies. The model was implemented in both Random Forest (RF) and multiple regression formats. For both approaches, we collected as predictors a comprehensive GIS database of 13 spatial attributes, related to land use, soil type, hydrogeology, topography, climatology, region typology, nitrogen fertiliser application rate, and population density. In this paper, we validate the continental-scale model of groundwater contamination by using a nitrate measurement dataset from three African countries. We discuss data availability, quality, and scale issues as challenges in validation. Notwithstanding that the modelling procedure exhibited very good success using a continental-scale dataset (e.g. R² = 0.97 in the RF format using a cross-validation approach), the continental-scale model could not be used without recalibration to predict nitrate pollution at the country scale using regional data. In addition, when recalibrating the model using country-scale datasets, the order of model exploratory factors changes. This suggests that the structure and the parameters of a statistical spatially distributed groundwater degradation model for the African continent are strongly scale dependent.

  5. Massive calculations of electrostatic potentials and structure maps of biopolymers in a distributed computing environment

    International Nuclear Information System (INIS)

    Akishina, T.P.; Ivanov, V.V.; Stepanenko, V.A.

    2013-01-01

    Among the key factors determining the processes of transcription and translation are the distributions of the electrostatic potentials of DNA, RNA and proteins. Calculations of electrostatic distributions and structure maps of biopolymers on computers are time consuming and require large computational resources. We developed the procedures for organization of massive calculations of electrostatic potentials and structure maps for biopolymers in a distributed computing environment (several thousands of cores).

  6. Uncertainty Visualization Using Copula-Based Analysis in Mixed Distribution Models.

    Science.gov (United States)

    Hazarika, Subhashis; Biswas, Ayan; Shen, Han-Wei

    2018-01-01

    Distributions are often used to model uncertainty in many scientific datasets. To preserve the correlation among the spatially sampled grid locations in the dataset, various standard multivariate distribution models have been proposed in the visualization literature. These models treat each grid location as a univariate random variable which models the uncertainty at that location. Standard multivariate distributions (both parametric and nonparametric) assume that all the univariate marginals are of the same type/family of distribution. But in reality, different grid locations show different statistical behavior which may not be modeled best by the same type of distribution. In this paper, we propose a new multivariate uncertainty modeling strategy to address the needs of uncertainty modeling in scientific datasets. Our proposed method is based on a statistically sound multivariate technique called a copula, which makes it possible to separate the process of estimating the univariate marginals from the process of modeling dependency, unlike the standard multivariate distributions. The modeling flexibility offered by our proposed method makes it possible to design distribution fields which can have different types of distribution (Gaussian, histogram, KDE, etc.) at the grid locations, while maintaining the correlation structure at the same time. Depending on the results of various standard statistical tests, we can choose an optimal distribution representation at each location, resulting in more cost-efficient modeling without significantly sacrificing analysis quality. To demonstrate the efficacy of our proposed modeling strategy, we extract and visualize uncertain features like isocontours and vortices in various real-world datasets. We also study various modeling criteria to help users in the task of univariate model selection.
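The separation the abstract describes (estimate marginals first, model dependence second) can be sketched with a Gaussian copula. The correlation value and the two marginal families below are illustrative assumptions, not data from the paper:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
corr = np.array([[1.0, 0.8],
                 [0.8, 1.0]])          # target correlation between two grid points

# 1) Sample from the copula: correlated standard normals -> uniforms.
L = np.linalg.cholesky(corr)
z = rng.standard_normal((10_000, 2)) @ L.T
u = stats.norm.cdf(z)                  # each column is Uniform(0, 1)

# 2) Push the uniforms through *different* marginal inverse CDFs per location.
x1 = stats.norm.ppf(u[:, 0], loc=5.0, scale=2.0)   # Gaussian marginal
x2 = stats.gamma.ppf(u[:, 1], a=2.0, scale=1.5)    # skewed marginal

# The marginals differ in family, yet rank correlation is preserved.
rho, _ = stats.spearmanr(x1, x2)
print(round(rho, 2))
```

This is the core trick the paper exploits: the dependence structure lives entirely in the copula, so each grid location is free to pick whichever marginal fits its local statistics best.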

  7. A New Outlier Detection Method for Multidimensional Datasets

    KAUST Repository

    Abdel Messih, Mario A.

    2012-07-01

    This study develops a novel hybrid method for outlier detection (HMOD) that combines the ideas of distance-based and density-based methods. The proposed method has two main advantages over most other outlier detection methods. The first advantage is that it works well on both dense and sparse datasets. The second advantage is that, unlike most other outlier detection methods that require careful parameter setting and prior knowledge of the data, HMOD is not very sensitive to small changes in parameter values within certain parameter ranges. The only parameter to set is the number of nearest neighbors. In addition, we made a fully parallelized implementation of HMOD that makes it very efficient in applications. Moreover, we propose a new way of using outlier detection for redundancy reduction in datasets, in which users can specify a confidence level evaluating how accurately the less redundant dataset represents the original one. HMOD is evaluated on synthetic datasets (dense and mixed “dense and sparse”) and on a bioinformatics problem: redundancy reduction of a dataset of position weight matrices (PWMs) of transcription factor binding sites. In addition, in the process of assessing the performance of our redundancy reduction method, we developed a simple tool that can be used to evaluate the confidence level with which a reduced dataset represents the original dataset. The evaluation of the results shows that our method can be used in a wide range of problems.
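As a hedged illustration of the distance-based ingredient such hybrid detectors build on (this is not the HMOD algorithm itself), a k-nearest-neighbour outlier score with the single neighbourhood parameter k might look like:

```python
import numpy as np

def knn_outlier_scores(X, k=5):
    """Score each point by its mean distance to its k nearest neighbours.

    A generic distance-based building block, not the paper's HMOD;
    k is the only parameter, mirroring the abstract's claim.
    """
    # Brute-force pairwise Euclidean distances (fine for small datasets).
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)              # ignore self-distance
    knn = np.sort(d, axis=1)[:, :k]          # k smallest distances per point
    return knn.mean(axis=1)

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))                # dense cluster around the origin
X = np.vstack([X, [[8.0, 8.0]]])             # one obvious outlier at index 200
scores = knn_outlier_scores(X, k=5)
print(int(np.argmax(scores)))                # index of the highest-scoring point
```

A density-based criterion would then be combined with this score to handle clusters of differing density, which is where hybrid methods such as HMOD differ from plain kNN scoring.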

  8. Using ERA-Interim reanalysis for creating datasets of energy-relevant climate variables

    Science.gov (United States)

    Jones, Philip D.; Harpham, Colin; Troccoli, Alberto; Gschwind, Benoit; Ranchin, Thierry; Wald, Lucien; Goodess, Clare M.; Dorling, Stephen

    2017-07-01

    The construction of a bias-adjusted dataset of climate variables at the near surface using ERA-Interim reanalysis is presented. A number of different, variable-dependent, bias-adjustment approaches have been proposed. Here we modify the parameters of different distributions (depending on the variable), adjusting ERA-Interim based on gridded station or direct station observations. The variables are air temperature, dewpoint temperature, precipitation (daily only), solar radiation, wind speed, and relative humidity. These are available on either 3 or 6 h timescales over the period 1979-2016. The resulting bias-adjusted dataset is available through the Climate Data Store (CDS) of the Copernicus Climate Change Service (C3S) and can at present be accessed from ftp://ecem.climate.copernicus.eu. The benefit of performing bias adjustment is demonstrated by comparing initial and bias-adjusted ERA-Interim data against gridded observational fields.
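A minimal sketch of distribution-based bias adjustment, using empirical quantile mapping as a stand-in (the actual C3S/ECEM procedure adjusts fitted, variable-dependent distribution parameters rather than empirical quantiles, and the data below are synthetic):

```python
import numpy as np

def quantile_map(model, obs, values):
    """Empirical quantile mapping: move `values` from the model's
    distribution toward the observed one.

    Illustrative only; names and method are assumptions, not the
    ECEM implementation.
    """
    # Rank each value within the model climatology...
    q = np.searchsorted(np.sort(model), values) / len(model)
    q = np.clip(q, 0.0, 1.0)
    # ...and read off the same quantile from the observations.
    return np.quantile(obs, q)

rng = np.random.default_rng(2)
obs = rng.normal(15.0, 3.0, 5000)          # "station" temperatures
model = rng.normal(17.0, 4.0, 5000)        # warm-biased "reanalysis"
adjusted = quantile_map(model, obs, model)
print(round(adjusted.mean(), 1))           # close to the station mean
```

Because the whole distribution is mapped, both the mean bias and the inflated spread of the model series are corrected at once, which is the motivation for distribution-based (rather than mean-only) adjustment.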

  9. The Low-Noise Potential of Distributed Propulsion on a Catamaran Aircraft

    Science.gov (United States)

    Posey, Joe W.; Tinetti, A. F.; Dunn, M. H.

    2006-01-01

    The noise shielding potential of an inboard-wing catamaran aircraft when coupled with distributed propulsion is examined. Here, only low-frequency jet noise from mid-wing-mounted engines is considered. Because low frequencies are the most difficult to shield, these calculations put a lower bound on the potential shielding benefit. In this proof-of-concept study, simple physical models are used to describe the 3-D scattering of jet noise by conceptualized catamaran aircraft. The Fast Scattering Code is used to predict noise levels on and about the aircraft. Shielding results are presented for several catamaran type geometries and simple noise source configurations representative of distributed propulsion radiation. Computational analyses are presented that demonstrate the shielding benefits of distributed propulsion and of increasing the width of the inboard wing. Also, sample calculations using the FSC are presented that demonstrate additional noise reduction on the aircraft fuselage by the use of acoustic liners on the inboard wing trailing edge. A full conceptual aircraft design would have to be analyzed over a complete mission to more accurately quantify community noise levels and aircraft performance, but the present shielding calculations show that a large acoustic benefit could be achieved by combining distributed propulsion and liner technology with a twin-fuselage planform.

  10. [Effect of pulse magnetic field on distribution of neuronal action potential].

    Science.gov (United States)

    Zheng, Yu; Cai, Di; Wang, Jin-Hai; Li, Gang; Lin, Ling

    2014-08-25

    The biological effects generated by magnetic fields on organisms have been widely studied. The present study was aimed at observing the changes of sodium channels in neurons under a magnetic field. Cortical neurons of Kunming mice were isolated, subjected to 15 Hz, 1 mT pulse magnetic stimulation, and then the currents of the neurons were recorded by whole-cell patch clamp. The results showed that, under magnetic stimulation, the activation process of the Na(+) channel was delayed, and the inactivation process was accelerated. Using the classic three-layer model, the polarization diagram of the cell membrane potential distribution under a pulse magnetic field was simulated, and it was found that the induced membrane potential was associated with the frequency and intensity of the magnetic field. The effect of the magnetic field-induced current on the action potential was also simulated with the Hodgkin-Huxley (H-H) model. The results showed that the generation of the action potential was delayed and its frequency and amplitude were decreased when the working current was between -1.32 μA and 0 μA. When the working current was higher than 0 μA, the generation frequency of the action potential increased while the amplitude changed little; when the working current was lower than -1.32 μA, the rise time and amplitude of the action potential decreased drastically until the action potential could no longer be generated. These results suggest that magnetic stimulation can affect the frequency and amplitude of the neuronal action potential via sodium channel mediation.
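The H-H simulation with an added working (bias) current can be sketched as follows. This uses textbook squid-axon parameters rather than the paper's fitted neuron model, so the exact thresholds differ from the -1.32 μA figure quoted above; the qualitative effect (depolarizing current promotes firing, hyperpolarizing current suppresses it) is what the sketch reproduces:

```python
import numpy as np

def hh_spike_count(i_bias, t_ms=50.0, dt=0.01):
    """Count action potentials of a Hodgkin-Huxley neuron driven by a
    constant bias current (uA/cm^2), forward-Euler integration."""
    c_m, g_na, g_k, g_l = 1.0, 120.0, 36.0, 0.3     # standard HH constants
    e_na, e_k, e_l = 50.0, -77.0, -54.387
    v, m, h, n = -65.0, 0.05, 0.6, 0.32             # resting state
    spikes, above = 0, False
    for _ in range(int(t_ms / dt)):
        # Voltage-dependent rate constants (ms^-1) for the gating variables.
        a_m = 0.1 * (v + 40.0) / (1.0 - np.exp(-(v + 40.0) / 10.0))
        b_m = 4.0 * np.exp(-(v + 65.0) / 18.0)
        a_h = 0.07 * np.exp(-(v + 65.0) / 20.0)
        b_h = 1.0 / (1.0 + np.exp(-(v + 35.0) / 10.0))
        a_n = 0.01 * (v + 55.0) / (1.0 - np.exp(-(v + 55.0) / 10.0))
        b_n = 0.125 * np.exp(-(v + 65.0) / 80.0)
        # Ionic currents and Euler update of membrane voltage and gates.
        i_ion = (g_na * m**3 * h * (v - e_na)
                 + g_k * n**4 * (v - e_k)
                 + g_l * (v - e_l))
        v += dt * (i_bias - i_ion) / c_m
        m += dt * (a_m * (1 - m) - b_m * m)
        h += dt * (a_h * (1 - h) - b_h * h)
        n += dt * (a_n * (1 - n) - b_n * n)
        if v > 0.0 and not above:                   # upward zero crossing = spike
            spikes += 1
        above = v > 0.0
    return spikes

print(hh_spike_count(10.0), hh_spike_count(-2.0))
```

With a sufficiently large depolarizing bias the model fires repetitively, while a hyperpolarizing bias silences it, mirroring the dependence on working current described in the abstract.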

  11. NP-PAH Interaction Dataset

    Data.gov (United States)

    U.S. Environmental Protection Agency — Dataset presents concentrations of organic pollutants, such as polyaromatic hydrocarbon compounds, in water samples. Water samples of known volume and concentration...

  12. A dataset on tail risk of commodities markets.

    Science.gov (United States)

    Powell, Robert J; Vo, Duc H; Pham, Thach N; Singh, Abhay K

    2017-12-01

    This article contains the datasets related to the research article "The long and short of commodity tails and their relationship to Asian equity markets" (Powell et al., 2017) [1]. The datasets contain the daily prices (and price movements) of 24 different commodities decomposed from the S&P GSCI index and the daily prices (and price movements) of three share market indices covering World, Asia, and South East Asia for the period 2004-2015. The dataset is then divided into annual periods, showing the worst 5% of price movements for each year. The datasets are convenient for examining the tail risk of different commodities, as measured by Conditional Value at Risk (CVaR), as well as its changes over time. The datasets can also be used to investigate the association between commodity markets and share markets.
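The tail measure used with these datasets can be sketched as follows; the 5% level matches the article's worst-5% annual subsets, while the synthetic data and the sign convention (losses reported as positive numbers) are assumptions:

```python
import numpy as np

def cvar(returns, alpha=0.05):
    """Conditional Value at Risk: the mean of the worst alpha-fraction of
    daily price movements, reported as a positive loss, alongside VaR."""
    returns = np.sort(np.asarray(returns))          # worst (most negative) first
    n_tail = max(1, int(np.ceil(alpha * len(returns))))
    var = -returns[n_tail - 1]                      # Value at Risk threshold
    return -returns[:n_tail].mean(), var            # CVaR averages beyond it

rng = np.random.default_rng(3)
daily_moves = rng.normal(0.0, 0.02, 252)            # one synthetic year of returns
cvar_5, var_5 = cvar(daily_moves, alpha=0.05)
print(cvar_5 >= var_5)                              # CVaR is never below VaR
```

Applying this per commodity and per year reproduces the kind of tail-risk comparison over periods that the dataset is designed for.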

  13. Feedback control in deep drawing based on experimental datasets

    Science.gov (United States)

    Fischer, P.; Heingärtner, J.; Aichholzer, W.; Hortig, D.; Hora, P.

    2017-09-01

    In large-scale production of deep drawing parts, as in the automotive industry, the effects of scattering material properties as well as warming of the tools have a significant impact on the drawing result. In the scope of this work, an approach is presented to minimize the influence of these effects on part quality by optically measuring the draw-in of each part and adjusting the settings of the press to keep the strain distribution, which is represented by the draw-in, inside certain limits. For the design of the control algorithm, a design of experiments for in-line tests is used to quantify the influence of the blank holder force as well as the force distribution on the draw-in. The results of this experimental dataset are used to model the process behavior. Based on this model, a feedback control loop is designed. Finally, the performance of the control algorithm is validated in the production line.

  14. [Research on developping the spectral dataset for Dunhuang typical colors based on color constancy].

    Science.gov (United States)

    Liu, Qiang; Wan, Xiao-Xia; Liu, Zhen; Li, Chan; Liang, Jin-Xing

    2013-11-01

    The present paper aims at developing a method to reasonably set up a typical spectral color dataset for different kinds of Chinese cultural heritage in the color rendering process. The world-famous wall paintings, dating from more than 1700 years ago, in the Dunhuang Mogao Grottoes were taken as a typical case in this research. In order to maintain color constancy during the color rendering workflow of Dunhuang cultural relics, a chromatic-adaptation-based method for developing the spectral dataset of typical colors for these wall paintings was proposed from the viewpoint of human vision perception ability. With the help and guidance of researchers at the art-research and protection-research institutions of the Dunhuang Academy, and according to the existing research achievements of Dunhuang studies in past years, 48 typical known Dunhuang pigments were chosen and 240 representative color samples were made, whose reflectance spectra ranging from 360 to 750 nm were acquired with a spectrometer. In order to find the typical colors among the above-mentioned color samples, the original dataset was divided into several subgroups by clustering analysis. The number of groups, together with the most typical samples of each subgroup which made up the initially built typical color dataset, was determined by the Wilcoxon signed rank test according to the color inconstancy index comprehensively calculated under 6 typical illuminating conditions. Considering the completeness of the gamut of the Dunhuang wall paintings, 8 complementary colors were determined, and finally the typical spectral color dataset was built up, containing 100 representative spectral colors. The analytical results show that the median color inconstancy index of the built dataset at the 99% confidence level by the Wilcoxon signed rank test was 3.28 and the 100 colors are distributed uniformly over the whole gamut, which ensures that this dataset can provide reasonable reference for choosing the color with highest

  15. Intensity-Duration-Frequency curves from remote sensing datasets: direct comparison of weather radar and CMORPH over the Eastern Mediterranean

    Science.gov (United States)

    Morin, Efrat; Marra, Francesco; Peleg, Nadav; Mei, Yiwen; Anagnostou, Emmanouil N.

    2017-04-01

    Rainfall frequency analysis is used to quantify the probability of occurrence of extreme rainfall and is traditionally based on rain gauge records. The limited spatial coverage of rain gauges is insufficient to sample the spatiotemporal variability of extreme rainfall and to provide the areal information required by management and design applications. Conversely, remote sensing instruments, even if quantitatively uncertain, offer coverage and spatiotemporal detail that allow overcoming these issues. In recent years, remote sensing datasets began to be used for frequency analyses, taking advantage of increased record lengths and quantitative adjustments of the data. However, the studies so far have made use of concepts and techniques developed for rain gauge (i.e. point or multiple-point) data and have been validated by comparison with gauge-derived analyses. These procedures add further sources of uncertainty, prevent isolating data uncertainties from methodological ones, and prevent fully exploiting the available information. In this study, we step out of the gauge-centered concept, presenting a direct comparison between at-site Intensity-Duration-Frequency (IDF) curves derived from different remote sensing datasets on corresponding spatial scales, temporal resolutions and records. We analyzed 16 years of homogeneously corrected and gauge-adjusted C-Band weather radar estimates, high-resolution CMORPH and gauge-adjusted high-resolution CMORPH over the Eastern Mediterranean. Results of this study include: (a) good spatial correlation between radar and satellite IDFs (∼0.7 for 2-5 year return periods); (b) consistent correlation and dispersion in the raw and gauge-adjusted CMORPH; (c) bias that is almost uniform with return period for 12-24 h durations; (d) radar identifying thicker-tailed distributions than CMORPH, with the tail of the distributions depending on the spatial and temporal scales. These results demonstrate the potential of remote sensing datasets for rainfall
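The frequency-analysis step behind an at-site IDF curve can be sketched by fitting annual maximum intensities for one duration to a Gumbel (EV1) distribution and reading off return levels. The distribution choice and the synthetic 16-year record below are assumptions for illustration, not the study's estimators:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
# Synthetic annual maximum rain intensities (mm/h) for a 16-"year" record,
# matching the record length analyzed in the study.
annual_max_mm_h = stats.gumbel_r.rvs(loc=20.0, scale=5.0,
                                     size=16, random_state=rng)

loc, scale = stats.gumbel_r.fit(annual_max_mm_h)

for t_return in (2, 5, 10):
    # A T-year return level is the (1 - 1/T) quantile of the annual maxima.
    p_non_exceed = 1.0 - 1.0 / t_return
    level = stats.gumbel_r.ppf(p_non_exceed, loc, scale)
    print(f"{t_return:>2}-yr return level: {level:.1f} mm/h")
```

Repeating this fit per pixel and per duration yields gridded IDF curves, which is what makes the direct radar-vs-satellite comparison in the abstract possible without going through gauges.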

  16. On potential distribution in accelerating structure with RF-quadrupole focusing

    International Nuclear Information System (INIS)

    Lymar', A.G.; Martynenko, P.A.; Khizhnyak, N.A.

    1993-01-01

    Results are presented from the calculation of the electric potential distribution between the electrodes of an accelerating system whose drift tubes are arranged in the form of match boxes. The three-dimensional Laplace equation, solved by the finite-difference method, was used in the calculations. 6 refs., 1 fig
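A two-dimensional finite-difference analogue of this calculation is sketched below. The paper solves the full 3-D problem on the actual drift-tube geometry; the square grid, electrode voltages and plain Jacobi iteration here are illustrative assumptions:

```python
import numpy as np

# Solve the 2-D Laplace equation for the potential in a box whose top
# electrode is held at 1 V while the other three walls are grounded.
n = 41
phi = np.zeros((n, n))
phi[0, :] = 1.0        # top electrode at 1 V
phi[-1, :] = 0.0       # bottom electrode grounded (sides stay at 0)

for _ in range(5000):  # Jacobi iteration of the 5-point stencil
    interior = 0.25 * (phi[:-2, 1:-1] + phi[2:, 1:-1]
                       + phi[1:-1, :-2] + phi[1:-1, 2:])
    phi[1:-1, 1:-1] = interior

print(round(phi[n // 2, n // 2], 2))  # potential at the mid-point of the gap
```

By symmetry the center of this configuration sits at a quarter of the electrode voltage, which gives a quick sanity check on the iteration; production codes replace Jacobi with faster successive over-relaxation or multigrid solvers on the real 3-D geometry.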

  17. A bivariate contaminated binormal model for robust fitting of proper ROC curves to a pair of correlated, possibly degenerate, ROC datasets.

    Science.gov (United States)

    Zhai, Xuetong; Chakraborty, Dev P

    2017-06-01

    The objective was to design and implement a bivariate extension to the contaminated binormal model (CBM) to fit paired receiver operating characteristic (ROC) datasets, possibly degenerate, with proper ROC curves. Paired datasets yield two correlated ratings per case. Degenerate datasets have no interior operating points, and proper ROC curves do not inappropriately cross the chance diagonal. The existing method, developed more than three decades ago, utilizes a bivariate extension to the binormal model, implemented in CORROC2 software, which yields improper ROC curves and cannot fit degenerate datasets. CBM can fit proper ROC curves to unpaired (i.e., yielding one rating per case) and degenerate datasets, and there is a clear scientific need to extend it to handle paired datasets. In CBM, nondiseased cases are modeled by a probability density function (pdf) consisting of a unit variance peak centered at zero. Diseased cases are modeled with a mixture distribution whose pdf consists of two unit variance peaks, one centered at positive μ with integrated probability α, the mixing fraction parameter, corresponding to the fraction of diseased cases where the disease was visible to the radiologist, and one centered at zero, with integrated probability (1-α), corresponding to disease that was not visible. It is shown that: (a) for nondiseased cases the bivariate extension is a unit-variance bivariate normal distribution centered at (0,0) with a specified correlation ρ1; (b) for diseased cases the bivariate extension is a mixture distribution with four peaks, corresponding to disease not visible in either condition, disease visible in only one condition (contributing two peaks), and disease visible in both conditions. An expression for the likelihood function is derived. A maximum likelihood estimation (MLE) algorithm, CORCBM, was implemented in the R programming language that yields parameter estimates, the covariance matrix of the parameters, and other statistics

  18. Proteomics dataset

    DEFF Research Database (Denmark)

    Bennike, Tue Bjerg; Carlsen, Thomas Gelsing; Ellingsen, Torkell

    2017-01-01

    patients (Morgan et al., 2012; Abraham and Medzhitov, 2011; Bennike, 2014) [8–10]. Therefore, we characterized the proteome of colon mucosa biopsies from 10 inflammatory bowel disease ulcerative colitis (UC) patients, 11 gastrointestinal healthy rheumatoid arthritis (RA) patients, and 10 controls. We...... been deposited to the ProteomeXchange Consortium via the PRIDE partner repository with the dataset identifiers PXD001608 for ulcerative colitis and control samples, and PXD003082 for rheumatoid arthritis samples.

  19. Comparison of Shallow Survey 2012 Multibeam Datasets

    Science.gov (United States)

    Ramirez, T. M.

    2012-12-01

    The purpose of the Shallow Survey common dataset is a comparison of the different technologies utilized for data acquisition in the shallow survey marine environment. The common dataset consists of a series of surveys conducted over a common area of seabed using a variety of systems. It provides equipment manufacturers the opportunity to showcase their latest systems while giving hydrographic researchers and scientists a chance to test their latest algorithms on the dataset so that rigorous comparisons can be made. Five companies collected data for the Common Dataset in the Wellington Harbor area in New Zealand between May 2010 and May 2011, including Kongsberg, Reson, R2Sonic, GeoAcoustics, and Applied Acoustics. The Wellington harbor and surrounding coastal area was selected since it has a number of well-defined features, including the HMNZS South Seas and HMNZS Wellington wrecks, an armored seawall constructed of Tetrapods and Akmons, aquifers, wharves and marinas. The seabed inside the harbor basin is largely fine-grained sediment, with gravel and reefs around the coast. The area outside the harbor on the southern coast is an active environment, with moving sand and exposed reefs. A marine reserve is also in this area. For consistency between datasets, the coastal research vessel R/V Ikatere and crew were used for all surveys conducted for the common dataset. Using Triton's Perspective processing software, multibeam datasets collected for the Shallow Survey were processed for detailed analysis. Datasets from each sonar manufacturer were processed using the CUBE algorithm developed by the Center for Coastal and Ocean Mapping/Joint Hydrographic Center (CCOM/JHC). Each dataset was gridded at 0.5 and 1.0 meter resolutions for cross comparison and compliance with International Hydrographic Organization (IHO) requirements. Detailed comparisons were made of equipment specifications (transmit frequency, number of beams, beam width), data density, total uncertainty, and

  20. Investigating country-specific music preferences and music recommendation algorithms with the LFM-1b dataset.

    Science.gov (United States)

    Schedl, Markus

    2017-01-01

    Recently, the LFM-1b dataset has been proposed to foster research and evaluation in music retrieval and music recommender systems, Schedl (Proceedings of the ACM International Conference on Multimedia Retrieval (ICMR). New York, 2016). It contains more than one billion music listening events created by more than 120,000 users of Last.fm. Each listening event is characterized by artist, album, and track name, and further includes a timestamp. Basic demographic information and a selection of more elaborate listener-specific descriptors are included as well, for anonymized users. In this article, we reveal information about LFM-1b's acquisition and content and we compare it to existing datasets. We furthermore provide an extensive statistical analysis of the dataset, including basic properties of the item sets, demographic coverage, distribution of listening events (e.g., over artists and users), and aspects related to music preference and consumption behavior (e.g., temporal features and mainstreaminess of listeners). Exploiting country information of users and genre tags of artists, we also create taste profiles for populations and determine similar and dissimilar countries in terms of their populations' music preferences. Finally, we illustrate the dataset's usage in a simple artist recommendation task, whose results are intended to serve as baseline against which more elaborate techniques can be assessed.

  1. National Hydrography Dataset (NHD)

    Data.gov (United States)

    Kansas Data Access and Support Center — The National Hydrography Dataset (NHD) is a feature-based database that interconnects and uniquely identifies the stream segments or reaches that comprise the...

  2. The Harvard organic photovoltaic dataset.

    Science.gov (United States)

    Lopez, Steven A; Pyzer-Knapp, Edward O; Simm, Gregor N; Lutzow, Trevor; Li, Kewei; Seress, Laszlo R; Hachmann, Johannes; Aspuru-Guzik, Alán

    2016-09-27

    The Harvard Organic Photovoltaic Dataset (HOPV15) presented in this work is a collation of experimental photovoltaic data from the literature, and corresponding quantum-chemical calculations performed over a range of conformers, each with quantum chemical results using a variety of density functionals and basis sets. It is anticipated that this dataset will be of use in both relating electronic structure calculations to experimental observations through the generation of calibration schemes, as well as for the creation of new semi-empirical methods and the benchmarking of current and future model chemistries for organic electronic applications.

  3. Risk behaviours among internet-facilitated sex workers: evidence from two new datasets.

    Science.gov (United States)

    Cunningham, Scott; Kendall, Todd D

    2010-12-01

    Sex workers have historically played a central role in STI outbreaks by forming a core group for transmission and due to their higher rates of concurrency and inconsistent condom usage. Over the past 15 years, North American commercial sex markets have been radically reorganised by internet technologies that channelled a sizeable share of the marketplace online. These changes may have had a meaningful impact on the role that sex workers play in STI epidemics. In this study, two new datasets documenting the characteristics and practices of internet-facilitated sex workers are presented and analysed. The first dataset comes from a ratings website where clients share detailed information on over 94,000 sex workers in over 40 cities between 1999 and 2008. The second dataset reflects a year-long field survey of 685 sex workers who advertise online. Evidence from these datasets suggests that internet-facilitated sex workers are dissimilar from the street-based workers who largely populated the marketplace in earlier eras. Differences in characteristics and practices were found which suggest a lower potential for the spread of STIs among internet-facilitated sex workers. The internet-facilitated population appears to include a high proportion of sex workers who are well-educated, hold health insurance and operate only part time. They also engage in relatively low levels of risky sexual practices.

  4. Tables and figure datasets

    Data.gov (United States)

    U.S. Environmental Protection Agency — Soil and air concentrations of asbestos in Sumas study. This dataset is associated with the following publication: Wroble, J., T. Frederick, A. Frame, and D....

  5. Experiments to Distribute Map Generalization Processes

    Science.gov (United States)

    Berli, Justin; Touya, Guillaume; Lokhat, Imran; Regnauld, Nicolas

    2018-05-01

    Automatic map generalization requires the use of computationally intensive processes often unable to deal with large datasets. Distributing the generalization process is the only way to make them scalable and usable in practice. But map generalization is a highly contextual process, and the surroundings of a generalized map feature need to be known to generalize the feature, which is a problem because distribution might partition the dataset and parallelize the processing of each part. This paper proposes experiments to evaluate past propositions to distribute map generalization and to identify the main remaining issues. The past propositions to distribute map generalization are first discussed, and then the experiment hypotheses and apparatus are described. The experiments confirmed that regular partitioning was the quickest strategy, but also the least effective in taking context into account. The geographical partitioning, though less effective for now, is quite promising regarding the quality of the results as it better integrates the geographical context.

  6. Creating a regional MODIS satellite-driven net primary production dataset for european forests

    NARCIS (Netherlands)

    Neumann, Mathias; Moreno, Adam; Thurnher, Christopher; Mues, Volker; Härkönen, Sanna; Mura, Matteo; Bouriaud, Olivier; Lang, Mait; Cardellini, Giuseppe; Thivolle-Cazat, Alain; Bronisz, Karol; Merganic, Jan; Alberdi, Iciar; Astrup, Rasmus; Mohren, Frits; Zhao, Maosheng; Hasenauer, Hubert

    2016-01-01

    Net primary production (NPP) is an important ecological metric for studying forest ecosystems and their carbon sequestration, for assessing the potential supply of food or timber and quantifying the impacts of climate change on ecosystems. The global MODIS NPP dataset using the MOD17 algorithm

  7. Intrinsic potential of cell membranes: opposite effects of lipid transmembrane asymmetry and asymmetric salt ion distribution

    DEFF Research Database (Denmark)

    Gurtovenko, Andrey A; Vattulainen, Ilpo

    2009-01-01

    Using atomic-scale molecular dynamics simulations, we consider the intrinsic cell membrane potential that is found to originate from a subtle interplay between lipid transmembrane asymmetry and the asymmetric distribution of monovalent salt ions on the two sides of the cell membrane. It turns out that both the asymmetric distribution of phosphatidylcholine (PC) and phosphatidylethanolamine (PE) lipids across a membrane and the asymmetric distribution of NaCl and KCl induce nonzero drops in the transmembrane potential. However, these potential drops are opposite in sign. As the PC leaflet faces a NaCl saline solution and the PE leaflet is exposed to KCl, the outcome is that the effects of asymmetric lipid and salt ion distributions essentially cancel one another almost completely. Overall, our study highlights the complex nature of the intrinsic potential of cell membranes under physiological conditions.

  8. An economic evaluation of the potential for distributed energy in Australia

    International Nuclear Information System (INIS)

    Lilley, William E.; Reedman, Luke J.; Wagner, Liam D.; Alie, Colin F.; Szatow, Anthony R.

    2012-01-01

    We present here economic findings from a major study by Australia's Commonwealth Scientific and Industrial Research Organisation (CSIRO) on the value of distributed energy technologies (DE; collectively demand management, energy efficiency and distributed generation) for reducing greenhouse gas emissions from Australia's energy sector (CSIRO, 2009). The study covered potential economic, environmental, technical, social, policy and regulatory impacts that could result from their wide scale adoption. Partial Equilibrium modeling of the stationary energy and transport sectors found that Australia could achieve a present value welfare gain of around $130 billion when operating under a 450 ppm carbon reduction trajectory through to 2050. Modeling also suggests that reduced volatility in the spot market could decrease average prices by up to 12% in 2030 and 65% in 2050 by using local resources to better cater for an evolving supply–demand imbalance. Further modeling suggests that even a small amount of distributed generation located within a distribution network has the potential to significantly alter electricity prices by changing the merit order of dispatch in an electricity spot market. Changes to the dispatch relative to a base case can have both positive and negative effects on network losses. - Highlights: ► Quantified impact of distributed generation (DG) on the Australian energy sector. ► Australia could achieve a welfare gain of around $130 billion through to 2050. ► Wholesale market modeling found that DG led to lower price levels and volatility. ► DG has impacts on the transmission system in terms of dispatch and system losses.

  9. Potential worldwide distribution of Fusarium dry root rot in common beans based on the optimal environment for disease occurrence.

    Science.gov (United States)

    Macedo, Renan; Sales, Lilian Patrícia; Yoshida, Fernanda; Silva-Abud, Lidianne Lemes; Lobo, Murillo

    2017-01-01

    Root rots are a constraint for staple food crops and a long-lasting food security problem worldwide. In common beans, yield losses originating from root damage are frequently attributed to dry root rot, a disease caused by the Fusarium solani species complex. The aim of this study was to model the current potential distribution of common bean dry root rot on a global scale and to project changes based on future expectations of climate change. Our approach used a spatial proxy of the field disease occurrence, instead of solely the pathogen distribution. We modeled the pathogen environmental requirements in locations where in-situ inoculum density seems ideal for disease manifestation. A dataset of 2,311 soil samples from commercial farms assessed from 2002 to 2015 allowed us to evaluate the environmental conditions associated with the pathogen's optimum inoculum density for disease occurrence, using a lower threshold as a spatial proxy. We encompassed not only the optimal conditions for disease occurrence but also the optimal pathogen's density required for host infection. An intermediate inoculum density of the pathogen was the best disease proxy, suggesting density-dependent mechanisms on host infection. We found a strong convergence on the environmental requirements of both the host and the disease development in tropical areas, mostly in Brazil, Central America, and African countries. Precipitation and temperature variables were important for explaining the disease occurrence (from 17.63% to 43.84%). Climate change will probably move the disease toward cooler regions, which in Brazil are more representative of small-scale farming, although an overall shrink in total area (from 48% to 49% in 2050 and 26% to 41% in 2070) was also predicted. Understanding pathogen distribution and disease risks in an evolutionary context will therefore support breeding for resistance programs and strategies for dry root rot management in common beans.

  10. Potential worldwide distribution of Fusarium dry root rot in common beans based on the optimal environment for disease occurrence.

    Directory of Open Access Journals (Sweden)

    Renan Macedo

    Full Text Available Root rots are a constraint for staple food crops and a long-lasting food security problem worldwide. In common beans, yield losses originating from root damage are frequently attributed to dry root rot, a disease caused by the Fusarium solani species complex. The aim of this study was to model the current potential distribution of common bean dry root rot on a global scale and to project changes based on future expectations of climate change. Our approach used a spatial proxy of the field disease occurrence, instead of solely the pathogen distribution. We modeled the pathogen environmental requirements in locations where in-situ inoculum density seems ideal for disease manifestation. A dataset of 2,311 soil samples from commercial farms assessed from 2002 to 2015 allowed us to evaluate the environmental conditions associated with the pathogen's optimum inoculum density for disease occurrence, using a lower threshold as a spatial proxy. We encompassed not only the optimal conditions for disease occurrence but also the optimal pathogen's density required for host infection. An intermediate inoculum density of the pathogen was the best disease proxy, suggesting density-dependent mechanisms on host infection. We found a strong convergence on the environmental requirements of both the host and the disease development in tropical areas, mostly in Brazil, Central America, and African countries. Precipitation and temperature variables were important for explaining the disease occurrence (from 17.63% to 43.84%). Climate change will probably move the disease toward cooler regions, which in Brazil are more representative of small-scale farming, although an overall shrink in total area (from 48% to 49% in 2050 and 26% to 41% in 2070) was also predicted. Understanding pathogen distribution and disease risks in an evolutionary context will therefore support breeding for resistance programs and strategies for dry root rot management in common beans.

  11. PHYSICS PERFORMANCE AND DATASET (PPD)

    CERN Multimedia

    L. Silvestris

    2013-01-01

    The first part of the Long Shutdown period has been dedicated to the preparation of the samples for the analyses targeting the summer conferences. In particular, the 8 TeV data acquired in 2012, including most of the “parked datasets”, have been reconstructed profiting from improved alignment and calibration conditions for all the sub-detectors. A careful planning of the resources was essential in order to deliver the datasets to the analysts well in time, and to schedule the update of all the conditions and calibrations needed at the analysis level. The newly reprocessed data have undergone detailed scrutiny by the Dataset Certification team, allowing some of the data to be recovered for analysis usage and further improving the certification efficiency, which is now at 91% of the recorded luminosity. With the aim of delivering a consistent dataset for 2011 and 2012, both in terms of conditions and release (53X), the PPD team is now working to set up a data re-reconstruction and a new MC pro...

  12. Overview of the CERES Edition-4 Multilayer Cloud Property Datasets

    Science.gov (United States)

    Chang, F. L.; Minnis, P.; Sun-Mack, S.; Chen, Y.; Smith, R. A.; Brown, R. R.

    2014-12-01

    Knowledge of the cloud vertical distribution is important for understanding the role of clouds on earth's radiation budget and climate change. Since high-level cirrus clouds with low emission temperatures and small optical depths can provide a positive feedback to a climate system and low-level stratus clouds with high emission temperatures and large optical depths can provide a negative feedback effect, the retrieval of multilayer cloud properties using satellite observations, like Terra and Aqua MODIS, is critically important for a variety of cloud and climate applications. For the objective of the Clouds and the Earth's Radiant Energy System (CERES), new algorithms have been developed using Terra and Aqua MODIS data to allow separate retrievals of cirrus and stratus cloud properties when the two dominant cloud types are simultaneously present in a multilayer system. In this paper, we will present an overview of the new CERES Edition-4 multilayer cloud property datasets derived from Terra as well as Aqua. Assessment of the new CERES multilayer cloud datasets will include high-level cirrus and low-level stratus cloud heights, pressures, and temperatures as well as their optical depths, emissivities, and microphysical properties.

  13. A public dataset of overground and treadmill walking kinematics and kinetics in healthy individuals

    Directory of Open Access Journals (Sweden)

    Claudiane A. Fukuchi

    2018-04-01

    Full Text Available In a typical clinical gait analysis, the gait patterns of pathological individuals are commonly compared with the typically faster, comfortable pace of healthy subjects. However, due to potential bias related to gait speed, this comparison may not be valid. Publicly available gait datasets have failed to address this issue. Therefore, the goal of this study was to present a publicly available dataset of 42 healthy volunteers (24 young adults and 18 older adults) who walked both overground and on a treadmill at a range of gait speeds. Their lower-extremity and pelvis kinematics were measured using a three-dimensional (3D) motion-capture system. The external forces during both overground and treadmill walking were collected using force plates and an instrumented treadmill, respectively. The results include both raw and processed kinematic and kinetic data in different file formats: c3d and ASCII files. In addition, a metadata file is provided that contains demographic and anthropometric data and data related to each file in the dataset. All data are available at Figshare (DOI: 10.6084/m9.figshare.5722711). We foresee several applications of this public dataset, including to examine the influences of speed, age, and environment (overground vs. treadmill) on gait biomechanics, to meet educational needs, and, with the inclusion of additional participants, to use as a normative dataset.

  14. Integrated Surface Dataset (Global)

    Data.gov (United States)

    National Oceanic and Atmospheric Administration, Department of Commerce — The Integrated Surface Dataset (ISD) is composed of worldwide surface weather observations from over 35,000 stations, though the best spatial coverage is...

  15. Aaron Journal article datasets

    Data.gov (United States)

    U.S. Environmental Protection Agency — All figures used in the journal article are in netCDF format. This dataset is associated with the following publication: Sims, A., K. Alapaty, and S. Raman....

  16. Market Squid Ecology Dataset

    Data.gov (United States)

    National Oceanic and Atmospheric Administration, Department of Commerce — This dataset contains ecological information collected on the major adult spawning and juvenile habitats of market squid off California and the US Pacific Northwest....

  17. Adaptive distributional extensions to DFR ranking

    DEFF Research Database (Denmark)

    Petersen, Casper; Simonsen, Jakob Grue; Järvelin, Kalervo

    2016-01-01

    -fitting distribution. We call this model Adaptive Distributional Ranking (ADR) because it adapts the ranking to the statistics of the specific dataset being processed each time. Experiments on TREC data show ADR to outperform DFR models (and their extensions) and be comparable in performance to a query likelihood...
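    The "adaptive" step in ADR, re-fitting candidate distributions to the statistics of the specific dataset and ranking with the best fit, amounts to maximum-likelihood model selection. Below is a minimal pure-Python sketch under assumed candidate families (exponential and normal); the paper's actual candidate set and scoring differ:

```python
import math

def loglik_exponential(xs):
    # MLE rate = 1/mean; total log-likelihood of an exponential fit
    lam = 1.0 / (sum(xs) / len(xs))
    return sum(math.log(lam) - lam * x for x in xs)

def loglik_normal(xs):
    # MLE mean and (biased) variance; total log-likelihood of a normal fit
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n
    return sum(-0.5 * math.log(2 * math.pi * var)
               - (x - mu) ** 2 / (2 * var) for x in xs)

def best_fit(xs):
    # Return the candidate family with the highest log-likelihood
    candidates = {"exponential": loglik_exponential, "normal": loglik_normal}
    return max(candidates, key=lambda name: candidates[name](xs))
```

The ranking model is then built from whichever family wins on the dataset at hand, rather than being fixed in advance.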

  18. fCCAC: functional canonical correlation analysis to evaluate covariance between nucleic acid sequencing datasets.

    Science.gov (United States)

    Madrigal, Pedro

    2017-03-01

    Computational evaluation of variability across DNA or RNA sequencing datasets is a crucial step in genomic science, as it allows both to evaluate reproducibility of biological or technical replicates, and to compare different datasets to identify their potential correlations. Here we present fCCAC, an application of functional canonical correlation analysis to assess covariance of nucleic acid sequencing datasets such as chromatin immunoprecipitation followed by deep sequencing (ChIP-seq). We show how this method differs from other measures of correlation, and exemplify how it can reveal shared covariance between histone modifications and DNA binding proteins, such as the relationship between the H3K4me3 chromatin mark and its epigenetic writers and readers. An R/Bioconductor package is available at http://bioconductor.org/packages/fCCAC/ . pmb59@cam.ac.uk. Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press.

  19. Effect of the Curved Spacetime on the Electrostatic Potential Energy Distribution of Strange Stars

    Institute of Scientific and Technical Information of China (English)

    陈次星; 张家铝

    2001-01-01

    The effect of the strong gravitational field of the strange core of a strange star on its surface electrostatic potential energy distribution is discussed. We present the general-relativistic hydrodynamics equations of fluids in the presence of electric fields and investigate the surface electrostatic potential distribution of the strange core of a strange star in hydrostatic equilibrium, correcting Alcock and co-workers' result [Astrophys. J. 310 (1986) 261]. Also, we discuss the temperature distribution of the bare strange star surface and give the related formulae, which may be useful if we are further concerned about the physical processes near the quark matter surfaces of strange stars.

  20. Projecting future expansion of invasive species: comparing and improving methodologies for species distribution modeling.

    Science.gov (United States)

    Mainali, Kumar P; Warren, Dan L; Dhileepan, Kunjithapatham; McConnachie, Andrew; Strathie, Lorraine; Hassan, Gul; Karki, Debendra; Shrestha, Bharat B; Parmesan, Camille

    2015-12-01

    Modeling the distributions of species, especially of invasive species in non-native ranges, involves multiple challenges. Here, we developed some novel approaches to species distribution modeling aimed at reducing the influences of such challenges and improving the realism of projections. We estimated species-environment relationships for Parthenium hysterophorus L. (Asteraceae) with four modeling methods run with multiple scenarios of (i) sources of occurrences and geographically isolated background ranges for absences, (ii) approaches to drawing background (absence) points, and (iii) alternate sets of predictor variables. We further tested various quantitative metrics of model evaluation against biological insight. Model projections were very sensitive to the choice of training dataset. Model accuracy was much improved using a global dataset for model training, rather than restricting data input to the species' native range. AUC score was a poor metric for model evaluation and, if used alone, was not a useful criterion for assessing model performance. Projections away from the sampled space (i.e., into areas of potential future invasion) were very different depending on the modeling methods used, raising questions about the reliability of ensemble projections. Generalized linear models gave very unrealistic projections far away from the training region. Models that efficiently fit the dominant pattern, but exclude highly local patterns in the dataset and capture interactions as they appear in data (e.g., boosted regression trees), improved generalization of the models. Biological knowledge of the species and its distribution was important in refining choices about the best set of projections. A post hoc test conducted on a new Parthenium dataset from Nepal validated excellent predictive performance of our 'best' model. We showed that vast stretches of currently uninvaded geographic areas on multiple continents harbor highly suitable habitats for parthenium
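    The AUC score questioned above is the rank-based (Mann–Whitney) probability that a randomly chosen presence site receives a higher model score than a randomly chosen absence site. A minimal sketch of that metric, for illustration only, follows:

```python
def auc(scores_pos, scores_neg):
    # Rank-based AUC: probability that a random presence scores higher
    # than a random absence; ties count as half a win.
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))
```

Because AUC summarizes only this ranking, two models with very different projected ranges can share the same score, which is one reason the authors caution against using it alone.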

  1. ATLAS File and Dataset Metadata Collection and Use

    CERN Document Server

    Albrand, S; The ATLAS collaboration; Lambert, F; Gallas, E J

    2012-01-01

    The ATLAS Metadata Interface (“AMI”) was designed as a generic cataloguing system, and as such it has found many uses in the experiment including software release management, tracking of reconstructed event sizes and control of dataset nomenclature. The primary use of AMI is to provide a catalogue of datasets (file collections) which is searchable using physics criteria. In this paper we discuss the various mechanisms used for filling the AMI dataset and file catalogues. By correlating information from different sources we can derive aggregate information which is important for physics analysis; for example the total number of events contained in a dataset, and possible reasons for missing events such as a lost file. Finally we will describe some specialized interfaces which were developed for the Data Preparation and reprocessing coordinators. These interfaces manipulate information from both the dataset domain held in AMI, and the run-indexed information held in the ATLAS COMA application (Conditions and ...

  2. Norwegian Hydrological Reference Dataset for Climate Change Studies

    Energy Technology Data Exchange (ETDEWEB)

    Magnussen, Inger Helene; Killingland, Magnus; Spilde, Dag

    2012-07-01

    Based on the Norwegian hydrological measurement network, NVE has selected a Hydrological Reference Dataset for studies of hydrological change. The dataset meets international standards with high data quality. It is suitable for monitoring and studying the effects of climate change on the hydrosphere and cryosphere in Norway. The dataset includes streamflow, groundwater, snow, glacier mass balance and length change, lake ice and water temperature in rivers and lakes. (Author)

  3. The Harvard organic photovoltaic dataset

    Science.gov (United States)

    Lopez, Steven A.; Pyzer-Knapp, Edward O.; Simm, Gregor N.; Lutzow, Trevor; Li, Kewei; Seress, Laszlo R.; Hachmann, Johannes; Aspuru-Guzik, Alán

    2016-01-01

    The Harvard Organic Photovoltaic Dataset (HOPV15) presented in this work is a collation of experimental photovoltaic data from the literature, and corresponding quantum-chemical calculations performed over a range of conformers, each with quantum chemical results using a variety of density functionals and basis sets. It is anticipated that this dataset will be of use in both relating electronic structure calculations to experimental observations through the generation of calibration schemes, as well as for the creation of new semi-empirical methods and the benchmarking of current and future model chemistries for organic electronic applications. PMID:27676312

  4. Synthetic and Empirical Capsicum Annuum Image Dataset

    NARCIS (Netherlands)

    Barth, R.

    2016-01-01

    This dataset consists of per-pixel annotated synthetic (10500) and empirical (50) images of Capsicum annuum, also known as sweet or bell pepper, situated in a commercial greenhouse. Furthermore, the source models used to generate the synthetic images are included. The aim of the datasets is to

  5. Interaction of landscape variables on the potential geographical distribution of parrots in the Yucatan Peninsula, Mexico

    Directory of Open Access Journals (Sweden)

    Plasencia-Vázquez, A. H.

    2014-12-01

    Full Text Available The loss, degradation, and fragmentation of forested areas are endangering parrot populations. In this study, we determined the influence of fragmentation in relation to vegetation cover, land use, and the spatial configuration of fragments on the potential geographical distribution patterns of parrots in the Yucatan Peninsula, Mexico. We used the potential geographical distribution for eight parrot species, considering the recently published maps obtained with the maximum entropy algorithm, and we incorporated the probability distribution for each species. We calculated 71 metrics/variables that evaluate forest fragmentation, the spatial configuration of fragments, the ratio occupied by vegetation, and the land use in 100 plots of approximately 29 km², randomly distributed within the presence and absence areas predicted for each species. We also considered the relationship between environmental variables and the distribution probability of species. We used a partial least squares regression to explore patterns between the variables used and the potential distribution models. None of the environmental variables analyzed alone determined the presence/absence or the probability distribution of parrots in the Peninsula. We found that for the eight species, whether in terms of presence/absence or of the probability distribution, the most important explanatory variables were the interactions among three variables, particularly the interactions among the total forest area, the total edge, and the tropical semi-evergreen medium-height forest. Habitat fragmentation influenced the potential geographical distribution of these species jointly with other environmental factors, such as the differing vegetation cover ratios and land uses in deforested areas.

  6. Design of multiple sequence alignment algorithms on parallel, distributed memory supercomputers.

    Science.gov (United States)

    Church, Philip C; Goscinski, Andrzej; Holt, Kathryn; Inouye, Michael; Ghoting, Amol; Makarychev, Konstantin; Reumann, Matthias

    2011-01-01

    The challenge of comparing two or more genomes that have undergone recombination and substantial amounts of segmental loss and gain has recently been addressed for small numbers of genomes. However, datasets of hundreds of genomes are now common and their sizes will only increase in the future. Multiple sequence alignment of hundreds of genomes remains an intractable problem due to quadratic increases in compute time and memory footprint. To date, most alignment algorithms are designed for commodity clusters without parallelism. Hence, we propose the design of a multiple sequence alignment algorithm on massively parallel, distributed memory supercomputers to enable research into comparative genomics on large data sets. Following the methodology of the sequential progressiveMauve algorithm, we design data structures including sequences and sorted k-mer lists on the IBM Blue Gene/P supercomputer (BG/P). Preliminary results show that we can reduce the memory footprint so that we can potentially align over 250 bacterial genomes on a single BG/P compute node. We verify our results on a dataset of E.coli, Shigella and S.pneumoniae genomes. Our implementation returns results matching those of the original algorithm but in 1/2 the time and with 1/4 the memory footprint for scaffold building. In this study, we have laid the basis for multiple sequence alignment of large-scale datasets on a massively parallel, distributed memory supercomputer, thus enabling comparison of hundreds instead of a few genome sequences within reasonable time.
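    The sorted k-mer lists mentioned above act as the seed index from which matching anchors between genomes are found. A toy sketch of the data structure follows (illustrative only; the BG/P implementation distributes these lists across compute nodes):

```python
def sorted_kmer_list(seq, k):
    # Enumerate all overlapping k-mers with their start offsets, then
    # sort lexicographically -- the index progressiveMauve-style
    # aligners use to locate matching seeds between genomes.
    kmers = [(seq[i:i + k], i) for i in range(len(seq) - k + 1)]
    return sorted(kmers)

def shared_kmers(seq_a, seq_b, k):
    # Intersect two k-mer sets to find candidate anchor seeds.
    a = {kmer for kmer, _ in sorted_kmer_list(seq_a, k)}
    b = {kmer for kmer, _ in sorted_kmer_list(seq_b, k)}
    return sorted(a & b)
```

Keeping the lists sorted is what makes the parallel version tractable: each node can hold a contiguous lexicographic slice and merge against its neighbours.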

  7. Hydro power potentials of water distribution networks in public universities: A case study

    Directory of Open Access Journals (Sweden)

    Olufemi Adebola KOYA

    2017-06-01

    Full Text Available Public universities in Southwestern Nigeria are densely populated, student-resident campuses, so the provision of regular potable water and electricity is important, but power supply is not optimally available for all the necessary activities. This study assesses the hydropower potential of the water distribution networks in the universities, with a view to augmenting the inadequate power supplies. The institutions with water distribution configurations capable of accommodating an in-pipe turbine are identified; the hydropower parameters, such as the flow characteristics and the pipe geometry, are determined to estimate the water power. A global positioning device is used in estimating the elevations of the distribution reservoirs and the nodal points. The hydropower potential of each location is computed incorporating a Lucid® Lift-based spherical turbine in the pipeline. From the analysis, the lean and the peak water power are between 1.92–3.30 kW and 3.95–7.24 kW, respectively, for reservoir-fed distribution networks, while a minimum of 0.72 kW is obtained for pipelines associated with borehole-fed overhead tanks. Possible applications of electricity generation from the water distribution networks of the public universities are recommended.
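    Water power estimates of this kind follow the standard hydraulic relation P = η·ρ·g·Q·H. A short sketch, where the 60% turbine efficiency is an illustrative assumption rather than a figure from the study:

```python
RHO_WATER = 1000.0   # density of water, kg/m^3
G = 9.81             # gravitational acceleration, m/s^2

def water_power_kw(flow_m3_s, head_m, efficiency=0.6):
    # Hydraulic power P = eta * rho * g * Q * H, returned in kilowatts.
    # The default 60% efficiency is a hypothetical value for a small
    # in-pipe spherical turbine, not a number taken from the paper.
    return efficiency * RHO_WATER * G * flow_m3_s * head_m / 1000.0
```

For example, a 20 L/s flow over a 30 m elevation drop yields roughly 3.5 kW at this assumed efficiency, the same order as the reservoir-fed figures quoted above.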

  8. Modeling the potential distribution of three lichens of the Xanthoparmelia pulla group (Parmeliaceae, Ascomycota) in Central Europe

    Directory of Open Access Journals (Sweden)

    Katarzyna Szczepańska

    2015-12-01

    Full Text Available The paper presents models of the potential geographical distribution of Xanthoparmelia delisei, X. loxodes, and X. verruculifera in Central Europe. The models were developed with the MaxEnt (maximum entropy) algorithm based on 224 collection localities and bioclimatic variables. The applied method enabled identification of the areas where climatic conditions are the most suitable for the modeled species outside their known localities. According to the obtained models, high potential distribution of X. delisei and X. loxodes was found in northern and northeastern Poland, while the areas most suitable for X. verruculifera lay in the south, especially in the Carpathians. The model also suggests that the potential distribution of X. delisei could be wider than the known data on its occurrence indicate, extending into Lithuania, Belarus, and the Czech Republic. MaxEnt modeling of X. loxodes showed the widest potential distribution for this species in Central Europe, with the best regions in Lithuania. Potential distribution in all models was strongly influenced by precipitation-related variables. All the modeled species prefer areas where precipitation in the coldest quarter is very low.

  9. The nucleon-nucleon correlations and the integral characteristics of the potential distributions in nuclei

    International Nuclear Information System (INIS)

    Knyaz'kov, O.M.; Kukhtina, I.N.

    1989-01-01

    The integral characteristics of the potential distribution in nuclei, namely the volume integrals, moments, and mean square radii, are studied in the framework of the semimicroscopic approach to the interaction of low-energy nucleons with nuclei, on the basis of the exchange nucleon-nucleon correlations and the density dependence of effective forces. The ratio of the normalized multipole moments of the potential and matter distributions is investigated. The energy dependence of the integral characteristics is analyzed. 15 refs.; 2 tabs

  10. EEG datasets for motor imagery brain-computer interface.

    Science.gov (United States)

    Cho, Hohyun; Ahn, Minkyu; Ahn, Sangtae; Kwon, Moonyoung; Jun, Sung Chan

    2017-07-01

    Most investigators of brain-computer interface (BCI) research believe that BCI can be achieved through induced neuronal activity from the cortex, but not by evoked neuronal activity. Motor imagery (MI)-based BCI is one of the standard concepts of BCI, in that the user can generate induced activity by imagining motor movements. However, variations in performance over sessions and subjects are too severe to overcome easily; therefore, a basic understanding and investigation of BCI performance variation is necessary to find critical evidence of performance variation. Here we present not only EEG datasets for MI BCI from 52 subjects, but also the results of a psychological and physiological questionnaire, EMG datasets, the locations of 3D EEG electrodes, and EEGs for non-task-related states. We validated our EEG datasets by using the percentage of bad trials, event-related desynchronization/synchronization (ERD/ERS) analysis, and classification analysis. After conventional rejection of bad trials, we showed contralateral ERD and ipsilateral ERS in the somatosensory area, which are well-known patterns of MI. Finally, we showed that 73.08% of datasets (38 subjects) included reasonably discriminative information. Our EEG datasets included the information necessary to determine statistical significance; they consisted of well-discriminated datasets (38 subjects) and less-discriminative datasets. These may provide researchers with opportunities to investigate human factors related to MI BCI performance variation, and may also achieve subject-to-subject transfer by using metadata, including a questionnaire, EEG coordinates, and EEGs for non-task-related states. © The Authors 2017. Published by Oxford University Press.
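    The ERD/ERS values used above to validate the datasets are percentage changes of band power in a task window relative to a pre-stimulus reference window, with negative values indicating desynchronization. A minimal sketch of that computation, assuming the signal has already been band-pass filtered to the band of interest:

```python
def band_power(samples):
    # Mean squared amplitude of an (already band-pass filtered) signal.
    return sum(x * x for x in samples) / len(samples)

def erd_percent(power_task, power_ref):
    # Event-related desynchronization/synchronization as a percentage
    # change relative to the reference window:
    # negative -> ERD (power drop), positive -> ERS (power rebound).
    return 100.0 * (power_task - power_ref) / power_ref
```

Contralateral ERD during motor imagery would then show up as a negative value over the sensorimotor electrodes, matching the pattern the authors report.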

  11. Spatial and Decadal Variations in Potential Evapotranspiration of China Based on Reanalysis Datasets during 1982–2010

    Directory of Open Access Journals (Sweden)

    Yunjun Yao

    2014-10-01

    Full Text Available Potential evapotranspiration (PET) is an important indicator of atmospheric evaporation demand and has been widely used to characterize hydrological change. However, sparse observations of pan evaporation (EP) prohibit the accurate characterization of the spatial and temporal patterns of PET over large spatial scales. In this study, we have estimated the PET of China using the Penman-Monteith (PM) method driven by gridded reanalysis datasets to analyze the spatial and decadal variations of PET in China during 1982–2010. The results show that the estimated PET decreased on average by 3.3 mm per year (p < 0.05) over China during 1982–1993, while PET began to increase in 1994 by 3.4 mm per year (p < 0.05). The spatial pattern of the linear trend in PET of China illustrates that a widely significant increasing trend in PET appears during 1982–2010 in Northwest China, Central China, Northeast China and South China, while there are no obvious variations of PET in other regions. Our findings illustrate that incident solar radiation (Rs) is the largest contributor to the variation of PET in China, followed by vapor pressure deficit (VPD), air temperature (Tair) and wind speed (WS). However, WS is the primary factor controlling the inter-annual variation of PET over Northwest China.

  12. Minimum energy requirements for desalination of brackish groundwater in the United States with comparison to international datasets

    Science.gov (United States)

    Ahdab, Yvana D.; Thiel, Gregory P.; Böhlke, John Karl; Stanton, Jennifer S.; Lienhard, John H.

    2018-01-01

    This paper uses chemical and physical data from a large 2017 U.S. Geological Survey groundwater dataset with wells in the U.S. and three smaller international groundwater datasets with wells primarily in Australia and Spain to carry out a comprehensive investigation of brackish groundwater composition in relation to minimum desalination energy costs. First, we compute the site-specific least work required for groundwater desalination. Least work of separation represents a baseline for the specific energy consumption of desalination systems. We develop simplified equations based on the U.S. data for least work as a function of water recovery ratio and a proxy variable for composition, either total dissolved solids, specific conductance, molality or ionic strength. We show that the U.S. correlations for total dissolved solids and molality may be applied to the international datasets. We find that total molality can be used to calculate the least work of dilute solutions with very high accuracy. Then, we examine the effects of groundwater solute composition on minimum energy requirements, showing that separation requirements increase from calcium to sodium for cations and from sulfate to bicarbonate to chloride for anions, for any given TDS concentration. We study the geographic distribution of least work, total dissolved solids, and major ions concentration across the U.S. We determine areas with both low least work and high water stress in order to highlight regions holding potential for desalination to decrease the disparity between high water demand and low water supply. Finally, we discuss the implications of the USGS results on water resource planning, by comparing least work to the specific energy consumption of brackish water reverse osmosis plants and showing the scaling propensity of major electrolytes and silica in the U.S. groundwater samples.
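    In the dilute limit, the least work of separation per unit of pure water produced, at vanishing recovery, equals the feed's osmotic pressure, which the van't Hoff relation ties directly to total molality; this is consistent with the finding above that total molality predicts least work well. A hedged sketch of that limiting estimate (the paper's own equations additionally account for recovery ratio and non-ideality):

```python
R = 8.314  # universal gas constant, J/(mol*K)

def least_work_kwh_per_m3(total_molality_mol_kg, temp_k=298.15):
    # Van't Hoff estimate: least work per m^3 of pure water at zero
    # recovery ~ feed osmotic pressure pi = c * R * T.
    # total_molality counts all dissolved species separately (e.g. Na+
    # and Cl- each), which is why a total-molality proxy works; for
    # dilute solutions mol/kg-water is taken as mol/L here.
    pi_pa = total_molality_mol_kg * 1000.0 * R * temp_k  # (mol/m^3)*(J/mol) = Pa = J/m^3
    return pi_pa / 3.6e6  # J/m^3 -> kWh/m^3
```

A feed with 1.0 mol/kg of total dissolved species (roughly seawater-strength NaCl) gives about 0.69 kWh/m³, while brackish groundwater sits well below that, which is the energy opportunity the paper highlights.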

  13. Performance evaluation of tile-based Fisher Ratio analysis using a benchmark yeast metabolome dataset.

    Science.gov (United States)

    Watson, Nathanial E; Parsons, Brendon A; Synovec, Robert E

    2016-08-12

    Performance of tile-based Fisher Ratio (F-ratio) data analysis, recently developed for discovery-based studies using comprehensive two-dimensional gas chromatography coupled with time-of-flight mass spectrometry (GC×GC-TOFMS), is evaluated with a metabolomics dataset that had been previously analyzed in great detail, but while taking a brute force approach. The previously analyzed data (referred to herein as the benchmark dataset) were intracellular extracts from Saccharomyces cerevisiae (yeast), either metabolizing glucose (repressed) or ethanol (derepressed), which define the two classes in the discovery-based analysis to find metabolites that are statistically different in concentration between the two classes. Beneficially, this previously analyzed dataset provides a concrete means to validate the tile-based F-ratio software. Herein, we demonstrate and validate the significant benefits of applying tile-based F-ratio analysis. The yeast metabolomics data are analyzed more rapidly in about one week versus one year for the prior studies with this dataset. Furthermore, a null distribution analysis is implemented to statistically determine an adequate F-ratio threshold, whereby the variables with F-ratio values below the threshold can be ignored as not class distinguishing, which provides the analyst with confidence when analyzing the hit table. Forty-six of the fifty-four benchmarked changing metabolites were discovered by the new methodology while consistently excluding all but one of the benchmarked nineteen false positive metabolites previously identified. Copyright © 2016 Elsevier B.V. All rights reserved.

  14. ASSISTments Dataset from Multiple Randomized Controlled Experiments

    Science.gov (United States)

    Selent, Douglas; Patikorn, Thanaporn; Heffernan, Neil

    2016-01-01

    In this paper, we present a dataset consisting of data generated from 22 previously and currently running randomized controlled experiments inside the ASSISTments online learning platform. This dataset provides data mining opportunities for researchers to analyze ASSISTments data in a convenient format across multiple experiments at the same time.…

  15. Spatial and temporal patterns of global onshore wind speed distribution

    International Nuclear Information System (INIS)

    Zhou, Yuyu; Smith, Steven J

    2013-01-01

    Wind power, a renewable energy source, can play an important role in electrical energy generation. Information regarding wind energy potential is important both for energy related modeling and for decision-making in the policy community. While wind speed datasets with high spatial and temporal resolution are often ultimately used for detailed planning, simpler assumptions are often used in analysis work. An accurate representation of the wind speed frequency distribution is needed in order to properly characterize wind energy potential. Using a power density method, this study estimated global variation in wind parameters as fitted to a Weibull density function using NCEP/Climate Forecast System Reanalysis (CFSR) data over land areas. The Weibull distribution performs well in fitting the time series wind speed data at most locations according to R², root-mean-square error, and power density error. The wind speed frequency distribution, as represented by the Weibull k parameter, exhibits a large amount of spatial variation, a regionally varying amount of seasonal variation, and relatively low decadal variation. We also analyzed the potential error in wind power estimation when a commonly assumed Rayleigh distribution (Weibull k = 2) is used. We find that the assumption of the same Weibull parameter across large regions can result in non-negligible errors. While large-scale wind speed data are often presented in the form of mean wind speeds, these results highlight the need to also provide information on the wind speed frequency distribution. (letter)
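    The Weibull fitting step can be sketched with a standard maximum-likelihood estimator; this is a generic illustration, since the study itself fits the parameters by a power density method.

```python
import math
import random

def fit_weibull_mle(v, iters=100):
    """Fit Weibull shape k and scale c to wind speeds v (all > 0) by
    maximum likelihood, using a damped fixed-point iteration on the
    shape equation."""
    mean_log = sum(math.log(x) for x in v) / len(v)
    k = 2.0  # start from the Rayleigh special case
    for _ in range(iters):
        s0 = sum(x ** k for x in v)
        s1 = sum(x ** k * math.log(x) for x in v)
        k_new = 1.0 / (s1 / s0 - mean_log)
        k = 0.5 * (k + k_new)  # damping for robustness
    c = (sum(x ** k for x in v) / len(v)) ** (1.0 / k)
    return k, c

# Synthetic check: draw from a known Weibull via inverse-CDF sampling
rng = random.Random(42)
k_true, c_true = 2.0, 8.0
speeds = [c_true * (-math.log(1.0 - rng.random())) ** (1.0 / k_true)
          for _ in range(5000)]
k_hat, c_hat = fit_weibull_mle(speeds)
```

    Mean wind power density then follows from the fitted parameters as P = ½ρc³Γ(1 + 3/k), which makes explicit how sensitive energy estimates are to the shape parameter k.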

  16. Determination of Nerve Fiber Diameter Distribution From Compound Action Potential: A Continuous Approach.

    Science.gov (United States)

    Un, M Kerem; Kaghazchi, Hamed

    2018-01-01

    When a signal is initiated in the nerve, it is transmitted along each nerve fiber via an action potential (called the single fiber action potential (SFAP)) which travels with a velocity that is related to the diameter of the fiber. The additive superposition of SFAPs constitutes the compound action potential (CAP) of the nerve. The fiber diameter distribution (FDD) in the nerve can be computed from the CAP data by solving an inverse problem. This is usually achieved by dividing the fibers into a finite number of diameter groups and solving a corresponding linear system to optimize the FDD. However, the fibers in a nerve can number in the thousands, and it is possible to assume a continuous distribution for the fiber diameters, which leads to a gradient optimization problem. In this paper, we have evaluated this continuous approach to the solution of the inverse problem. We have utilized an analytical function for the SFAP and assumed a polynomial form for the FDD. The inverse problem involves the optimization of the polynomial coefficients to obtain the best estimate for the FDD. We have observed that an eighth order polynomial for the FDD can capture both unimodal and bimodal fiber distributions present in vivo, even in the case of noisy CAP data. The assumed FDD form regularizes the ill-conditioned inverse problem and produces good results.
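    The inverse problem is tractable precisely because the CAP is linear in the FDD: it is an additive superposition of identically shaped SFAPs, each delayed by its conduction time. A forward-model sketch follows; the Gaussian pulse for the SFAP shape and Hursh's rule v ≈ 6d for conduction velocity are illustrative assumptions, not the paper's analytical SFAP.

```python
import math

def sfap(t, t0=0.5e-3, sigma=1e-4):
    """Illustrative monophasic SFAP shape (Gaussian pulse, peak at t0).
    Real SFAPs are bi/triphasic, but any fixed shape shows the idea."""
    return math.exp(-((t - t0) ** 2) / (2.0 * sigma ** 2))

def cap(t, fdd, distance=0.1, hursh=6.0):
    """Compound action potential at time t (s): sum over diameter classes
    of (fiber count) x SFAP delayed by distance / (hursh * d).
    fdd maps fiber diameter (um) -> fiber count; v = hursh * d (m/s)."""
    total = 0.0
    for d_um, count in fdd.items():
        delay = distance / (hursh * d_um)  # conduction time to electrode
        total += count * sfap(t - delay)
    return total
```

    Because cap(t) is a weighted sum with the FDD as the weights, substituting a polynomial for the FDD leaves the problem linear in the polynomial coefficients, which is what the continuous approach optimizes.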

  17. SPANG: a SPARQL client supporting generation and reuse of queries for distributed RDF databases.

    Science.gov (United States)

    Chiba, Hirokazu; Uchiyama, Ikuo

    2017-02-08

    Toward improved interoperability of distributed biological databases, an increasing number of datasets have been published in the standardized Resource Description Framework (RDF). Although the powerful SPARQL Protocol and RDF Query Language (SPARQL) provides a basis for exploiting RDF databases, writing SPARQL code is burdensome for users including bioinformaticians. Thus, an easy-to-use interface is necessary. We developed SPANG, a SPARQL client that has unique features for querying RDF datasets. SPANG dynamically generates typical SPARQL queries according to specified arguments. It can also call SPARQL template libraries constructed in a local system or published on the Web. Further, it enables combinatorial execution of multiple queries, each with a distinct target database. These features facilitate easy and effective access to RDF datasets and integrative analysis of distributed data. SPANG helps users to exploit RDF datasets by generation and reuse of SPARQL queries through a simple interface. This client will enhance integrative exploitation of biological RDF datasets distributed across the Web. This software package is freely available at http://purl.org/net/spang .
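    The kind of query generation SPANG performs can be illustrated with a toy generator. The function name, argument conventions, and prefix handling below are hypothetical; SPANG's actual interface and template library are richer.

```python
def generate_select_query(subject_class, properties, prefixes=None, limit=100):
    """Build a simple SPARQL SELECT query that retrieves instances of a
    class together with the given property values (toy sketch)."""
    prefixes = prefixes or {
        "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    }
    lines = ["PREFIX %s: <%s>" % (p, iri) for p, iri in sorted(prefixes.items())]
    var_names = [p.split(":")[-1] for p in properties]
    lines.append("SELECT ?s %s WHERE {" % " ".join("?" + v for v in var_names))
    lines.append("  ?s rdf:type %s ." % subject_class)
    for p, v in zip(properties, var_names):
        lines.append("  ?s %s ?%s ." % (p, v))
    lines.append("} LIMIT %d" % limit)
    return "\n".join(lines)
```

    Generating the boilerplate (prefixes, type constraint, property triples) from a few arguments is what spares users from writing full SPARQL by hand; combining several such generated queries, each against a different endpoint, gives the federated usage pattern described above.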

  18. Entropy-based heavy tailed distribution transformation and visual analytics for monitoring massive network traffic

    Science.gov (United States)

    Han, Keesook J.; Hodge, Matthew; Ross, Virginia W.

    2011-06-01

    For monitoring network traffic, there is an enormous cost in collecting, storing, and analyzing network traffic datasets. Data mining based network traffic analysis is of growing interest in the cyber security community, but is computationally expensive for finding correlations between attributes in massive network traffic datasets. To lower the cost and reduce computational complexity, it is desirable to perform feasible statistical processing on effective reduced datasets instead of on the original full datasets. Because of the dynamic behavior of network traffic, traffic traces exhibit mixtures of heavy tailed statistical distributions or overdispersion. Heavy tailed network traffic characterization and visualization are important and essential tasks for measuring network performance for quality of service. However, heavy tailed distributions are limited in their ability to characterize real-time network traffic due to the difficulty of parameter estimation. The Entropy-Based Heavy Tailed Distribution Transformation (EHTDT) was developed to convert the heavy tailed distribution into a transformed distribution amenable to linear approximation. The EHTDT linearization has the advantage of being able to characterize and aggregate overdispersion of network traffic in real time. Results of applying the EHTDT for innovative visual analytics to real network traffic data are presented.

  19. Estimating parameters for probabilistic linkage of privacy-preserved datasets.

    Science.gov (United States)

    Brown, Adrian P; Randall, Sean M; Ferrante, Anna M; Semmens, James B; Boyd, James H

    2017-07-10

    Probabilistic record linkage is a process used to bring together person-based records from within the same dataset (de-duplication) or from disparate datasets using pairwise comparisons and matching probabilities. The linkage strategy and associated match probabilities are often estimated through investigations into data quality and manual inspection. However, as privacy-preserved datasets comprise encrypted data, such methods are not possible. In this paper, we present a method for estimating the probabilities and threshold values for probabilistic privacy-preserved record linkage using Bloom filters. Our method was tested through a simulation study using synthetic data, followed by an application using real-world administrative data. Synthetic datasets were generated with error rates from zero to 20% error. Our method was used to estimate parameters (probabilities and thresholds) for de-duplication linkages. Linkage quality was determined by F-measure. Each dataset was privacy-preserved using separate Bloom filters for each field. Match probabilities were estimated using the expectation-maximisation (EM) algorithm on the privacy-preserved data. Threshold cut-off values were determined by an extension to the EM algorithm allowing linkage quality to be estimated for each possible threshold. De-duplication linkages of each privacy-preserved dataset were performed using both estimated and calculated probabilities. Linkage quality using the F-measure at the estimated threshold values was also compared to the highest F-measure. Three large administrative datasets were used to demonstrate the applicability of the probability and threshold estimation technique on real-world data. Linkage of the synthetic datasets using the estimated probabilities produced an F-measure that was comparable to the F-measure using calculated probabilities, even with up to 20% error. Linkage of the administrative datasets using estimated probabilities produced an F-measure that was higher
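    The Bloom-filter field encoding underlying this style of privacy-preserving linkage can be sketched as follows. This is the common construction after Schnell et al. (bigrams double-hashed into an m-bit filter, compared by Dice coefficient); the parameters m and k here are illustrative.

```python
import hashlib

def bigrams(field):
    """Split a field value into padded character bigrams."""
    padded = "_%s_" % field.lower()
    return {padded[i:i + 2] for i in range(len(padded) - 1)}

def bloom_encode(field, m=256, k=8):
    """Encode a field's bigrams into an m-bit Bloom filter (returned as
    the set of set bit positions), using k double-hashed positions per
    bigram so the raw value never leaves the data custodian."""
    bits = set()
    for g in bigrams(field):
        h1 = int(hashlib.sha1(g.encode()).hexdigest(), 16)
        h2 = int(hashlib.md5(g.encode()).hexdigest(), 16)
        for j in range(k):
            bits.add((h1 + j * h2) % m)
    return bits

def dice(b1, b2):
    """Dice coefficient between two Bloom filters: the similarity score
    fed into the EM-based probability estimation."""
    return 2.0 * len(b1 & b2) / (len(b1) + len(b2))
```

    Similar field values share most bigrams and hence most bit positions, so Dice similarity on the encrypted filters approximates string similarity on the clear text, which is what makes probabilistic matching possible without decryption.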

  20. Viking Seismometer PDS Archive Dataset

    Science.gov (United States)

    Lorenz, R. D.

    2016-12-01

    The Viking Lander 2 seismometer operated successfully for over 500 Sols on the Martian surface, recording at least one likely candidate Marsquake. The Viking mission, in an era when data handling hardware (both on board and on the ground) was limited in capability, predated modern planetary data archiving; the ad-hoc repositories of the data, and the very low-level record at NSSDC, were neither convenient to process nor well known. In an effort supported by the NASA Mars Data Analysis Program, we have converted the bulk of the Viking dataset (namely the 49,000 and 270,000 records made in High- and Event-modes at 20 and 1 Hz respectively) into a simple ASCII table format. Additionally, since wind-generated lander motion is a major component of the signal, contemporaneous meteorological data are included in summary records to facilitate correlation. These datasets are being archived at the PDS Geosciences Node. In addition to brief instrument and dataset descriptions, the archive includes code snippets in the freely-available language 'R' to demonstrate plotting and analysis. Further, we present examples of lander-generated noise associated with the sampler arm, instrument dumps and other mechanical operations.

  1. Distributed terascale volume visualization using distributed shared virtual memory

    KAUST Repository

    Beyer, Johanna

    2011-10-01

    Table 1 illustrates the impact of different distribution unit sizes, different screen resolutions, and numbers of GPU nodes. We use two and four GPUs (NVIDIA Quadro 5000 with 2.5 GB memory) and a mouse cortex EM dataset (see Figure 2) of resolution 21,494 x 25,790 x 1,850 = 955 GB. The size of the virtual distribution units significantly influences the data distribution between nodes. Small distribution units result in a high depth complexity for compositing. Large distribution units lead to a low utilization of GPUs, because in the worst case only a single distribution unit will be in view, which is rendered by only a single node. The choice of an optimal distribution unit size depends on three major factors: the output screen resolution, the block cache size on each node, and the number of nodes. Currently, we are working on optimizing the compositing step and network communication between nodes. © 2011 IEEE.

  2. The overlapping distribution method to compute chemical potentials of chain molecules

    NARCIS (Netherlands)

    Mooij, G.C.A.M.; Frenkel, D.

    1994-01-01

    The chemical potential of continuously deformable chain molecules can be estimated by measuring the average Rosenbluth weight associated with the virtual insertion of a molecule. We show how to generalize the overlapping-distribution method of Bennett to histograms of Rosenbluth weights. In this way

  3. THE COMPARATIVE ANALYSIS OF TWO DIFFERENT STATISTICAL DISTRIBUTIONS USED TO ESTIMATE THE WIND ENERGY POTENTIAL

    Directory of Open Access Journals (Sweden)

    Mehmet KURBAN

    2007-01-01

    Full Text Available In this paper, the wind energy potential of the region is analyzed with the Weibull and Rayleigh statistical distribution functions, using the wind speed data measured every 15 seconds in July, August, September, and October of 2005 at the 10 m level of a 30 m observation pole at the wind observation station constructed within the scope of the scientific research project titled "The Construction of a Hybrid (Wind-Solar) Power Plant Model by Determining the Wind and Solar Potential in the Iki Eylul Campus of A.U." supported by Anadolu University. The maximum likelihood method is used to find the parameters of these distributions. The analysis for the months considered indicates that the Weibull distribution models the wind speeds better than the Rayleigh distribution. Furthermore, the error in the monthly power density values computed using the Weibull distribution is smaller than that obtained with the Rayleigh distribution.
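    The power density comparison between the two distributions follows directly from their moments; a small sketch using the standard formulas, with k = 2 recovering the Rayleigh case:

```python
import math

def weibull_power_density(k, c, rho=1.225):
    """Mean wind power density (W/m^2) implied by Weibull parameters:
    P = 1/2 * rho * c^3 * Gamma(1 + 3/k)."""
    return 0.5 * rho * c ** 3 * math.gamma(1.0 + 3.0 / k)

def rayleigh_power_density(mean_speed, rho=1.225):
    """Rayleigh (Weibull k = 2) power density for a given mean speed."""
    c = mean_speed / math.gamma(1.5)  # scale parameter matching that mean
    return weibull_power_density(2.0, c, rho)
```

    At the same mean speed, a fitted shape k < 2 implies more energy in the tail than the Rayleigh assumption and k > 2 less, which is why the choice of distribution changes the monthly power density estimates.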

  4. Influences of climate change on the potential distribution of Lutzomyia longipalpis sensu lato (Psychodidae: Phlebotominae).

    Science.gov (United States)

    Peterson, A Townsend; Campbell, Lindsay P; Moo-Llanes, David A; Travi, Bruno; González, Camila; Ferro, María Cristina; Ferreira, Gabriel Eduardo Melim; Brandão-Filho, Sinval P; Cupolillo, Elisa; Ramsey, Janine; Leffer, Andreia Mauruto Chernaki; Pech-May, Angélica; Shaw, Jeffrey J

    2017-09-01

    This study explores the present-day distribution of Lutzomyia longipalpis in relation to climate, and transfers the knowledge gained to likely future climatic conditions to predict changes in the species' potential distribution. We used ecological niche models calibrated based on occurrences of the species complex from across its known geographic range. Anticipated distributional changes varied by region, from stability to expansion or decline. Overall, models indicated no significant north-south expansion beyond present boundaries. However, some areas suitable both at present and in the future (e.g., Pacific coast of Ecuador and Peru) may offer opportunities for distributional expansion. Our models anticipated potential range expansion in southern Brazil and Argentina, but were variably successful in anticipating specific cases. The most significant climate-related change anticipated in the species' range was with regard to range continuity in the Amazon Basin, which is likely to increase in coming decades. Rather than making detailed forecasts of actual locations where Lu. longipalpis will appear in coming years, our models make interesting and potentially important predictions of broader-scale distributional tendencies that can inform health policy and mitigation efforts. Copyright © 2017 Australian Society for Parasitology. Published by Elsevier Ltd. All rights reserved.

  5. Digital Astronaut Photography: A Discovery Dataset for Archaeology

    Science.gov (United States)

    Stefanov, William L.

    2010-01-01

    Astronaut photography acquired from the International Space Station (ISS) using commercial off-the-shelf cameras offers a freely-accessible source for high to very high resolution (4-20 m/pixel) visible-wavelength digital data of Earth. Since ISS Expedition 1 in 2000, over 373,000 images of the Earth-Moon system (including land surface, ocean, atmospheric, and lunar images) have been added to the Gateway to Astronaut Photography of Earth online database (http://eol.jsc.nasa.gov ). Handheld astronaut photographs vary in look angle, time of acquisition, solar illumination, and spatial resolution. These attributes of digital astronaut photography result from a unique combination of ISS orbital dynamics, mission operations, camera systems, and the individual skills of the astronaut. The variable nature of astronaut photography makes the dataset uniquely useful for archaeological applications in comparison with more traditional nadir-viewing multispectral datasets acquired from unmanned orbital platforms. For example, surface features such as trenches, walls, ruins, urban patterns, and vegetation clearing and regrowth patterns may be accentuated by low sun angles and oblique viewing conditions (Fig. 1). High spatial resolution digital astronaut photographs can also be used with sophisticated land cover classification and spatial analysis approaches like Object Based Image Analysis, increasing the potential for use in archaeological characterization of landscapes and specific sites.

  6. Potential effects of climate change on the distribution range of the main silicate sinker of the Southern Ocean.

    Science.gov (United States)

    Pinkernell, Stefan; Beszteri, Bánk

    2014-08-01

    Fragilariopsis kerguelensis, a dominant diatom species throughout the Antarctic Circumpolar Current, is considered to be one of the main drivers of the biological silicate pump. Here, we study the distribution of this important species and the expected consequences of climate change upon it, using correlative species distribution modeling (SDM) and publicly available presence-only data. As experience with SDM is scarce for marine phytoplankton, this also serves as a pilot study for this organism group. We used the maximum entropy method to calculate distribution models for the diatom F. kerguelensis based on yearly and monthly environmental data (sea surface temperature, salinity, nitrate and silicate concentrations). Observation data were harvested from GBIF and the Global Diatom Database, and for further analyses also from the Hustedt Diatom Collection (BRM). The models were projected on current yearly and seasonal environmental data to study the current distribution and its seasonality. Furthermore, we projected the seasonal model on future environmental data obtained from climate models for the year 2100. Projected on current yearly averaged environmental data, all models showed similar distribution patterns for F. kerguelensis. The monthly model showed seasonality, for example, a shift of the southern distribution boundary toward the north in the winter. Projections on future scenarios resulted in a moderately to negligibly shrinking distribution area and a change in seasonality. We found a substantial bias in the publicly available observation datasets, which could be reduced by additional observation records we obtained from the Hustedt Diatom Collection. Present-day distribution patterns inferred from the models coincided well with background knowledge and previous reports about F. kerguelensis distribution, showing that maximum entropy-based distribution models are suitable to map distribution patterns for oceanic planktonic organisms. Our scenario projections indicate

  7. A distributed algorithm for machine learning

    Science.gov (United States)

    Chen, Shihong

    2018-04-01

    This paper considers a distributed learning problem in which a group of machines in a connected network, each learning its own local dataset, aim to reach a consensus at an optimal model, by exchanging information only with their neighbors but without transmitting data. A distributed algorithm is proposed to solve this problem under appropriate assumptions.
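    The basic building block of such data-free consensus schemes is average consensus over the network graph; a minimal sketch using Metropolis mixing weights follows. The weighting scheme is an illustrative choice, since the paper's exact algorithm and assumptions are not specified here.

```python
def metropolis_consensus(values, neighbors, steps=100):
    """Each node repeatedly replaces its value with a Metropolis-weighted
    average of its own and its neighbors' values. For a connected graph
    this converges to the global average, with nodes exchanging only
    their current estimates, never their raw data."""
    n = len(values)
    deg = [len(neighbors[i]) for i in range(n)]
    x = list(values)
    for _ in range(steps):
        nxt = []
        for i in range(n):
            w_sum = 0.0
            acc = 0.0
            for j in neighbors[i]:
                w = 1.0 / (1.0 + max(deg[i], deg[j]))  # Metropolis weight
                acc += w * x[j]
                w_sum += w
            acc += (1.0 - w_sum) * x[i]  # remaining weight on self
            nxt.append(acc)
        x = nxt
    return x

# 4-node ring: every node ends up near the average of all initial values
ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
final = metropolis_consensus([1.0, 2.0, 3.0, 4.0], ring, steps=200)
```

    In distributed learning, each mixing step is typically interleaved with a local gradient step on the node's own dataset, so the network jointly descends toward the optimal model while only model estimates cross the links.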

  8. Introduction of a simple-model-based land surface dataset for Europe

    Science.gov (United States)

    Orth, Rene; Seneviratne, Sonia I.

    2015-04-01

    Land surface hydrology can play a crucial role during extreme events such as droughts, floods and even heat waves. We introduce in this study a new hydrological dataset for Europe that consists of soil moisture, runoff and evapotranspiration (ET). It is derived with a simple water balance model (SWBM) forced with precipitation, temperature and net radiation. The SWBM dataset extends over the period 1984-2013 with a daily time step and 0.5° × 0.5° resolution. We employ a novel calibration approach, in which we consider 300 random parameter sets chosen from an observation-based range. Using several independent validation datasets representing soil moisture (or terrestrial water content), ET and streamflow, we identify the best performing parameter set and hence the new dataset. To illustrate its usefulness, the SWBM dataset is compared against several state-of-the-art datasets (ERA-Interim/Land, MERRA-Land, GLDAS-2-Noah, simulations of the Community Land Model Version 4), using all validation datasets as reference. For soil moisture dynamics it outperforms the benchmarks. Therefore the SWBM soil moisture dataset constitutes a reasonable alternative to sparse measurements, little validated model results, or proxy data such as precipitation indices. Also in terms of runoff the SWBM dataset performs well, whereas the evaluation of the SWBM ET dataset is overall satisfactory, but the dynamics are less well captured for this variable. This highlights the limitations of the dataset, as it is based on a simple model that uses uniform parameter values. Hence some processes impacting ET dynamics may not be captured, and quality issues may occur in regions with complex terrain. Even though the SWBM is well calibrated, it cannot replace more sophisticated models; but as their calibration is a complex task the present dataset may serve as a benchmark in future. In addition we investigate the sources of skill of the SWBM dataset and find that the parameter set has a similar
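    The structure of such a simple water balance model can be sketched as a single bucket with wetness-dependent runoff and ET. The functional forms and parameter values below are illustrative assumptions, not the exact SWBM formulation.

```python
def bucket_step(soil, precip, pet, capacity=400.0, gamma=2.0, alpha=1.0):
    """One daily step of a conceptual one-bucket water balance model.
    soil, precip, pet in mm; runoff fraction and ET efficiency both
    increase with soil wetness (soil / capacity)."""
    wetness = soil / capacity
    runoff = precip * wetness ** gamma   # saturation-dependent runoff
    et = pet * alpha * wetness           # soil-moisture-limited ET
    soil_new = soil + precip - runoff - et
    soil_new = max(0.0, min(capacity, soil_new))  # keep within the bucket
    return soil_new, runoff, et
```

    Calibration in the spirit described above then amounts to sampling parameter sets (capacity, gamma, alpha) from observation-based ranges and keeping the set that best reproduces independent soil moisture, ET and streamflow data.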

  9. Modeling impacts of human footprint and soil variability on the potential distribution of invasive plant species in different biomes

    Science.gov (United States)

    Wan, Ji-Zhong; Wang, Chun-Jing; Yu, Fei-Hai

    2017-11-01

    Human footprint and soil variability may be important in shaping the spread of invasive plant species (IPS). However, until now, little has been known about how human footprint and soil variability affect the potential distribution of IPS in different biomes. We used Maxent modeling to project the potential distribution of 29 IPS with wide distributions and long introduction histories in China, based on various combinations of climatic correlates, soil characteristics and human footprint. Then, we evaluated the relative importance of each type of environmental variable (climate, soil and human footprint), as well as the differences in range and similarity of the potential distribution of IPS between biomes. Human footprint and soil variables contributed to the prediction of the potential distribution of IPS, and different types of biomes showed varying responses to, and degrees of impact from, the tested variables. Human footprint and soil variability had the greatest tendency to increase the potential distribution of IPS in Montane Grasslands and Shrublands. We propose integrating assessments of the impacts of human footprint and soil variability on the potential distribution of IPS in different biomes into the prevention and control of plant invasion.

  10. Using large hydrological datasets to create a robust, physically based, spatially distributed model for Great Britain

    Science.gov (United States)

    Lewis, Elizabeth; Kilsby, Chris; Fowler, Hayley

    2014-05-01

    The impact of climate change on hydrological systems requires further quantification in order to inform water management. This study intends to conduct such analysis using hydrological models. Such models are of varying forms, of which conceptual, lumped parameter models and physically-based models are two important types. The majority of hydrological studies use conceptual models calibrated against measured river flow time series in order to represent catchment behaviour. This method often shows impressive results for specific problems in gauged catchments. However, the results may not be robust under non-stationary conditions such as climate change, as physical processes and relationships amenable to change are not accounted for explicitly. Moreover, conceptual models are less readily applicable to ungauged catchments, in which hydrological predictions are also required. As such, the physically based, spatially distributed model SHETRAN is used in this study to develop a robust and reliable framework for modelling historic and future behaviour of gauged and ungauged catchments across the whole of Great Britain. In order to achieve this, a large array of data completely covering Great Britain for the period 1960-2006 has been collated and efficiently stored ready for model input. The data processed include a DEM, rainfall, PE and maps of geology, soil and land cover. A desire to make the modelling system easy for others to work with led to the development of a user-friendly graphical interface. This allows non-experts to set up and run a catchment model in a few seconds, a process that can normally take weeks or months. The quality and reliability of the extensive dataset for modelling hydrological processes has also been evaluated. One aspect of this has been an assessment of error and uncertainty in rainfall input data, as well as the effects of temporal resolution in precipitation inputs on model calibration. SHETRAN has been updated to accept gridded rainfall

  11. Momentum distribution, vibrational dynamics, and the potential of mean force in ice

    Science.gov (United States)

    Lin, Lin; Morrone, Joseph A.; Car, Roberto; Parrinello, Michele

    2011-06-01

    By analyzing the momentum distribution obtained from path integral and phonon calculations we find that the protons in hexagonal ice experience an anisotropic quasiharmonic effective potential with three distinct principal frequencies that reflect molecular orientation. Due to the importance of anisotropy, anharmonic features of the environment cannot be extracted from existing experimental distributions that involve the spherical average. The full directional distribution is required, and we give a theoretical prediction for this quantity that could be verified in future experiments. Within the quasiharmonic context, anharmonicity in the ground-state dynamics of the proton is substantial and has quantal origin, a finding that impacts the interpretation of several spectroscopies.

  12. A hybrid organic-inorganic perovskite dataset

    Science.gov (United States)

    Kim, Chiho; Huan, Tran Doan; Krishnan, Sridevi; Ramprasad, Rampi

    2017-05-01

    Hybrid organic-inorganic perovskites (HOIPs) have been attracting a great deal of attention due to their versatility of electronic properties and fabrication methods. We prepare a dataset of 1,346 HOIPs, which features 16 organic cations, 3 group-IV cations and 4 halide anions. Using a combination of an atomic structure search method and density functional theory calculations, the optimized structures, the bandgap, the dielectric constant, and the relative energies of the HOIPs are uniformly prepared and validated by comparing with relevant experimental and/or theoretical data. We make the dataset available at Dryad Digital Repository, NoMaD Repository, and Khazana Repository (http://khazana.uconn.edu/), hoping that it could be useful for future data-mining efforts that can explore possible structure-property relationships and phenomenological models. Progressive extension of the dataset is expected as new organic cations become appropriate within the HOIP framework, and as additional properties are calculated for the new compounds found.

  13. Genomics dataset of unidentified disclosed isolates

    Directory of Open Access Journals (Sweden)

    Bhagwan N. Rekadwad

    2016-09-01

    Full Text Available Analysis of DNA sequences is necessary for higher hierarchical classification of organisms. It gives clues about the characteristics of organisms and their taxonomic position. This dataset was chosen to find complexities in the unidentified DNA in the disclosed patents. A total of 17 unidentified DNA sequences were thoroughly analyzed. Quick response (QR) codes were generated; these are helpful for quick identification of isolates. Analysis of the AT/GC content of the DNA sequences was carried out; AT/GC content is helpful for studying their stability at different temperatures. Additionally, a dataset of cleavage codes and enzyme codes from the restriction digestion study, which is helpful for performing studies using short DNA sequences, was reported. The dataset disclosed here is new revelatory data for the exploration of unique DNA sequences for evaluation, identification, comparison and analysis. Keywords: BioLABs, Blunt ends, Genomics, NEB cutter, Restriction digestion, Short DNA sequences, Sticky ends

  14. On radiation of electrons moving in braking electric fields with distributed potential

    International Nuclear Information System (INIS)

    Fedulov, V.I.; Suvorov, V.I.; Umirov, U.R.

    2002-01-01

    The characteristics of the radiation of electrons moving in flat structures with a braking electric field, created by an accelerating electrode and another electrode with distributed potential, are investigated. Analytical expressions are derived for the conditions under which an electron completely loses its energy in a structure with distributed potential and for the onset of electron oscillations. Expressions connecting the electron energy with the point of entry and its oscillation frequency are also derived. A mathematical model of the radiation process is proposed as a function of the energy and point of entry of the electron. The relation between the radiation wavelength and the position of the point of entry of electrons into the braking electric field is found. The possibility of optical radiation emerging in solid media upon the passage of charged particles through matter is shown. (author)

  15. Mapping species distributions with MAXENT using a geographically biased sample of presence data: a performance assessment of methods for correcting sampling bias.

    Science.gov (United States)

    Fourcade, Yoan; Engler, Jan O; Rödder, Dennis; Secondi, Jean

    2014-01-01

    MAXENT is now a common species distribution modeling (SDM) tool used by conservation practitioners for predicting the distribution of a species from a set of records and environmental predictors. However, datasets of species occurrence used to train the model are often biased in geographical space because of unequal sampling effort across the study area. This bias may be a source of strong inaccuracy in the resulting model and could lead to incorrect predictions. Although a number of sampling bias correction methods have been proposed, there is no consensus guideline on how to account for it. We compared here the performance of five methods of bias correction on three datasets of species occurrence: one "virtual" dataset derived from a land cover map, and two actual datasets for a turtle (Chrysemys picta) and a salamander (Plethodon cylindraceus). We subjected these datasets to four types of sampling bias corresponding to potential types of empirical bias. We applied the five correction methods to the biased samples and compared the outputs of distribution models to unbiased datasets to assess the overall correction performance of each method. The results revealed that the ability of the methods to correct the initial sampling bias varied greatly depending on bias type, bias intensity and species. However, simple systematic sampling of records consistently ranked among the best performing across the range of conditions tested, whereas the other methods performed more poorly in most cases. The strong effect of initial conditions on correction performance highlights the need for further research to develop a step-by-step guideline to account for sampling bias. At present, however, systematic sampling appears to be the most efficient way of correcting sampling bias and should be advised in most cases.
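    Systematic sampling of records, the best-performing correction here, amounts to spatial thinning of occurrences on a regular grid; a minimal sketch follows, with the grid cell size in degrees being an assumption of the example.

```python
def systematic_sample(records, cell_size=1.0):
    """Keep at most one occurrence record per grid cell, evening out
    geographically clustered sampling effort before model training.
    records: iterable of (lon, lat) pairs; cell_size in degrees."""
    kept = {}
    for lon, lat in records:
        cell = (int(lon // cell_size), int(lat // cell_size))
        if cell not in kept:  # first record encountered in a cell wins
            kept[cell] = (lon, lat)
    return list(kept.values())
```

    Heavily surveyed areas thus contribute no more training points per unit area than sparsely surveyed ones, which is the mechanism by which this correction counteracts unequal sampling effort.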

  16. Hydrological modeling of the Peruvian–Ecuadorian Amazon Basin using GPM-IMERG satellite-based precipitation dataset

    Directory of Open Access Journals (Sweden)

    R. Zubieta

    2017-07-01

    Full Text Available In the last two decades, rainfall estimates provided by the Tropical Rainfall Measuring Mission (TRMM) have proven applicable in hydrological studies. The Global Precipitation Measurement (GPM) mission, which provides the new generation of rainfall estimates, is now considered a global successor to TRMM. The usefulness of GPM data in hydrological applications, however, has not yet been evaluated over the Andean and Amazonian regions. This study uses GPM data provided by the Integrated Multi-satellite Retrievals for GPM (IMERG) final-run product as input to a distributed hydrological model for the Amazon Basin of Peru and Ecuador for a 16-month period (from March 2014 to June 2015) when all datasets are available. TRMM products (TMPA V7 and TMPA RT datasets) and a gridded precipitation dataset processed from observed rainfall are used for comparison. The results indicate that precipitation data derived from GPM-IMERG correspond more closely to TMPA V7 than to TMPA RT datasets, but both GPM-IMERG and TMPA V7 precipitation data tend to overestimate observed rainfall (by 11.1 % and 15.7 %, respectively). In general, GPM-IMERG, TMPA V7 and TMPA RT correlate with observed rainfall, with a similar number of rain events correctly detected (∼20 %). Statistical analysis of modeled streamflows indicates that GPM-IMERG is as useful as the TMPA V7 or TMPA RT datasets in southern regions (Ucayali Basin). GPM-IMERG, TMPA V7 and TMPA RT do not properly simulate streamflows in northern regions (Marañón and Napo basins), probably because of the lack of adequate rainfall estimates in northern Peru and the Ecuadorian Amazon.
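
    The comparison statistics quoted above (percent overestimation relative to gauges, and the fraction of rain events correctly detected) can be computed along the following lines; the data layout, rain threshold, and values are hypothetical, not the study's actual processing chain:

```python
import numpy as np

def relative_bias(sat, gauge):
    """Overall over/underestimation of a satellite rainfall product
    relative to observed rainfall, in percent of the gauge total."""
    sat, gauge = np.asarray(sat, float), np.asarray(gauge, float)
    return 100.0 * (sat.sum() - gauge.sum()) / gauge.sum()

def probability_of_detection(sat, gauge, thresh=1.0):
    """Fraction of observed rain events (gauge >= thresh) that the
    satellite product also reports as rain."""
    sat, gauge = np.asarray(sat, float), np.asarray(gauge, float)
    events = gauge >= thresh
    return (sat[events] >= thresh).mean()

gauge = [0.0, 5.0, 12.0, 0.0, 3.0]  # daily totals, mm (toy values)
imerg = [0.2, 6.0, 13.0, 0.0, 0.4]  # hypothetical satellite estimates
print(relative_bias(imerg, gauge))           # percent bias vs. gauges
print(probability_of_detection(imerg, gauge))  # fraction of events detected
```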

  17. Palmprint and Palmvein Recognition Based on DCNN and A New Large-Scale Contactless Palmvein Dataset

    Directory of Open Access Journals (Sweden)

    Lin Zhang

    2018-03-01

    Full Text Available Among the members of biometric identifiers, the palmprint and the palmvein have received significant attention due to their stability, uniqueness, and non-intrusiveness. In this paper, we investigate the problem of palmprint/palmvein recognition and propose a Deep Convolutional Neural Network (DCNN) based scheme, namely PalmRCNN (short for palmprint/palmvein recognition using CNNs). The effectiveness and efficiency of PalmRCNN have been verified through extensive experiments conducted on benchmark datasets. In addition, though substantial effort has been devoted to palmvein recognition, it is still quite difficult for researchers to know the potential discriminating capability of the contactless palmvein. One of the root reasons is that a large-scale, publicly available dataset comprising high-quality, contactless palmvein images is still lacking. To this end, a user-friendly acquisition device for collecting high-quality contactless palmvein images is first designed and developed in this work. Then, a large-scale palmvein image dataset is established, comprising 12,000 images acquired from 600 different palms in two separate collection sessions. The collected dataset is now publicly available.

  18. IPCC Socio-Economic Baseline Dataset

    Data.gov (United States)

    National Aeronautics and Space Administration — The Intergovernmental Panel on Climate Change (IPCC) Socio-Economic Baseline Dataset consists of population, human development, economic, water resources, land...

  19. The LANDFIRE Refresh strategy: updating the national dataset

    Science.gov (United States)

    Nelson, Kurtis J.; Connot, Joel A.; Peterson, Birgit E.; Martin, Charley

    2013-01-01

    The LANDFIRE Program provides comprehensive vegetation and fuel datasets for the entire United States. As with many large-scale ecological datasets, vegetation and landscape conditions must be updated periodically to account for disturbances, growth, and natural succession. The LANDFIRE Refresh effort was the first attempt to consistently update these products nationwide. It incorporated a combination of specific systematic improvements to the original LANDFIRE National data, remote sensing based disturbance detection methods, field collected disturbance information, vegetation growth and succession modeling, and vegetation transition processes. This resulted in the creation of two complete datasets for all 50 states: LANDFIRE Refresh 2001, which includes the systematic improvements, and LANDFIRE Refresh 2008, which includes the disturbance and succession updates to the vegetation and fuel data. The new datasets are comparable for studying landscape changes in vegetation type and structure over a decadal period, and provide the most recent characterization of fuel conditions across the country. The applicability of the new layers is discussed and the effects of using the new fuel datasets are demonstrated through a fire behavior modeling exercise using the 2011 Wallow Fire in eastern Arizona as an example.

  20. Software ion scan functions in analysis of glycomic and lipidomic MS/MS datasets.

    Science.gov (United States)

    Haramija, Marko

    2018-03-01

    Hardware ion scan functions unique to tandem mass spectrometry (MS/MS) mode of data acquisition, such as precursor ion scan (PIS) and neutral loss scan (NLS), are important for selective extraction of key structural data from complex MS/MS spectra. However, their software counterparts, software ion scan (SIS) functions, are still not regularly available. Software ion scan functions can be easily coded for additional functionalities, such as software multiple precursor ion scan, software no ion scan, and software variable ion scan functions. These are often necessary, since they allow more efficient analysis of complex MS/MS datasets, often encountered in glycomics and lipidomics. Software ion scan functions can be easily coded by using modern script languages and can be independent of instrument manufacturer. Here we demonstrate the utility of SIS functions on a medium-size glycomic MS/MS dataset. Knowledge of sample properties, as well as of diagnostic and conditional diagnostic ions crucial for data analysis, was needed. Based on the tables constructed with the output data from the SIS functions performed, a detailed analysis of a complex MS/MS glycomic dataset could be carried out in a quick, accurate, and efficient manner. Glycomic research is progressing slowly, and with respect to the MS experiments, one of the key obstacles for moving forward is the lack of appropriate bioinformatic tools necessary for fast analysis of glycomic MS/MS datasets. Adding novel SIS functionalities to the glycomic MS/MS toolbox has a potential to significantly speed up the glycomic data analysis process. Similar tools are useful for analysis of lipidomic MS/MS datasets as well, as will be discussed briefly. Copyright © 2017 John Wiley & Sons, Ltd.
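
    As an illustration of what a software ion scan (SIS) function can look like, the sketch below implements a software neutral-loss scan over a toy spectrum collection. The data layout, tolerance, and the 162.05 Da hexose loss are assumptions made for the example, not the paper's actual code:

```python
def neutral_loss_scan(spectra, loss, tol=0.01):
    """Software neutral-loss scan: report MS/MS spectra containing a peak
    pair separated by the given neutral-loss mass.

    'spectra' maps a precursor m/z to its fragment m/z list; any spectrum
    whose peak list (fragments plus the precursor) contains two peaks
    differing by 'loss' within 'tol' is flagged.
    """
    hits = []
    for precursor, fragments in spectra.items():
        peaks = sorted(fragments + [precursor])
        found = any(abs((b - a) - loss) <= tol
                    for i, a in enumerate(peaks) for b in peaks[i + 1:])
        if found:
            hits.append(precursor)
    return hits

# Scan for a 162.05 Da loss (a hexose residue, common in glycomics)
spectra = {
    690.30: [528.25, 366.20, 204.15],  # successive 162.05 losses
    512.20: [300.10, 158.00],          # no 162.05 spacing
}
print(neutral_loss_scan(spectra, 162.05))  # [690.3]
```

    A software precursor ion scan would follow the same pattern but test for the presence of a diagnostic fragment m/z instead of a mass difference.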

  1. VisIVO: A Library and Integrated Tools for Large Astrophysical Dataset Exploration

    Science.gov (United States)

    Becciani, U.; Costa, A.; Ersotelos, N.; Krokos, M.; Massimino, P.; Petta, C.; Vitello, F.

    2012-09-01

    VisIVO provides an integrated suite of tools and services that can be used in many scientific fields. VisIVO development started in the Virtual Observatory framework. VisIVO allows users to meaningfully visualize highly complex, large-scale datasets and create movies of these visualizations based on distributed infrastructures. VisIVO supports high-performance, multi-dimensional visualization of large-scale astrophysical datasets. Users can rapidly obtain meaningful visualizations while preserving full and intuitive control of the relevant parameters. VisIVO consists of VisIVO Desktop - a stand-alone application for interactive visualization on standard PCs, VisIVO Server - a platform for high-performance visualization, VisIVO Web - a custom-designed web portal, VisIVOSmartphone - an application to exploit the VisIVO Server functionality, and the latest VisIVO feature: VisIVO Library, which allows a job running on a computational system (grid, HPC, etc.) to produce movies directly from the code's internal data arrays without the need to produce intermediate files. This is particularly important when running on large computational facilities, where the user wants to look at the results during the data-production phase. For example, in grid computing facilities, images can be produced directly in the grid catalogue while the user code is running on a system that cannot be directly accessed by the user (a worker node). The deployment of VisIVO on the DG and gLite is carried out with the support of the EDGI and EGI-Inspire projects. Depending on the structure and size of the datasets under consideration, the data exploration process could take several hours of CPU time for creating customized views, and the production of movies could potentially last several days. For this reason an MPI-parallel version of VisIVO could play a fundamental role in increasing performance, e.g. it could be automatically deployed on nodes that are MPI aware. A central concept in our development is thus to

  2. Mr-Moose: An advanced SED-fitting tool for heterogeneous multi-wavelength datasets

    Science.gov (United States)

    Drouart, G.; Falkendal, T.

    2018-04-01

    We present the public release of Mr-Moose, a fitting procedure that is able to perform multi-wavelength and multi-object spectral energy distribution (SED) fitting in a Bayesian framework. This procedure is able to handle a large variety of cases, from an isolated source to blended multi-component sources from a heterogeneous dataset (i.e. a range of observation sensitivities and spectral/spatial resolutions). Furthermore, Mr-Moose handles upper limits during the fitting process in a continuous way, allowing models to become gradually less probable as upper limits are approached. The aim is to propose a simple-to-use yet highly versatile fitting tool for handling increasing source complexity when combining multi-wavelength datasets with fully customisable filter/model databases. The complete control given to the user is one advantage, which avoids the traditional problems related to the "black box" effect, where parameter or model tunings are impossible and can lead to overfitting and/or over-interpretation of the results. Also, while a basic knowledge of Python and statistics is required, the code aims to be sufficiently user-friendly for non-experts. We demonstrate the procedure on three cases: two artificially-generated datasets and a previous result from the literature. In particular, the most complex case (inspired by a real source, combining Herschel, ALMA and VLA data) in the context of extragalactic SED fitting makes Mr-Moose a particularly attractive SED fitting tool when dealing with partially blended sources, without the need for data deconvolution.
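
    The "continuous" treatment of upper limits described above is commonly implemented with an error-function likelihood term; the abstract does not state Mr-Moose's exact form, so the sketch below only shows the general idea under that assumption:

```python
import math

def loglike_point(model, obs, sigma):
    """Standard Gaussian log-likelihood for a detected flux."""
    return -0.5 * ((model - obs) / sigma) ** 2

def loglike_upper_limit(model, limit, sigma):
    """Continuous upper-limit term (an erf-based treatment, assumed here):
    the likelihood stays near 1 well below the limit and falls off smoothly
    as the model flux approaches and exceeds it, instead of imposing a hard
    accept/reject cut that samplers cannot cross."""
    return math.log(0.5 * math.erfc((model - limit) / (math.sqrt(2) * sigma)))

# A model far below the limit is barely penalized; one well above is strongly
# disfavoured, but the transition is smooth.
low = loglike_upper_limit(1.0, limit=10.0, sigma=1.0)
high = loglike_upper_limit(12.0, limit=10.0, sigma=1.0)
print(low > -0.01, high < -2.0)  # True True
```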

  3. Omicseq: a web-based search engine for exploring omics datasets

    Science.gov (United States)

    Sun, Xiaobo; Pittard, William S.; Xu, Tianlei; Chen, Li; Zwick, Michael E.; Jiang, Xiaoqian; Wang, Fusheng

    2017-01-01

    The development and application of high-throughput genomics technologies has resulted in massive quantities of diverse omics data that continue to accumulate rapidly. These rich datasets offer unprecedented and exciting opportunities to address long standing questions in biomedical research. However, our ability to explore and query the content of diverse omics data is very limited. Existing dataset search tools rely almost exclusively on the metadata. A text-based query for gene name(s) does not work well on datasets wherein the vast majority of their content is numeric. To overcome this barrier, we have developed Omicseq, a novel web-based platform that facilitates the easy interrogation of omics datasets holistically to improve ‘findability’ of relevant data. The core component of Omicseq is trackRank, a novel algorithm for ranking omics datasets that fully uses the numerical content of the dataset to determine relevance to the query entity. The Omicseq system is supported by a scalable and elastic, NoSQL database that hosts a large collection of processed omics datasets. In the front end, a simple, web-based interface allows users to enter queries and instantly receive search results as a list of ranked datasets deemed to be the most relevant. Omicseq is freely available at http://www.omicseq.org. PMID:28402462

  4. Nanoparticle-organic pollutant interaction dataset

    Data.gov (United States)

    U.S. Environmental Protection Agency — Dataset presents concentrations of organic pollutants, such as polyaromatic hydrocarbon compounds, in water samples. Water samples of known volume and concentration...

  5. Stable Flocking of Multiple Agents Based on Molecular Potential Field and Distributed Receding Horizon Control

    International Nuclear Information System (INIS)

    Zhang Yun-Peng; Duan Hai-Bin; Zhang Xiang-Yin

    2011-01-01

    A novel distributed control scheme to generate stable flocking motion for a group of agents is proposed. In this control scheme, a molecular potential field model is applied as the potential field function because of its smoothness and unique shape. The approach of distributed receding horizon control is adopted to drive each agent to find the optimal control input that lowers its potential at every step. Experimental results show that the proposed control scheme ensures that all agents eventually converge to a stable flocking formation with a common velocity, while collisions are avoided at the same time. (general)
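
    The abstract does not give the molecular potential's functional form. A Morse-style pair potential is a common molecular model with the smooth single-well shape described, so it is used below as a stand-in to illustrate how the potential's gradient supplies the short-range repulsion and long-range attraction that flocking control exploits:

```python
import numpy as np

def morse(r, d=1.0, eps=1.0, a=2.0):
    """Morse-style pair potential: steep repulsion inside the equilibrium
    distance d, and a smooth attractive well of depth eps around r = d."""
    e = np.exp(-a * (r - d))
    return eps * ((1.0 - e) ** 2 - 1.0)

def pair_force(ri, rj, d=1.0, eps=1.0, a=2.0):
    """Force on agent i from agent j: minus the gradient of morse()."""
    rij = ri - rj
    r = np.linalg.norm(rij)
    e = np.exp(-a * (r - d))
    dUdr = 2.0 * eps * a * e * (1.0 - e)
    return -dUdr * rij / r

# Agents closer than the equilibrium spacing repel; agents farther apart
# attract -- the two ingredients of collision-free, cohesive flocking.
ri, rj = np.array([0.0, 0.0]), np.array([0.6, 0.0])
f_close = pair_force(ri, rj)                    # pushes ri away from rj
f_far = pair_force(np.array([-2.0, 0.0]), rj)   # pulls ri toward rj
print(f_close[0] < 0, f_far[0] > 0)  # True True
```

    In the paper's scheme, each agent's receding-horizon optimizer would choose a control input that descends this summed potential over a short planning horizon, rather than taking the raw gradient step.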

  6. Distribution of hydrocarbon-utilizing microorganisms and hydrocarbon biodegradation potentials in Alaskan continental shelf areas

    International Nuclear Information System (INIS)

    Roubal, G.; Atlas, R.M.

    1978-01-01

    Hydrocarbon-utilizing microorganisms were enumerated from Alaskan continental shelf areas by using plate counts and a new most-probable-number procedure based on mineralization of 14C-labeled hydrocarbons. Hydrocarbon utilizers were ubiquitously distributed, with no significant overall concentration differences between sampling regions or between surface water and sediment samples. There were, however, significant seasonal differences in numbers of hydrocarbon utilizers. Distribution of hydrocarbon utilizers within Cook Inlet was positively correlated with occurrence of hydrocarbons in the environment. Hydrocarbon biodegradation potentials were measured by using 14C-radiolabeled hydrocarbon-spiked crude oil. There was no significant correlation between numbers of hydrocarbon utilizers and hydrocarbon biodegradation potentials. The biodegradation potentials showed large seasonal variations in the Beaufort Sea, probably due to seasonal depletion of available nutrients. Non-nutrient-limited biodegradation potentials followed the order hexadecane > naphthalene >> pristane > benzanthracene. In Cook Inlet, biodegradation potentials for hexadecane and naphthalene were dependent on availability of inorganic nutrients. Biodegradation potentials for pristane and benzanthracene were restricted, probably by resistance to attack by available enzymes in the indigenous population

  7. Large-scale Labeled Datasets to Fuel Earth Science Deep Learning Applications

    Science.gov (United States)

    Maskey, M.; Ramachandran, R.; Miller, J.

    2017-12-01

    Deep learning has revolutionized computer vision and natural language processing with various algorithms scaled using high-performance computing. However, generic large-scale labeled datasets such as the ImageNet are the fuel that drives the impressive accuracy of deep learning results. Large-scale labeled datasets already exist in domains such as medical science, but creating them in the Earth science domain is a challenge. While there are ways to apply deep learning using limited labeled datasets, there is a need in the Earth sciences for creating large-scale labeled datasets for benchmarking and scaling deep learning applications. At the NASA Marshall Space Flight Center, we are using deep learning for a variety of Earth science applications where we have encountered the need for large-scale labeled datasets. We will discuss our approaches for creating such datasets and why these datasets are just as valuable as deep learning algorithms. We will also describe successful usage of these large-scale labeled datasets with our deep learning based applications.

  8. An Affinity Propagation Clustering Algorithm for Mixed Numeric and Categorical Datasets

    Directory of Open Access Journals (Sweden)

    Kang Zhang

    2014-01-01

    Full Text Available Clustering has been widely used in different fields of science, technology, social science, and so forth. In the real world, numeric as well as categorical features are usually used to describe the data objects. Accordingly, many clustering methods can process datasets that are either numeric or categorical. Recently, algorithms that can handle mixed data clustering problems have been developed. The affinity propagation (AP) algorithm is an exemplar-based clustering method which has demonstrated good performance on a wide variety of datasets. However, it has limitations in processing mixed datasets. In this paper, we propose a novel similarity measure for mixed-type datasets and an adaptive AP clustering algorithm for clustering such mixed datasets. Several real-world datasets are studied to evaluate the performance of the proposed algorithm. Comparisons with other clustering algorithms demonstrate that the proposed method works well not only on mixed datasets but also on pure numeric and categorical datasets.
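
    The paper's similarity measure is not specified in the abstract; a Gower-style measure, sketched below, is one standard way to score mixed numeric/categorical records and is shown here only as an illustration of the general approach:

```python
import numpy as np

def mixed_similarity(X_num, X_cat):
    """Pairwise similarity for mixed data (a Gower-style sketch, not the
    paper's exact measure): numeric features contribute 1 - |xi - xj|/range,
    categorical features contribute 1 for an exact match and 0 otherwise;
    the final similarity averages over all features."""
    n = len(X_num)
    rng = X_num.max(axis=0) - X_num.min(axis=0)
    rng[rng == 0] = 1.0  # avoid division by zero on constant features
    p = X_num.shape[1] + X_cat.shape[1]
    S = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            s_num = (1.0 - np.abs(X_num[i] - X_num[j]) / rng).sum()
            s_cat = (X_cat[i] == X_cat[j]).sum()
            S[i, j] = (s_num + s_cat) / p
    return S

X_num = np.array([[1.0], [1.0], [9.0]])
X_cat = np.array([["a"], ["a"], ["b"]])
S = mixed_similarity(X_num, X_cat)
print(S[0, 1], S[0, 2])  # 1.0 0.0
```

    A precomputed matrix like `S` can then be handed to an affinity propagation implementation that accepts precomputed similarities.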

  9. Influence of the potential well and the potential barrier on the density distribution of confined-model fluids

    CERN Document Server

    Lee, B H; Lee, C H; Seong Baek Seok

    2000-01-01

    A density functional perturbative approximation, which is based on the density functional expansion of the one-particle direct correlation function of model fluids with respect to the bulk density, has been employed to investigate the influence of the potential well and the potential barrier on the density behavior of confined-model fluids. The mean spherical approximation has been used to calculate the two-particle direct correlation function of the model fluids. At lower densities, the density distributions are strongly affected by the barrier height and the well depth of the model potential, the contribution from the short-range repulsive part being especially important. However, the effects of the barrier height and the well depth of the model potential decrease with increasing bulk density. The calculated results also show that in the region where the effect of the wall-fluid interaction is relatively weak, the square-barrier part of the model potential leads to a nonuniformity in the density distributio...

  10. Mass Distribution and Gravitational Potential of the Milky Way

    Directory of Open Access Journals (Sweden)

    Ninković Slobodan

    2017-04-01

    Full Text Available Models of mass distribution in the Milky Way are discussed, with preference given to those yielding the potential analytically. It is noted that there are three main contributors to the Milky Way potential: bulge, disc and dark halo. In the case of the disc, the Miyamoto-Nagai formula, being simple enough, has proven a very good solution, but it has not been able to satisfy all requirements. Therefore, improvements, such as adding new terms or combining several Miyamoto-Nagai terms, have been attempted. Unlike the disc, in studying the bulge and dark halo the flattening is usually neglected, which offers the possibility of obtaining an exact solution of the Poisson equation. It is emphasized that the Hernquist formula, used very often for the bulge potential, is a special case of another formula, and the properties of that formula are analysed. In the case of the dark halo, the slopes of its cumulative mass for the inner and outer parts are explained through a new formalism presented here for the first time.
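
    The Miyamoto-Nagai disc potential mentioned above has the closed form Phi(R, z) = -G M / sqrt(R^2 + (a + sqrt(z^2 + b^2))^2), where a and b set the disc's radial scale and thickness. A quick numerical sketch (the mass and scale values are illustrative placeholders, not a fitted Milky Way model):

```python
import numpy as np

G = 4.300917270e-6  # gravitational constant in kpc * (km/s)^2 / Msun

def miyamoto_nagai(R, z, M=1.0e11, a=3.0, b=0.3):
    """Miyamoto-Nagai disc potential Phi(R, z) in (kpc, Msun, km/s) units.
    Parameter values here are illustrative, not a fitted Galaxy model."""
    return -G * M / np.sqrt(R**2 + (a + np.sqrt(z**2 + b**2))**2)

def circular_velocity(R, **kw):
    """v_c = sqrt(R * dPhi/dR) in the plane z = 0 (central difference)."""
    dR = 1e-6
    dPhi = (miyamoto_nagai(R + dR, 0.0, **kw)
            - miyamoto_nagai(R - dR, 0.0, **kw)) / (2 * dR)
    return np.sqrt(R * dPhi)

# Rotation speed at R ~ 8 kpc; roughly 200 km/s for these toy parameters
print(round(circular_velocity(8.0), 1))
```

    The analytic convenience praised in the abstract is visible here: both the potential and the rotation curve follow from one closed-form expression.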

  11. Secure and scalable deduplication of horizontally partitioned health data for privacy-preserving distributed statistical computation.

    Science.gov (United States)

    Yigzaw, Kassaye Yitbarek; Michalas, Antonis; Bellika, Johan Gustav

    2017-01-03

    Techniques have been developed to compute statistics on distributed datasets without revealing private information except the statistical results. However, duplicate records in a distributed dataset may lead to incorrect statistical results. Therefore, to increase the accuracy of the statistical analysis of a distributed dataset, secure deduplication is an important preprocessing step. We designed a secure protocol for the deduplication of horizontally partitioned datasets with deterministic record linkage algorithms. We provided a formal security analysis of the protocol in the presence of semi-honest adversaries. The protocol was implemented and deployed across three microbiology laboratories located in Norway, and we ran experiments on the datasets in which the number of records for each laboratory varied. Experiments were also performed on simulated microbiology datasets and data custodians connected through a local area network. The security analysis demonstrated that the protocol protects the privacy of individuals and data custodians under a semi-honest adversarial model. More precisely, the protocol remains secure with the collusion of up to N - 2 corrupt data custodians. The total runtime for the protocol scales linearly with the addition of data custodians and records. One million simulated records distributed across 20 data custodians were deduplicated within 45 s. The experimental results showed that the protocol is more efficient and scalable than previous protocols for the same problem. The proposed deduplication protocol is efficient and scalable for practical uses while protecting the privacy of patients and data custodians.
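
    The deterministic record linkage underlying such a protocol can be illustrated with keyed hashing: each custodian pseudonymizes a normalized quasi-identifier so that duplicates can be counted across sites without exposing identities. This is a deliberately simplified, single-shared-key sketch, not the paper's secure multi-party protocol (which additionally tolerates colluding custodians):

```python
import hmac
import hashlib

def pseudonymize(records, key):
    """Keyed-hash pseudonyms for deterministic record linkage.

    Each custodian applies HMAC-SHA256 with a shared secret key to a
    normalized quasi-identifier string, so a comparing party sees only
    opaque pseudonyms, never the identifiers themselves."""
    return {hmac.new(key, r.strip().lower().encode(), hashlib.sha256).hexdigest()
            for r in records}

key = b"shared-secret"  # in practice, agreed via a secure key exchange
lab_a = pseudonymize(["1990-01-01|Doe|John", "1985-05-05|Roe|Jane"], key)
lab_b = pseudonymize(["1990-01-01|doe|JOHN ", "1970-07-07|Poe|Ann"], key)
duplicates = lab_a & lab_b
print(len(duplicates))  # 1 -- the shared patient links despite case/space noise
```

    Normalization before hashing is what makes the linkage deterministic: trivially different spellings of the same quasi-identifier map to the same pseudonym.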

  12. Chemical product and function dataset

    Data.gov (United States)

    U.S. Environmental Protection Agency — Merged product weight fraction and chemical function data. This dataset is associated with the following publication: Isaacs , K., M. Goldsmith, P. Egeghy , K....

  13. Unleashing spatially distributed ecohydrology modeling using Big Data tools

    Science.gov (United States)

    Miles, B.; Idaszak, R.

    2015-12-01

    Physically based spatially distributed ecohydrology models are useful for answering science and management questions related to the hydrology and biogeochemistry of prairie, savanna, forested, as well as urbanized ecosystems. However, these models can produce hundreds of gigabytes of spatial output for a single model run over decadal time scales when run at regional spatial scales and moderate spatial resolutions (~100-km2+ at 30-m spatial resolution) or when run for small watersheds at high spatial resolutions (~1-km2 at 3-m spatial resolution). Numerical data formats such as HDF5 can store arbitrarily large datasets. However even in HPC environments, there are practical limits on the size of single files that can be stored and reliably backed up. Even when such large datasets can be stored, querying and analyzing these data can suffer from poor performance due to memory limitations and I/O bottlenecks, for example on single workstations where memory and bandwidth are limited, or in HPC environments where data are stored separately from computational nodes. The difficulty of storing and analyzing spatial data from ecohydrology models limits our ability to harness these powerful tools. Big Data tools such as distributed databases have the potential to surmount the data storage and analysis challenges inherent to large spatial datasets. Distributed databases solve these problems by storing data close to computational nodes while enabling horizontal scalability and fault tolerance. Here we present the architecture of and preliminary results from PatchDB, a distributed datastore for managing spatial output from the Regional Hydro-Ecological Simulation System (RHESSys). The initial version of PatchDB uses message queueing to asynchronously write RHESSys model output to an Apache Cassandra cluster. Once stored in the cluster, these data can be efficiently queried to quickly produce both spatial visualizations for a particular variable (e.g. maps and animations), as well

  14. General Purpose Multimedia Dataset - GarageBand 2008

    DEFF Research Database (Denmark)

    Meng, Anders

    This document describes a general-purpose multimedia dataset to be used in cross-media machine learning problems. In more detail, we describe the genre taxonomy applied at http://www.garageband.com, from where the dataset was collected, and how the taxonomy has been fused into a more human-understandable taxonomy. Finally, a description of various features extracted from both the audio and text is presented.

  15. Mapping the spatial distribution of global anthropogenic mercury atmospheric emission inventories

    Science.gov (United States)

    Wilson, Simon J.; Steenhuisen, Frits; Pacyna, Jozef M.; Pacyna, Elisabeth G.

    This paper describes the procedures employed to spatially distribute global inventories of anthropogenic emissions of mercury to the atmosphere, prepared by Pacyna, E.G., Pacyna, J.M., Steenhuisen, F., Wilson, S. [2006. Global anthropogenic mercury emission inventory for 2000. Atmospheric Environment, this issue, doi:10.1016/j.atmosenv.2006.03.041], and briefly discusses the results of this work. A new spatially distributed global emission inventory for the (nominal) year 2000, and a revised version of the 1995 inventory are presented. Emissions estimates for total mercury and major species groups are distributed within latitude/longitude-based grids with a resolution of 1×1° and 0.5×0.5°. A key component in the spatial distribution procedure is the use of population distribution as a surrogate parameter to distribute emissions from sources that cannot be accurately geographically located. In this connection, new gridded population datasets were prepared, based on the CIESIN GPW3 datasets (CIESIN, 2004. Gridded Population of the World (GPW), Version 3. Center for International Earth Science Information Network (CIESIN), Columbia University and Centro Internacional de Agricultura Tropical (CIAT). GPW3 data are available at http://beta.sedac.ciesin.columbia.edu/gpw/index.jsp). The spatially distributed emissions inventories and population datasets prepared in the course of this work are available on the Internet at www.amap.no/Resources/HgEmissions/
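
    The population-surrogate step described above reduces to allocating a national emission total across grid cells in proportion to gridded population. A minimal sketch with hypothetical numbers (the grid, totals, and function name are illustrative only):

```python
import numpy as np

def distribute_by_population(national_total, population_grid):
    """Spread a national emission total across grid cells in proportion to
    population -- the surrogate used when individual emission sources
    cannot be accurately geolocated."""
    pop = np.asarray(population_grid, dtype=float)
    return national_total * pop / pop.sum()

# Hypothetical 2x2 country grid (persons per cell) and a 100 t Hg total
pop = [[100.0, 300.0], [500.0, 100.0]]
grid = distribute_by_population(100.0, pop)
print(grid)        # per-cell emissions, proportional to population
print(grid.sum())  # 100.0 -- cell emissions sum back to the national total
```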

  16. Uncertainty of future projections of species distributions in mountainous regions.

    Directory of Open Access Journals (Sweden)

    Ying Tang

    Full Text Available Multiple factors introduce uncertainty into projections of species distributions under climate change. The uncertainty introduced by the choice of baseline climate information used to calibrate a species distribution model and to downscale global climate model (GCM simulations to a finer spatial resolution is a particular concern for mountainous regions, as the spatial resolution of climate observing networks is often insufficient to detect the steep climatic gradients in these areas. Using the maximum entropy (MaxEnt modeling framework together with occurrence data on 21 understory bamboo species distributed across the mountainous geographic range of the Giant Panda, we examined the differences in projected species distributions obtained from two contrasting sources of baseline climate information, one derived from spatial interpolation of coarse-scale station observations and the other derived from fine-spatial resolution satellite measurements. For each bamboo species, the MaxEnt model was calibrated separately for the two datasets and applied to 17 GCM simulations downscaled using the delta method. Greater differences in the projected spatial distributions of the bamboo species were observed for the models calibrated using the different baseline datasets than between the different downscaled GCM simulations for the same calibration. In terms of the projected future climatically-suitable area by species, quantification using a multi-factor analysis of variance suggested that the sum of the variance explained by the baseline climate dataset used for model calibration and the interaction between the baseline climate data and the GCM simulation via downscaling accounted for, on average, 40% of the total variation among the future projections. Our analyses illustrate that the combined use of gridded datasets developed from station observations and satellite measurements can help estimate the uncertainty introduced by the choice of baseline

  17. Omicseq: a web-based search engine for exploring omics datasets.

    Science.gov (United States)

    Sun, Xiaobo; Pittard, William S; Xu, Tianlei; Chen, Li; Zwick, Michael E; Jiang, Xiaoqian; Wang, Fusheng; Qin, Zhaohui S

    2017-07-03

    The development and application of high-throughput genomics technologies has resulted in massive quantities of diverse omics data that continue to accumulate rapidly. These rich datasets offer unprecedented and exciting opportunities to address long standing questions in biomedical research. However, our ability to explore and query the content of diverse omics data is very limited. Existing dataset search tools rely almost exclusively on the metadata. A text-based query for gene name(s) does not work well on datasets wherein the vast majority of their content is numeric. To overcome this barrier, we have developed Omicseq, a novel web-based platform that facilitates the easy interrogation of omics datasets holistically to improve 'findability' of relevant data. The core component of Omicseq is trackRank, a novel algorithm for ranking omics datasets that fully uses the numerical content of the dataset to determine relevance to the query entity. The Omicseq system is supported by a scalable and elastic, NoSQL database that hosts a large collection of processed omics datasets. In the front end, a simple, web-based interface allows users to enter queries and instantly receive search results as a list of ranked datasets deemed to be the most relevant. Omicseq is freely available at http://www.omicseq.org. © The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research.

  18. Quantifying uncertainty in observational rainfall datasets

    Science.gov (United States)

    Lennard, Chris; Dosio, Alessandro; Nikulin, Grigory; Pinto, Izidine; Seid, Hussen

    2015-04-01

    The CO-ordinated Regional Downscaling Experiment (CORDEX) has to date seen the publication of at least ten journal papers that examine the African domain during 2012 and 2013. Five of these papers consider Africa generally (Nikulin et al. 2012, Kim et al. 2013, Hernandes-Dias et al. 2013, Laprise et al. 2013, Panitz et al. 2013) and five have regional foci: Tramblay et al. (2013) on Northern Africa, Mariotti et al. (2014) and Gbobaniyi et al. (2013) on West Africa, Endris et al. (2013) on East Africa and Kalagnoumou et al. (2013) on southern Africa. A further three papers that the authors know about are under review. These papers all use observed rainfall and/or temperature data to evaluate/validate the regional model output and often proceed to assess projected changes in these variables due to climate change in the context of these observations. The most popular reference rainfall data used are the CRU, GPCP, GPCC, TRMM and UDEL datasets. However, as Kalagnoumou et al. (2013) point out, there are many other rainfall datasets available for consideration, for example CMORPH, FEWS, TAMSAT & RIANNAA, TAMORA and the WATCH & WATCH-DEI data. They, with others (Nikulin et al. 2012, Sylla et al. 2012), show that the observed datasets can have a very wide spread at a particular space-time coordinate. As more ground-, space- and reanalysis-based rainfall products become available, all of which use different methods to produce precipitation data, the selection of reference data is becoming an important factor in model evaluation. A number of factors can contribute to uncertainty in the reliability and validity of the datasets, such as radiance conversion algorithms, the quantity and quality of available station data, interpolation techniques and the blending methods used to combine satellite- and gauge-based products. However, to date no comprehensive study has been performed to evaluate the uncertainty in these observational datasets. We assess 18 gridded

  19. Scalar and Vector Spherical Harmonics for Assimilation of Global Datasets in the Ionosphere and Thermosphere

    Science.gov (United States)

    Miladinovich, D.; Datta-Barua, S.; Bust, G. S.; Ramirez, U.

    2017-12-01

    Understanding physical processes during storm time in the ionosphere-thermosphere (IT) system is limited, in part, due to the inability to obtain accurate estimates of IT states on a global scale. One reason for this inability is the sparsity of spatially distributed high quality data sets. Data assimilation is showing promise toward enabling global estimates by blending high quality observational data sets with established climate models. We are continuing development of an algorithm called Estimating Model Parameters for Ionospheric Reverse Engineering (EMPIRE) to enable assimilation of global datasets for storm time estimates of IT drivers. EMPIRE is a data assimilation algorithm that uses a Kalman filtering routine to ingest model and observational data. The EMPIRE algorithm is based on spherical harmonics which provide a spherically symmetric, smooth, continuous, and orthonormal set of basis functions suitable for a spherical domain such as Earth's IT region (200-600 km altitude). Once the basis function coefficients are determined, the newly fitted function represents the disagreement between observational measurements and models. We apply spherical harmonics to study the March 17, 2015 storm. Data sources include Fabry-Perot interferometer neutral wind measurements and global Ionospheric Data Assimilation 4 Dimensional (IDA4D) assimilated total electron content (TEC). Models include Weimer 2000 electric potential, International Geomagnetic Reference Field (IGRF) magnetic field, and Horizontal Wind Model 2014 (HWM14) neutral winds. We present the EMPIRE assimilation results of Earth's electric potential and thermospheric winds. We also compare EMPIRE storm time E cross B ion drift estimates to measured drifts produced from the Super Dual Auroral Radar Network (SuperDARN) and Active Magnetosphere and Planetary Electrodynamics Response Experiment (AMPERE) measurement datasets. The analysis from these results will enable the generation of globally assimilated
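The basis-function fit described in this record amounts to a least-squares expansion of scattered observations in spherical harmonics. A small sketch with scipy, using a synthetic low-degree field in place of real FPI wind or TEC data (the observation values and degree cutoff are illustrative assumptions):

```python
import numpy as np
from scipy.special import sph_harm

rng = np.random.default_rng(1)

# Scattered synthetic "observations" on the sphere.
n_obs = 500
theta = rng.uniform(0, 2 * np.pi, n_obs)   # azimuth (longitude)
phi = rng.uniform(0, np.pi, n_obs)         # colatitude
truth = 1.0 + 0.5 * np.cos(phi) + 0.3 * np.sin(phi) * np.cos(theta)
obs = truth + 0.05 * rng.standard_normal(n_obs)

# Real-valued spherical-harmonic basis up to degree L: Re/Im parts of Y_n^m.
L = 3
cols = []
for n in range(L + 1):
    for m in range(n + 1):
        Y = sph_harm(m, n, theta, phi)
        cols.append(Y.real)
        if m > 0:
            cols.append(Y.imag)
A = np.column_stack(cols)                  # (n_obs, (L+1)**2) design matrix

coef, *_ = np.linalg.lstsq(A, obs, rcond=None)
fit = A @ coef
rms = np.sqrt(np.mean((fit - obs) ** 2))
print(f"{A.shape[1]} basis functions, RMS residual {rms:.3f}")
```

The fitted coefficients give a smooth, continuous global function from sparse data, which is the role spherical harmonics play in the EMPIRE assimilation.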

  20. [Prediction of potential geographic distribution of Lyme disease in Qinghai province with Maximum Entropy model].

    Science.gov (United States)

    Zhang, Lin; Hou, Xuexia; Liu, Huixin; Liu, Wei; Wan, Kanglin; Hao, Qin

    2016-01-01

    To predict the potential geographic distribution of Lyme disease in Qinghai province by using the Maximum Entropy model (MaxEnt). The sero-diagnosis data of Lyme disease in 6 counties (Huzhu, Zeku, Tongde, Datong, Qilian and Xunhua) and environmental and anthropogenic data, including altitude, human footprint, normalized difference vegetation index (NDVI) and temperature, in Qinghai province since 1990 were collected. Using the data of Huzhu, Zeku and Tongde, the potential distribution of Lyme disease in Qinghai was predicted with MaxEnt. The prediction results were compared with the human sero-prevalence of Lyme disease in Datong, Qilian and Xunhua counties in Qinghai. Three hot spots of Lyme disease were predicted in Qinghai, all in the eastern forest areas. Furthermore, NDVI played the most important role in the model prediction, followed by human footprint. Datong, Qilian and Xunhua counties are all in eastern Qinghai: Xunhua lies in hot spot area II, Datong is close to the north of hot spot area III, while Qilian, with the lowest sero-prevalence of Lyme disease, is not in a hot spot area. The data were well modeled by MaxEnt (Area Under Curve=0.980). The actual distribution of Lyme disease in Qinghai was consistent with the results of the model prediction. MaxEnt could be used to predict the potential distribution patterns of Lyme disease. The distribution of vegetation and the range and intensity of human activity might be related to the Lyme disease distribution.
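The workflow in this record (presence records plus environmental layers, evaluated by AUC) can be sketched with a logistic-regression stand-in for MaxEnt on synthetic data; the predictor names, coefficients, and data below are all hypothetical, and a real study would use the MaxEnt software itself on presence/background points:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(2)

# Hypothetical standardised predictors per grid cell: NDVI, human footprint,
# altitude. Presence probability is driven mainly by NDVI, then footprint,
# mirroring the variable ranking reported in the abstract (synthetic values).
n = 1000
X = rng.standard_normal((n, 3))
logit = 2.0 * X[:, 0] + 1.0 * X[:, 1] - 0.3 * X[:, 2]
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-logit))).astype(int)

model = LogisticRegression().fit(X, y)
auc = roc_auc_score(y, model.predict_proba(X)[:, 1])
print(f"training AUC: {auc:.3f}")
```

High AUC on held-out or external data (here, the three validation counties) is what supports the "well modeled" claim.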

  1. PHYSICS PERFORMANCE AND DATASET (PPD)

    CERN Multimedia

    L. Silvestris

    2012-01-01

      Introduction The first part of the year presented an important test for the new Physics Performance and Dataset (PPD) group (cf. its mandate: http://cern.ch/go/8f77). The activity was focused on the validation of the new releases meant for the Monte Carlo (MC) production and the data-processing in 2012 (CMSSW 50X and 52X), and on the preparation of the 2012 operations. In view of the Chamonix meeting, the PPD and physics groups worked to understand the impact of the higher pile-up scenario on some of the flagship Higgs analyses to better quantify the impact of the high luminosity on the CMS physics potential. A task force is working on the optimisation of the reconstruction algorithms and on the code to cope with the performance requirements imposed by the higher event occupancy as foreseen for 2012. Concerning the preparation for the analysis of the new data, a new MC production has been prepared. The new samples, simulated at 8 TeV, are already being produced and the digitisation and recons...

  2. The wildland-urban interface raster dataset of Catalonia.

    Science.gov (United States)

    Alcasena, Fermín J; Evers, Cody R; Vega-Garcia, Cristina

    2018-04-01

    We provide the wildland-urban interface (WUI) map of the autonomous community of Catalonia (Northeastern Spain). The map encompasses an area of some 3.21 million ha and is presented as a 150-m resolution raster dataset. Individual housing location, structure density and vegetation cover data were used to spatially assess in detail the interface, intermix and dispersed rural WUI communities with a geographical information system. Most WUI areas are concentrated in the coastal belt, where suburban sprawl has occurred near or within unmanaged forests. This geospatial dataset approximates the potential for residential housing loss in a wildfire, and represents a valuable contribution to landscape and urban planning in the region.

  3. Turkey Run Landfill Emissions Dataset

    Data.gov (United States)

    U.S. Environmental Protection Agency — landfill emissions measurements for the Turkey run landfill in Georgia. This dataset is associated with the following publication: De la Cruz, F., R. Green, G....

  4. Data Discovery of Big and Diverse Climate Change Datasets - Options, Practices and Challenges

    Science.gov (United States)

    Palanisamy, G.; Boden, T.; McCord, R. A.; Frame, M. T.

    2013-12-01

    Developing data search tools is a very common, but often confusing, task for most data intensive scientific projects. These search interfaces need to be continually improved to handle the ever increasing diversity and volume of data collections. Many aspects determine the type of search tool a project needs to provide to its user community. These include: number of datasets; amount and consistency of discovery metadata; ancillary information such as availability of quality information and provenance; and availability of similar datasets from other distributed sources. The Environmental Data Science and Systems (EDSS) group within the Environmental Science Division at the Oak Ridge National Laboratory has a long history of successfully managing diverse and big observational datasets for various scientific programs via various data centers such as DOE's Atmospheric Radiation Measurement Program (ARM), DOE's Carbon Dioxide Information and Analysis Center (CDIAC), USGS's Core Science Analytics and Synthesis (CSAS) metadata Clearinghouse and NASA's Distributed Active Archive Center (ORNL DAAC). This talk will showcase some of the recent developments for improving data discovery within these centers. The DOE ARM program recently developed a data discovery tool which allows users to search and discover over 4000 observational datasets. These datasets are key to the research efforts related to global climate change. The ARM discovery tool features many new functions such as filtered and faceted search logic, multi-pass data selection, filtering data based on data quality, graphical views of data quality and availability, direct access to data quality reports, and data plots. The ARM Archive also provides discovery metadata to other broader metadata clearinghouses such as ESGF, IASOA, and GOS. In addition to the new interface, ARM is also currently working on providing DOI metadata records to publishers such as Thomson Reuters and Elsevier.
The ARM

  5. Topic modeling for cluster analysis of large biological and medical datasets.

    Science.gov (United States)

    Zhao, Weizhong; Zou, Wen; Chen, James J

    2014-01-01

    The big data moniker is nowhere better deserved than in describing the ever-increasing prodigiousness and complexity of biological and medical datasets. New methods are needed to generate and test hypotheses, foster biological interpretation, and build validated predictors. Although multivariate techniques such as cluster analysis may allow researchers to identify groups, or clusters, of related variables, the accuracy and effectiveness of traditional clustering methods diminish for large, high-dimensional datasets. Topic modeling is an active research field in machine learning and has mainly been used as an analytical tool to structure large textual corpora for data mining. Its ability to reduce high dimensionality to a small number of latent variables makes it suitable for clustering, or for overcoming clustering difficulties, in large biological and medical datasets. In this study, three topic-model-derived clustering methods (highest probable topic assignment, feature selection and feature extraction) are proposed and tested on the cluster analysis of three large datasets: a Salmonella pulsed-field gel electrophoresis (PFGE) dataset, a lung cancer dataset, and a breast cancer dataset, which represent various types of large biological or medical datasets. All three methods are shown to improve the effectiveness of clustering on the three datasets in comparison to traditional methods. A preferable cluster analysis method emerged for each of the three datasets on the basis of replicating known biological truths. Topic modeling could be advantageously applied to the large datasets of biological or medical research. The three proposed topic-model-derived clustering methods yield clustering improvements for the three different data types. Clusters more efficaciously represent truthful groupings and subgroupings in the data than traditional methods, suggesting
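The first of the three methods, highest probable topic assignment, can be sketched with scikit-learn: fit a topic model to a count matrix and assign each sample to its dominant topic. The two planted groups below are synthetic stand-ins for, say, PFGE band-pattern profiles:

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(3)

# Synthetic count matrix (samples x features) with two planted groups that
# use disjoint sets of features heavily (hypothetical data).
group_a = rng.poisson(lam=[5, 5, 5, 1, 1, 1], size=(40, 6))
group_b = rng.poisson(lam=[1, 1, 1, 5, 5, 5], size=(40, 6))
X = np.vstack([group_a, group_b])

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
doc_topic = lda.transform(X)            # per-sample topic proportions
clusters = doc_topic.argmax(axis=1)     # highest-probable-topic assignment

print(np.bincount(clusters[:40], minlength=2),
      np.bincount(clusters[40:], minlength=2))
```

The feature-selection and feature-extraction variants instead feed the topic proportions (or the top topic words) into a conventional clustering algorithm.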

  6. An Analysis of the GTZAN Music Genre Dataset

    DEFF Research Database (Denmark)

    Sturm, Bob L.

    2012-01-01

    Most research in automatic music genre recognition has used the dataset assembled by Tzanetakis et al. in 2001. The composition and integrity of this dataset, however, has never been formally analyzed. For the first time, we provide an analysis of its composition, and create a machine...

  7. Dataset definition for CMS operations and physics analyses

    Science.gov (United States)

    Franzoni, Giovanni; Compact Muon Solenoid Collaboration

    2016-04-01

    Data recorded at the CMS experiment are funnelled into streams, integrated in the HLT menu, and further organised in a hierarchical structure of primary datasets and secondary datasets/dedicated skims. Datasets are defined according to the final-state particles reconstructed by the high level trigger, the data format and the use case (physics analysis, alignment and calibration, performance studies). During the first LHC run, new workflows have been added to this canonical scheme, to best exploit the flexibility of the CMS trigger and data acquisition systems. The concepts of data parking and data scouting have been introduced to extend the physics reach of CMS, offering the opportunity of defining physics triggers with extremely loose selections (e.g. a dijet resonance trigger collecting data at a rate of 1 kHz). In this presentation, we review the evolution of the dataset definition during LHC run I, and we discuss the plans for run II.
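The trigger-based bookkeeping described here can be sketched as a mapping from fired HLT paths to primary datasets; all path and dataset names below are illustrative inventions, not the actual CMS trigger menu:

```python
# Hypothetical HLT-path-to-primary-dataset mapping (illustrative names only).
PRIMARY_DATASETS = {
    "SingleMuon": ["HLT_IsoMu24"],
    "DoubleEG":   ["HLT_DoubleEle25"],
    "JetHT":      ["HLT_PFHT900"],
}

def assign_datasets(fired_paths):
    """Return every primary dataset whose trigger list intersects the fired paths."""
    fired = set(fired_paths)
    return sorted(ds for ds, paths in PRIMARY_DATASETS.items()
                  if fired & set(paths))

print(assign_datasets(["HLT_IsoMu24", "HLT_PFHT900"]))
# → ['JetHT', 'SingleMuon']: an event can enter more than one primary dataset.
```

Parked and scouting streams extend this scheme with looser paths whose events are stored for deferred or reduced-format processing.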

  8. Dataset definition for CMS operations and physics analyses

    CERN Document Server

    AUTHOR|(CDS)2051291

    2016-01-01

    Data recorded at the CMS experiment are funnelled into streams, integrated in the HLT menu, and further organised in a hierarchical structure of primary datasets, secondary datasets, and dedicated skims. Datasets are defined according to the final-state particles reconstructed by the high level trigger, the data format and the use case (physics analysis, alignment and calibration, performance studies). During the first LHC run, new workflows have been added to this canonical scheme, to best exploit the flexibility of the CMS trigger and data acquisition systems. The concepts of data parking and data scouting have been introduced to extend the physics reach of CMS, offering the opportunity of defining physics triggers with extremely loose selections (e.g. a dijet resonance trigger collecting data at a rate of 1 kHz). In this presentation, we review the evolution of the dataset definition during the first run, and we discuss the plans for the second LHC run.

  9. Condensing Massive Satellite Datasets For Rapid Interactive Analysis

    Science.gov (United States)

    Grant, G.; Gallaher, D. W.; Lv, Q.; Campbell, G. G.; Fowler, C.; LIU, Q.; Chen, C.; Klucik, R.; McAllister, R. A.

    2015-12-01

    Our goal is to enable users to interactively analyze massive satellite datasets, identifying anomalous data or values that fall outside of thresholds. To achieve this, the project seeks to create a derived database containing only the most relevant information, accelerating the analysis process. The database is designed to be an ancillary tool for the researcher, not an archival database to replace the original data. This approach is aimed at improving performance by reducing the overall size by way of condensing the data. The primary challenges of the project include: - The nature of the research question(s) may not be known ahead of time. - The thresholds for determining anomalies may be uncertain. - Problems associated with processing cloudy, missing, or noisy satellite imagery. - The contents and method of creation of the condensed dataset must be easily explainable to users. The architecture of the database will reorganize spatially-oriented satellite imagery into temporally-oriented columns of data (a.k.a., "data rods") to facilitate time-series analysis. The database itself is an open-source parallel database, designed to make full use of clustered server technologies. A demonstration of the system capabilities will be shown. Applications for this technology include quick-look views of the data, as well as the potential for on-board satellite processing of essential information, with the goal of reducing data latency.
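The "data rods" reorganisation this record describes is, at its core, a transpose from image-major to pixel-major layout so that each pixel's time series is contiguous. A minimal numpy sketch (array sizes are arbitrary stand-ins for real satellite imagery):

```python
import numpy as np

rng = np.random.default_rng(4)

# Spatially-oriented imagery: one 2-D grid per time step, shape (time, y, x).
cube = rng.random((365, 180, 360))

# "Data rods": reorganise so each (y, x) pixel holds a contiguous time series,
# which makes per-pixel threshold and anomaly scans cheap.
rods = np.ascontiguousarray(cube.transpose(1, 2, 0))   # (y, x, time)

series = rods[90, 180]                 # one pixel's full year in one slice
anomalous = np.abs(series - series.mean()) > 3 * series.std()
print(series.shape, int(anomalous.sum()))
```

In the actual system this layout lives in a parallel database rather than in memory, but the access pattern it optimises is the same.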

  10. Dataset of NRDA emission data

    Data.gov (United States)

    U.S. Environmental Protection Agency — Emissions data from open air oil burns. This dataset is associated with the following publication: Gullett, B., J. Aurell, A. Holder, B. Mitchell, D. Greenwell, M....

  11. Medical Image Data and Datasets in the Era of Machine Learning-Whitepaper from the 2016 C-MIMI Meeting Dataset Session.

    Science.gov (United States)

    Kohli, Marc D; Summers, Ronald M; Geis, J Raymond

    2017-08-01

    At the first annual Conference on Machine Intelligence in Medical Imaging (C-MIMI), held in September 2016, a conference session on medical image data and datasets for machine learning identified multiple issues. The common theme from attendees was that everyone participating in medical image evaluation with machine learning is data starved. There is an urgent need to find better ways to collect, annotate, and reuse medical imaging data. Unique domain issues with medical image datasets require further study, development, and dissemination of best practices and standards, and a coordinated effort among medical imaging domain experts, medical imaging informaticists, government and industry data scientists, and interested commercial, academic, and government entities. High-level attributes of reusable medical image datasets suitable to train, test, validate, verify, and regulate ML products should be better described. NIH and other government agencies should promote and, where applicable, enforce access to medical image datasets. We should improve communication among medical imaging domain experts, medical imaging informaticists, academic clinical and basic science researchers, government and industry data scientists, and interested commercial entities.

  12. Discovery and Reuse of Open Datasets: An Exploratory Study

    Directory of Open Access Journals (Sweden)

    Sara

    2016-07-01

    Full Text Available Objective: This article analyzes twenty cited or downloaded datasets and the repositories that house them, in order to produce insights that can be used by academic libraries to encourage discovery and reuse of research data in institutional repositories. Methods: Using Thomson Reuters’ Data Citation Index and repository download statistics, we identified twenty cited/downloaded datasets. We documented the characteristics of the cited/downloaded datasets and their corresponding repositories in a self-designed rubric. The rubric includes six major categories: basic information; funding agency and journal information; linking and sharing; factors to encourage reuse; repository characteristics; and data description. Results: Our small-scale study suggests that cited/downloaded datasets generally comply with basic recommendations for facilitating reuse: data are documented well; formatted for use with a variety of software; and shared in established, open access repositories. Three significant factors also appear to contribute to dataset discovery: publishing in discipline-specific repositories; indexing in more than one location on the web; and using persistent identifiers. The cited/downloaded datasets in our analysis came from a few specific disciplines, and tended to be funded by agencies with data publication mandates. Conclusions: The results of this exploratory research provide insights that can inform academic librarians as they work to encourage discovery and reuse of institutional datasets. Our analysis also suggests areas in which academic librarians can target open data advocacy in their communities in order to begin to build open data success stories that will fuel future advocacy efforts.

  13. A synthetic dataset for evaluating soft and hard fusion algorithms

    Science.gov (United States)

    Graham, Jacob L.; Hall, David L.; Rimland, Jeffrey

    2011-06-01

    There is an emerging demand for the development of data fusion techniques and algorithms that are capable of combining conventional "hard" sensor inputs such as video, radar, and multispectral sensor data with "soft" data including textual situation reports, open-source web information, and "hard/soft" data such as image or video data that includes human-generated annotations. New techniques that assist in sense-making over a wide range of vastly heterogeneous sources are critical to improving tactical situational awareness in counterinsurgency (COIN) and other asymmetric warfare situations. A major challenge in this area is the lack of realistic datasets available for test and evaluation of such algorithms. While "soft" message sets exist, they tend to be of limited use for data fusion applications due to the lack of critical message pedigree and other metadata. They also lack corresponding hard sensor data that presents reasonable "fusion opportunities" to evaluate the ability to make connections and inferences that span the soft and hard data sets. This paper outlines the design methodologies, content, and some potential use cases of a COIN-based synthetic soft and hard dataset created under a United States Multi-disciplinary University Research Initiative (MURI) program funded by the U.S. Army Research Office (ARO). The dataset includes realistic synthetic reports from a variety of sources, corresponding synthetic hard data, and an extensive supporting database that maintains "ground truth" through logical grouping of related data into "vignettes." The supporting database also maintains the pedigree of messages and other critical metadata.

  14. Visualization of conserved structures by fusing highly variable datasets.

    Science.gov (United States)

    Silverstein, Jonathan C; Chhadia, Ankur; Dech, Fred

    2002-01-01

    Skill, effort, and time are required to identify and visualize anatomic structures in three-dimensions from radiological data. Fundamentally, automating these processes requires a technique that uses symbolic information not in the dynamic range of the voxel data. We were developing such a technique based on mutual information for automatic multi-modality image fusion (MIAMI Fuse, University of Michigan). This system previously demonstrated facility at fusing one voxel dataset with integrated symbolic structure information to a CT dataset (different scale and resolution) from the same person. The next step of development of our technique was aimed at accommodating the variability of anatomy from patient to patient by using warping to fuse our standard dataset to arbitrary patient CT datasets. A standard symbolic information dataset was created from the full color Visible Human Female by segmenting the liver parenchyma, portal veins, and hepatic veins and overwriting each set of voxels with a fixed color. Two arbitrarily selected patient CT scans of the abdomen were used for reference datasets. We used the warping functions in MIAMI Fuse to align the standard structure data to each patient scan. The key to successful fusion was the focused use of multiple warping control points that place themselves around the structure of interest automatically. The user assigns only a few initial control points to align the scans. Fusion 1 and 2 transformed the atlas with 27 points around the liver to CT1 and CT2 respectively. Fusion 3 transformed the atlas with 45 control points around the liver to CT1 and Fusion 4 transformed the atlas with 5 control points around the portal vein. The CT dataset is augmented with the transformed standard structure dataset, such that the warped structure masks are visualized in combination with the original patient dataset. This combined volume visualization is then rendered interactively in stereo on the ImmersaDesk in an immersive Virtual

  15. Dataset - Evaluation of Standardized Sample Collection, Packaging, and Decontamination Procedures to Assess Cross-Contamination Potential during Bacillus anthracis Incident Response Operations

    Data.gov (United States)

    U.S. Environmental Protection Agency — Spore recovery data during sample packaging decontamination tests. This dataset is associated with the following publication: Calfee, W., J. Tufts, K. Meyer, K....

  16. An Annotated Dataset of 14 Cardiac MR Images

    DEFF Research Database (Denmark)

    Stegmann, Mikkel Bille

    2002-01-01

    This note describes a dataset consisting of 14 annotated cardiac MR images. Points of correspondence are placed on each image at the left ventricle (LV). As such, the dataset can be readily used for building statistical models of shape. Further, format specifications and terms of use are given....

  17. Dataset - Adviesregel PPL 2010

    NARCIS (Netherlands)

    Evert, van F.K.; Schans, van der D.A.; Geel, van W.C.A.; Slabbekoorn, J.J.; Booij, R.; Jukema, J.N.; Meurs, E.J.J.; Uenk, D.

    2011-01-01

    This dataset contains experimental data from a number of field experiments with potato in The Netherlands (Van Evert et al., 2011). The data are presented as an SQL dump of a PostgreSQL database (version 8.4.4). An outline of the entity-relationship diagram of the database is given in an

  18. Geostatistical exploration of dataset assessing the heavy metal contamination in Ewekoro limestone, Southwestern Nigeria

    Directory of Open Access Journals (Sweden)

    Kehinde D. Oyeyemi

    2017-10-01

    Full Text Available The dataset for this article contains geostatistical analysis of heavy metals contamination from limestone samples collected from the Ewekoro Formation in the eastern Dahomey basin, Ogun State, Nigeria. The samples were manually collected and analysed using a Microwave Plasma Atomic Absorption Spectrometer (MPAS). Analysis of the twenty different samples showed different levels of heavy metals concentration. The nine analysed elements are Arsenic, Mercury, Cadmium, Cobalt, Chromium, Nickel, Lead, Vanadium and Zinc. Descriptive statistics were used to explore the heavy metal concentrations individually. Pearson, Kendall tau and Spearman rho correlation coefficients were used to establish the relationships among the elements, and the analysis of variance showed that there is a significant difference in the mean distribution of the heavy metals concentration within and between the groups of the 20 samples analysed. The dataset can provide insights into the health implications of the contaminants, especially when the mean concentration levels of the heavy metals are compared with the recommended regulatory limit concentration.
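The three correlation measures and the one-way ANOVA named in this record are all available in scipy; the sketch below runs them on synthetic concentrations (the element values are invented, not the article's data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

# Hypothetical concentrations (mg/kg) of three of the nine elements in 20 samples.
pb = rng.lognormal(mean=1.0, sigma=0.3, size=20)        # lead
zn = 2.0 * pb + rng.normal(scale=0.5, size=20)          # zinc, correlated with Pb
cr = rng.lognormal(mean=0.5, sigma=0.3, size=20)        # chromium, independent

# The three correlation coefficients used in the article.
pearson, _ = stats.pearsonr(pb, zn)
kendall, _ = stats.kendalltau(pb, zn)
spearman, _ = stats.spearmanr(pb, zn)

# One-way ANOVA on mean concentration across the element groups.
f_stat, p_value = stats.f_oneway(pb, zn, cr)
print(f"r={pearson:.2f} tau={kendall:.2f} rho={spearman:.2f} p={p_value:.3g}")
```

Rank-based measures (Kendall, Spearman) are the safer choice for skewed, lognormal-like concentration data, which is presumably why all three were reported.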

  19. Tension in the recent Type Ia supernovae datasets

    International Nuclear Information System (INIS)

    Wei, Hao

    2010-01-01

    In the present work, we investigate the tension in the recent Type Ia supernovae (SNIa) datasets Constitution and Union. We show that they are in tension not only with the observations of the cosmic microwave background (CMB) anisotropy and the baryon acoustic oscillations (BAO), but also with other SNIa datasets such as Davis and SNLS. Then, we find the main sources responsible for the tension. Further, we make this more robust by employing the method of random truncation. Based on the results of this work, we suggest two truncated versions of the Union and Constitution datasets, namely the UnionT and ConstitutionT SNIa samples, whose behaviors are more regular.

  20. Viability of Controlling Prosthetic Hand Utilizing Electroencephalograph (EEG) Dataset Signal

    Science.gov (United States)

    Miskon, Azizi; A/L Thanakodi, Suresh; Raihan Mazlan, Mohd; Mohd Haziq Azhar, Satria; Nooraya Mohd Tawil, Siti

    2016-11-01

    This project presents the development of an artificial hand controlled by electroencephalograph (EEG) signal datasets for prosthetic application. The EEG signal datasets were used to improve the control of the prosthetic hand compared to electromyograph (EMG) control. EMG has disadvantages for a person who has not used the relevant muscles for a long time, and for persons with degenerative issues due to age; the EEG datasets were therefore explored as an alternative to EMG. The datasets used in this work were taken from a Brain Computer Interface (BCI) project and were already classified for open, close and combined movement operations. They served as input to control the prosthetic hand through an interface between Microsoft Visual Studio and Arduino. The obtained results reveal the prosthetic hand to be more efficient and faster in response to the EEG datasets with an additional LiPo (lithium polymer) battery attached to the prosthetic. Some limitations were also identified in terms of hand movements and the weight of the prosthetic, and suggestions for improvement are given in this paper. Overall, the objective of this paper was achieved: the prosthetic hand was found to be feasible in operation utilizing the EEG datasets.

  1. A procedure to validate and correct the ¹³C chemical shift calibration of RNA datasets

    Energy Technology Data Exchange (ETDEWEB)

    Aeschbacher, Thomas; Schubert, Mario, E-mail: schubert@mol.biol.ethz.ch; Allain, Frederic H.-T., E-mail: allain@mol.biol.ethz.ch [ETH Zuerich, Institute for Molecular Biology and Biophysics (Switzerland)

    2012-02-15

    Chemical shifts reflect the structural environment of a certain nucleus and can be used to extract structural and dynamic information. Proper calibration is indispensable to extract such information from chemical shifts. Whereas a variety of procedures exist to verify the chemical shift calibration for proteins, no such procedure is available for RNAs to date. We present here a procedure to analyze and correct the calibration of ¹³C NMR data of RNAs. Our procedure uses five ¹³C chemical shifts as a reference, each of them found in a narrow shift range in most datasets deposited in the Biological Magnetic Resonance Bank. In 49 datasets we could evaluate the ¹³C calibration and detect errors or inconsistencies in RNA ¹³C chemical shifts based on these chemical shift reference values. More than half of the datasets (27 out of those 49) were found to be improperly referenced or contained inconsistencies. This large inconsistency rate possibly explains why no clear structure-¹³C chemical shift relationship has emerged for RNA so far. We were able to recalibrate or correct 17 datasets, resulting in 39 usable ¹³C datasets. Six new datasets from our lab were used to verify our method, increasing the database to 45 usable datasets. We can now search for structure-chemical shift relationships with this improved list of ¹³C chemical shift data. This is demonstrated by a clear relationship between ribose ¹³C shifts and the sugar pucker, which can be used to predict a C2′- or C3′-endo conformation of the ribose with high accuracy. The improved quality of the chemical shift data allows statistical analysis with the potential to facilitate assignment procedures, and the extraction of restraints for structure calculations of RNA.
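The correction step described here reduces to estimating a constant referencing offset from the five narrow-range reference shifts and subtracting it. A sketch under stated assumptions: the expected reference values below are invented placeholders (the real ones are defined in the article), as is the 0.3 ppm tolerance:

```python
import numpy as np

# Hypothetical expected values (ppm) for five narrow-range 13C reference
# shifts; the actual reference values are those defined in the article.
EXPECTED = np.array([141.0, 109.5, 97.5, 92.5, 65.0])

def recalibrate(observed_refs, shifts, tol=0.3):
    """Estimate a constant calibration offset from the reference nuclei and,
    if it exceeds `tol` ppm, subtract it from the whole shift list."""
    offset = float(np.mean(np.asarray(observed_refs) - EXPECTED))
    if abs(offset) > tol:
        return np.asarray(shifts) - offset, offset
    return np.asarray(shifts), offset

# A dataset mis-referenced by +2.1 ppm (synthetic example).
observed = EXPECTED + 2.1 + np.random.default_rng(6).normal(scale=0.05, size=5)
corrected, off = recalibrate(observed, observed)
print(f"estimated offset: {off:.2f} ppm")
```

Averaging over several reference nuclei makes the offset estimate robust to the small structural scatter within each reference range.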

  2. Technical note: An inorganic water chemistry dataset (1972–2011 ...

    African Journals Online (AJOL)

    A national dataset of inorganic chemical data of surface waters (rivers, lakes, and dams) in South Africa is presented and made freely available. The dataset comprises more than 500 000 complete water analyses from 1972 up to 2011, collected from more than 2 000 sample monitoring stations in South Africa. The dataset ...

  3. A Bayesian trans-dimensional approach for the fusion of multiple geophysical datasets

    Science.gov (United States)

    JafarGandomi, Arash; Binley, Andrew

    2013-09-01

    We propose a Bayesian fusion approach to integrate multiple geophysical datasets with different coverage and sensitivity. The fusion strategy is based on the capability of various geophysical methods to provide enough resolution to identify either subsurface material parameters or subsurface structure, or both. We focus on electrical resistivity as the target material parameter and electrical resistivity tomography (ERT), electromagnetic induction (EMI), and ground penetrating radar (GPR) as the set of geophysical methods. However, extending the approach to different sets of geophysical parameters and methods is straightforward. Different geophysical datasets are entered into a trans-dimensional Markov chain Monte Carlo (McMC) search-based joint inversion algorithm. The trans-dimensional property of the McMC algorithm allows dynamic parameterisation of the model space, which in turn helps to avoid bias of the post-inversion results towards a particular model. Given that we are attempting to develop an approach that has practical potential, we discretize the subsurface into an array of one-dimensional earth-models. Accordingly, the ERT data that are collected by using two-dimensional acquisition geometry are recast to a set of equivalent vertical electric soundings. Different data are inverted either individually or jointly to estimate one-dimensional subsurface models at discrete locations. We use Shannon's information measure to quantify the information obtained from the inversion of different combinations of geophysical datasets. Information from multiple methods is brought together via introducing a joint likelihood function and/or constraining the prior information. A Bayesian maximum entropy approach is used for spatial fusion of spatially dispersed estimated one-dimensional models and mapping of the target parameter. We illustrate the approach with a synthetic dataset and then apply it to a field dataset. We show that the proposed fusion strategy is
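Two ingredients of this approach, the joint likelihood and Shannon's information measure, can be illustrated on a single discretised cell. The sketch below is a toy, fixed-dimension version (the real algorithm is trans-dimensional McMC); the two Gaussian likelihoods standing in for ERT- and EMI-derived constraints are assumptions:

```python
import numpy as np

# Discretised log10-resistivity values for one 1-D cell (coarse grid).
rho = np.linspace(0, 4, 401)
prior = np.ones_like(rho) / rho.size          # flat prior

def gaussian_like(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2)

# Two hypothetical datasets sensing the same cell with different resolution.
like_ert = gaussian_like(rho, mu=2.0, sigma=0.5)
like_emi = gaussian_like(rho, mu=2.2, sigma=0.3)

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Joint likelihood: multiply the per-dataset likelihoods, then normalise.
post_joint = prior * like_ert * like_emi
post_joint /= post_joint.sum()

# Shannon information gained by fusing both datasets versus the prior.
info_gain = entropy(prior) - entropy(post_joint)
print(f"information gain: {info_gain:.2f} bits")
```

The entropy reduction quantifies, in bits, how much a given combination of datasets narrows the resistivity estimate, which is how different fusion combinations can be compared.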

  4. Wind and wave dataset for Matara, Sri Lanka

    Science.gov (United States)

    Luo, Yao; Wang, Dongxiao; Priyadarshana Gamage, Tilak; Zhou, Fenghua; Madusanka Widanage, Charith; Liu, Taiwei

    2018-01-01

    We present a continuous in situ hydro-meteorology observational dataset from a set of instruments first deployed in December 2012 in the south of Sri Lanka, facing toward the north Indian Ocean. In these waters, simultaneous records of wind and wave data are sparse due to difficulties in deploying measurement instruments, although the area hosts one of the busiest shipping lanes in the world. This study describes the survey, deployment, and measurements of wind and waves, with the aim of offering future users of the dataset the most comprehensive information possible. This dataset advances our understanding of the nearshore hydrodynamic processes and wave climate, including sea waves and swells, in the north Indian Ocean. Moreover, it is a valuable resource for ocean model parameterization and validation. The archived dataset (Table 1) is examined in detail, including wave data at two locations with water depths of 20 and 10 m comprising synchronous time series of wind, ocean astronomical tide, air pressure, etc. In addition, we use these wave observations to evaluate the ERA-Interim reanalysis product. Based on Buoy 2 data, the swells are the main component of waves year-round, although monsoons can markedly alter the proportion between swell and wind sea. The dataset (Luo et al., 2017) is publicly available from Science Data Bank (https://doi.org/10.11922/sciencedb.447).

  5. Wind and wave dataset for Matara, Sri Lanka

    Directory of Open Access Journals (Sweden)

    Y. Luo

    2018-01-01

    Full Text Available We present a continuous in situ hydro-meteorology observational dataset from a set of instruments first deployed in December 2012 in the south of Sri Lanka, facing toward the north Indian Ocean. In these waters, simultaneous records of wind and wave data are sparse due to difficulties in deploying measurement instruments, although the area hosts one of the busiest shipping lanes in the world. This study describes the survey, deployment, and measurements of wind and waves, with the aim of offering future users of the dataset the most comprehensive information possible. This dataset advances our understanding of the nearshore hydrodynamic processes and wave climate, including sea waves and swells, in the north Indian Ocean. Moreover, it is a valuable resource for ocean model parameterization and validation. The archived dataset (Table 1) is examined in detail, including wave data at two locations with water depths of 20 and 10 m comprising synchronous time series of wind, ocean astronomical tide, air pressure, etc. In addition, we use these wave observations to evaluate the ERA-Interim reanalysis product. Based on Buoy 2 data, the swells are the main component of waves year-round, although monsoons can markedly alter the proportion between swell and wind sea. The dataset (Luo et al., 2017) is publicly available from Science Data Bank (https://doi.org/10.11922/sciencedb.447).

  6. Heuristics for Relevancy Ranking of Earth Dataset Search Results

    Science.gov (United States)

    Lynnes, Christopher; Quinn, Patrick; Norton, James

    2016-01-01

    As the variety of Earth science datasets increases, science researchers find it more challenging to discover and select the datasets that best fit their needs. The most common way for search providers to address this problem is to rank the datasets returned for a query by their likely relevance to the user. Large web page search engines typically use text matching supplemented with reverse link counts, semantic annotations and user intent modeling. However, this produces uneven results when applied to dataset metadata records simply externalized as a web page. Fortunately, data and search providers have decades of experience in serving data user communities, allowing them to form heuristics that leverage the structure in the metadata together with knowledge about the user community. Some of these heuristics include specific ways of matching the user input to the essential measurements in the dataset and determining overlaps of time range and spatial areas. Heuristics based on the novelty of the datasets can prioritize later, better versions of data over similar predecessors. And knowledge of how different user types and communities use data can be brought to bear in cases where characteristics of the user (discipline, expertise) or their intent (applications, research) can be divined. The Earth Observing System Data and Information System has begun implementing some of these heuristics in the relevancy algorithm of its Common Metadata Repository search engine.
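    A toy version of such heuristics can combine an essential-measurement match, fractional temporal overlap, and a version-recency bonus. The field names and weights below are invented for illustration, not taken from the Common Metadata Repository:

```python
def score(query, dataset):
    """Toy relevance score: measurement match + temporal overlap + recency."""
    s = 0.0
    if query["measurement"] in dataset["measurements"]:
        s += 2.0                       # essential-measurement match
    q0, q1 = query["time_range"]
    d0, d1 = dataset["time_range"]
    overlap = max(0, min(q1, d1) - max(q0, d0))
    s += overlap / (q1 - q0)           # fraction of the query window covered
    s += 0.1 * dataset["version"]      # prefer later, better versions
    return s

datasets = [
    {"measurements": {"sea_surface_temperature"}, "time_range": (2000, 2010), "version": 1},
    {"measurements": {"sea_surface_temperature"}, "time_range": (2000, 2020), "version": 2},
]
query = {"measurement": "sea_surface_temperature", "time_range": (2005, 2015)}
ranked = sorted(datasets, key=lambda d: score(query, d), reverse=True)
```

    The later version with full temporal coverage ranks first; real systems would add spatial overlap and user-community weights on top of this skeleton.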

  7. QSAR ligand dataset for modelling mutagenicity, genotoxicity, and rodent carcinogenicity

    Directory of Open Access Journals (Sweden)

    Davy Guan

    2018-04-01

    Full Text Available Five datasets were constructed from ligand and bioassay result data from the literature. These datasets include bioassay results from the Ames mutagenicity assay, GreenScreen GADD-45a-GFP assay, Syrian Hamster Embryo (SHE) assay, and 2-year rat carcinogenicity assay results. These datasets provide information about chemical mutagenicity, genotoxicity and carcinogenicity.

  8. Mexican plums (Spondias spp.): their current distribution and potential distribution under climate change scenarios for Mexico

    OpenAIRE

    Arce-Romero, Antonio Rafael; Monterroso-Rivas, Alejandro Ismael; Gómez-Díaz, Jesús David; Cruz-León, Artemio

    2017-01-01

    Abstract Plums (Spondias spp.) are species native to Mexico with adaptive, nutritional and ethnobotanical advantages. The aim of this study was to assess the current and potential distribution of two species of Mexican plum: Spondias purpurea L. and Spondias mombin L. The method applied was ecological niche modeling in Maxent software, which has been used in Mexico with good results. In fieldwork, information on the presence of these species in the country was collected. In addition, environm...

  9. The Dataset of Countries at Risk of Electoral Violence

    OpenAIRE

    Birch, Sarah; Muchlinski, David

    2017-01-01

    Electoral violence is increasingly affecting elections around the world, yet researchers have been limited by a paucity of granular data on this phenomenon. This paper introduces and describes a new dataset of electoral violence – the Dataset of Countries at Risk of Electoral Violence (CREV) – that provides measures of 10 different types of electoral violence across 642 elections held around the globe between 1995 and 2013. The paper provides a detailed account of how and why the dataset was ...

  10. Towards interoperable and reproducible QSAR analyses: Exchange of datasets.

    Science.gov (United States)

    Spjuth, Ola; Willighagen, Egon L; Guha, Rajarshi; Eklund, Martin; Wikberg, Jarl Es

    2010-06-30

    QSAR is a widely used method to relate chemical structures to responses or properties based on experimental observations. Much effort has been made to evaluate and validate the statistical modeling in QSAR, but these analyses treat the dataset as fixed. An overlooked but highly important issue is the validation of the setup of the dataset, which comprises addition of chemical structures as well as selection of descriptors and software implementations prior to calculations. This process is hampered by the lack of standards and exchange formats in the field, making it virtually impossible to reproduce and validate analyses and drastically constraining collaborations and re-use of data. We present a step towards standardizing QSAR analyses by defining interoperable and reproducible QSAR datasets, consisting of an open XML format (QSAR-ML) which builds on an open and extensible descriptor ontology. The ontology provides an extensible way of uniquely defining descriptors for use in QSAR experiments, and the exchange format supports multiple versioned implementations of these descriptors. Hence, a dataset described by QSAR-ML makes its setup completely reproducible. We also provide a reference implementation as a set of plugins for Bioclipse which simplifies setup of QSAR datasets, and allows for exporting in QSAR-ML as well as old-fashioned CSV formats. The implementation facilitates addition of new descriptor implementations from locally installed software and remote Web services; the latter is demonstrated with REST and XMPP Web services. Standardized QSAR datasets open up new ways to store, query, and exchange data for subsequent analyses. QSAR-ML supports completely reproducible creation of datasets, solving the problems of defining which software components were used and their versions, and the descriptor ontology eliminates confusions regarding descriptors by defining them crisply. This makes it easy to join, extend, and combine datasets and hence work collectively, but
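    The core idea, pinning each descriptor to an ontology identifier plus a versioned software implementation, can be sketched with generic XML. The element and attribute names below are invented for illustration and are not the actual QSAR-ML schema:

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified dataset descriptor in the spirit of QSAR-ML:
# the descriptor references an ontology ID and a pinned implementation
# version, so the exact setup can be reproduced later.
root = ET.Element("qsarDataset")
ET.SubElement(root, "structure", {"id": "mol1", "smiles": "CCO"})
ET.SubElement(root, "descriptor", {
    "ontologyRef": "http://example.org/qsar#XLogP",  # assumed ontology ID
    "implementation": "CDK",                          # assumed provider name
    "version": "1.2.3",                               # pinned version
})
xml_text = ET.tostring(root, encoding="unicode")
```

    Because the version is stored alongside the ontology reference, a reader can reinstall the exact descriptor implementation rather than guessing which software produced the values.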

  11. Towards interoperable and reproducible QSAR analyses: Exchange of datasets

    Directory of Open Access Journals (Sweden)

    Spjuth Ola

    2010-06-01

    Full Text Available Abstract Background QSAR is a widely used method to relate chemical structures to responses or properties based on experimental observations. Much effort has been made to evaluate and validate the statistical modeling in QSAR, but these analyses treat the dataset as fixed. An overlooked but highly important issue is the validation of the setup of the dataset, which comprises addition of chemical structures as well as selection of descriptors and software implementations prior to calculations. This process is hampered by the lack of standards and exchange formats in the field, making it virtually impossible to reproduce and validate analyses and drastically constraining collaborations and re-use of data. Results We present a step towards standardizing QSAR analyses by defining interoperable and reproducible QSAR datasets, consisting of an open XML format (QSAR-ML) which builds on an open and extensible descriptor ontology. The ontology provides an extensible way of uniquely defining descriptors for use in QSAR experiments, and the exchange format supports multiple versioned implementations of these descriptors. Hence, a dataset described by QSAR-ML makes its setup completely reproducible. We also provide a reference implementation as a set of plugins for Bioclipse which simplifies setup of QSAR datasets, and allows for exporting in QSAR-ML as well as old-fashioned CSV formats. The implementation facilitates addition of new descriptor implementations from locally installed software and remote Web services; the latter is demonstrated with REST and XMPP Web services. Conclusions Standardized QSAR datasets open up new ways to store, query, and exchange data for subsequent analyses. QSAR-ML supports completely reproducible creation of datasets, solving the problems of defining which software components were used and their versions, and the descriptor ontology eliminates confusions regarding descriptors by defining them crisply. This makes it easy to join

  12. VideoWeb Dataset for Multi-camera Activities and Non-verbal Communication

    Science.gov (United States)

    Denina, Giovanni; Bhanu, Bir; Nguyen, Hoang Thanh; Ding, Chong; Kamal, Ahmed; Ravishankar, Chinya; Roy-Chowdhury, Amit; Ivers, Allen; Varda, Brenda

    Human-activity recognition is one of the most challenging problems in computer vision. Researchers from around the world have tried to solve this problem and have come a long way in recognizing simple motions and atomic activities. As the computer vision community heads toward fully recognizing human activities, a challenging and labeled dataset is needed. To respond to that need, we collected a dataset of realistic scenarios in a multi-camera network environment (VideoWeb) involving multiple persons performing dozens of different repetitive and non-repetitive activities. This chapter describes the details of the dataset. We believe that this VideoWeb Activities dataset is unique and it is one of the most challenging datasets available today. The dataset is publicly available online at http://vwdata.ee.ucr.edu/ along with the data annotation.

  13. Toward computational cumulative biology by combining models of biological datasets.

    Science.gov (United States)

    Faisal, Ali; Peltonen, Jaakko; Georgii, Elisabeth; Rung, Johan; Kaski, Samuel

    2014-01-01

    A main challenge of data-driven sciences is how to make maximal use of the progressively expanding databases of experimental datasets in order to keep research cumulative. We introduce the idea of a modeling-based dataset retrieval engine designed for relating a researcher's experimental dataset to earlier work in the field. The search is (i) data-driven to enable new findings, going beyond the state of the art of keyword searches in annotations, (ii) modeling-driven, to include both biological knowledge and insights learned from data, and (iii) scalable, as it is accomplished without building one unified grand model of all data. Assuming each dataset has been modeled beforehand, by the researchers or automatically by database managers, we apply a rapidly computable and optimizable combination model to decompose a new dataset into contributions from earlier relevant models. By using the data-driven decomposition, we identify a network of interrelated datasets from a large annotated human gene expression atlas. While tissue type and disease were major driving forces for determining relevant datasets, the relationships found were richer, and the model-based search was more accurate than the keyword search; moreover, it recovered biologically meaningful relationships that are not straightforwardly visible from annotations; for instance, between cells in different developmental stages such as thymocytes and T-cells. Data-driven links and citations matched to a large extent; the data-driven links even uncovered corrections to the publication data, as two of the most linked datasets were not highly cited and turned out to have wrong publication entries in the database.
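    The combination model can be sketched as a nonnegative decomposition of a new dataset's model summary over summaries of earlier datasets, with the weights serving as data-driven relevance links. The synthetic vectors and the simple projected-gradient solver below are illustrative only, not the paper's actual model:

```python
import numpy as np

rng = np.random.default_rng(1)
earlier = rng.random((4, 10))            # summaries of 4 earlier dataset models
new = 0.7 * earlier[2] + 0.3 * earlier[0]  # new dataset built from two of them

# Nonnegative least squares via projected gradient descent (illustration only):
# minimise ||earlier.T @ w - new||^2 subject to w >= 0.
w = np.zeros(4)
for _ in range(2000):
    grad = earlier @ (earlier.T @ w - new)
    w = np.maximum(0.0, w - 0.05 * grad)   # gradient step, then project to w >= 0

most_related = int(w.argmax())             # strongest data-driven link
```

    The nonzero weights identify which earlier datasets "explain" the new one, recovering relationships that keyword matching on annotations would miss.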

  14. Knowledge discovery in large model datasets in the marine environment: the THREDDS Data Server example

    Directory of Open Access Journals (Sweden)

    A. Bergamasco

    2012-06-01

    Full Text Available In order to monitor, describe and understand the marine environment, many research institutions are involved in the acquisition and distribution of ocean data, both from observations and models. Scientists from these institutions are spending too much time looking for, accessing, and reformatting data: they need better tools and procedures to make the science they do more efficient. The U.S. Integrated Ocean Observing System (US-IOOS) is working on making large amounts of distributed data usable in an easy and efficient way. It is essentially a network of scientists, technicians and technologies designed to acquire, collect and disseminate observational and modelled data resulting from coastal and oceanic marine regions investigations to researchers, stakeholders and policy makers. In order to be successful, this effort requires standard data protocols, web services and standards-based tools. Starting from the US-IOOS approach, which is being adopted throughout much of the oceanographic and meteorological sectors, we describe here the CNR-ISMAR Venice experience in the direction of setting up a national Italian IOOS framework using the THREDDS (THematic Real-time Environmental Distributed Data Services) Data Server (TDS), a middleware designed to fill the gap between data providers and data users. The TDS provides services that allow data users to find the data sets pertaining to their scientific needs, to access, to visualize and to use them in an easy way, without downloading files to the local workspace. In order to achieve this, it is necessary that the data providers make their data available in a standard form that the TDS understands, and with sufficient metadata to allow the data to be read and searched in a standard way. The core idea is then to utilize a Common Data Model (CDM), a unified conceptual model that describes different datatypes within each dataset. More specifically, Unidata (www.unidata.ucar.edu) has developed the CDM

  15. Exploiting Distributed, Heterogeneous and Sensitive Data Stocks while Maintaining the Owner's Data Sovereignty.

    Science.gov (United States)

    Lablans, M; Kadioglu, D; Muscholl, M; Ückert, F

    2015-01-01

    To achieve statistical significance in medical research, biological or data samples from several bio- or databanks often need to be complemented by those of other institutions. For that purpose, IT-based search services have been established to locate datasets matching a given set of criteria in databases distributed across several institutions. However, previous approaches require data owners to disclose information about their samples, raising a barrier to their participation in the network. Our objective was to devise a method to search distributed databases for datasets matching a given set of criteria while fully maintaining their owner's data sovereignty. As a modification to traditional federated search services, we propose the decentral search, which allows the data owner a high degree of control. Relevant data are loaded into local bridgeheads, each under their owner's sovereignty. Researchers can formulate criteria sets along with a project proposal using a central search broker, which then notifies the bridgeheads. The criteria are, however, treated as an inquiry rather than a query: instead of responding with results, bridgeheads notify their owner and wait for his/her decision regarding whether and what to answer based on the criteria set, the matching datasets and the specific project proposal. Without the owner's explicit consent, no data leaves his/her institution. The decentral search has been deployed in one of the six German Centers for Health Research, comprising eleven university hospitals. In the process, compliance with German data protection regulations has been confirmed. The decentral search also marks the centerpiece of an open source registry software toolbox aiming to build a national registry of rare diseases in Germany. While the sacrifice of real-time answers impairs some use cases, it leads to several beneficial side effects: improved data protection due to data parsimony, tolerance for incomplete data schema mappings and flexibility with regard

  16. 3DSEM: A 3D microscopy dataset

    Directory of Open Access Journals (Sweden)

    Ahmad P. Tafti

    2016-03-01

    Full Text Available The Scanning Electron Microscope (SEM), as a 2D imaging instrument, has been widely used in many scientific disciplines including the biological, mechanical, and materials sciences to determine the surface attributes of microscopic objects. However, SEM micrographs remain 2D images. To effectively measure and visualize the surface properties, we need to truly restore the 3D shape model from 2D SEM images. Having 3D surfaces would provide the anatomic shape of micro-samples, which allows for quantitative measurements and informative visualization of the specimens being investigated. The 3DSEM is a dataset for 3D microscopy vision which is freely available at [1] for any academic, educational, and research purposes. The dataset includes both 2D images and 3D reconstructed surfaces of several real microscopic samples. Keywords: 3D microscopy dataset, 3D microscopy vision, 3D SEM surface reconstruction, Scanning Electron Microscope (SEM)

  17. Active Semisupervised Clustering Algorithm with Label Propagation for Imbalanced and Multidensity Datasets

    Directory of Open Access Journals (Sweden)

    Mingwei Leng

    2013-01-01

    Full Text Available The accuracy of most existing semisupervised clustering algorithms based on a small labeled dataset is low when dealing with multidensity and imbalanced datasets, and labeling data is quite expensive and time consuming in many real-world applications. This paper focuses on active data selection and semisupervised clustering for multidensity and imbalanced datasets, and proposes an active semisupervised clustering algorithm. The proposed algorithm uses an active mechanism for data selection to minimize the amount of labeled data, and it utilizes multiple thresholds to expand the labeled datasets on multidensity and imbalanced datasets. Three standard datasets and one synthetic dataset are used to demonstrate the proposed algorithm, and the experimental results show that the proposed semisupervised clustering algorithm has higher accuracy and more stable performance than other clustering and semisupervised clustering algorithms, especially when the datasets are multidensity and imbalanced.
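    Label propagation, one building block of such semisupervised algorithms, can be sketched in a few lines. The clamped-seed iteration below is a generic textbook version on a tiny synthetic dataset, not the paper's exact algorithm:

```python
import numpy as np

# Two obvious 1-D clusters; only one point in each cluster is labelled.
X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1]])
W = np.exp(-(X - X.T) ** 2)               # Gaussian affinity matrix
P = W / W.sum(axis=1, keepdims=True)      # row-normalised transition matrix

Y = np.zeros((5, 2))
Y[0, 0] = 1.0                             # point 0 labelled class 0
Y[3, 1] = 1.0                             # point 3 labelled class 1

F = Y.copy()
for _ in range(50):
    F = P @ F                             # propagate labels to neighbours
    F[0], F[3] = Y[0], Y[3]               # clamp the labelled seed points

labels = F.argmax(axis=1)                 # predicted cluster for every point
```

    The two seeds spread through the affinity graph so the unlabelled points inherit the label of their dense neighbourhood, which is why only a small labelled set is needed.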

  18. A reanalysis dataset of the South China Sea

    Science.gov (United States)

    Zeng, Xuezhi; Peng, Shiqiu; Li, Zhijin; Qi, Yiquan; Chen, Rongyu

    2014-01-01

    Ocean reanalysis provides a temporally continuous and spatially gridded four-dimensional estimate of the ocean state for a better understanding of the ocean dynamics and its spatial/temporal variability. Here we present a 19-year (1992–2010) high-resolution ocean reanalysis dataset of the upper ocean in the South China Sea (SCS) produced from an ocean data assimilation system. A wide variety of observations, including in-situ temperature/salinity profiles, ship-measured and satellite-derived sea surface temperatures, and sea surface height anomalies from satellite altimetry, are assimilated into the outputs of an ocean general circulation model using a multi-scale incremental three-dimensional variational data assimilation scheme, yielding a daily high-resolution reanalysis dataset of the SCS. Comparisons between the reanalysis and independent observations support the reliability of the dataset. The presented dataset provides the SCS research community with an important data source for studying the thermodynamic processes of the ocean circulation and meso-scale features in the SCS, including their spatial and temporal variability. PMID:25977803

  19. A Dataset for Visual Navigation with Neuromorphic Methods

    Directory of Open Access Journals (Sweden)

    Francisco Barranco

    2016-02-01

    Full Text Available Standardized benchmarks in Computer Vision have greatly contributed to the advance of approaches to many problems in the field. If we want to enhance the visibility of event-driven vision and increase its impact, we will need benchmarks that allow comparison among different neuromorphic methods as well as comparison to Computer Vision conventional approaches. We present datasets to evaluate the accuracy of frame-free and frame-based approaches for tasks of visual navigation. Similar to conventional Computer Vision datasets, we provide synthetic and real scenes, with the synthetic data created with graphics packages, and the real data recorded using a mobile robotic platform carrying a dynamic and active pixel vision sensor (DAVIS and an RGB+Depth sensor. For both datasets the cameras move with a rigid motion in a static scene, and the data includes the images, events, optic flow, 3D camera motion, and the depth of the scene, along with calibration procedures. Finally, we also provide simulated event data generated synthetically from well-known frame-based optical flow datasets.

  20. Cross-Dataset Analysis and Visualization Driven by Expressive Web Services

    Science.gov (United States)

    Alexandru Dumitru, Mircea; Catalin Merticariu, Vlad

    2015-04-01

    The deluge of data that is hitting us every day from satellite and airborne sensors is changing the workflow of environmental data analysts and modelers. Web geo-services now play a fundamental role: rather than requiring users to download and store the data beforehand, they interact in real time with GIS applications. Due to the very large amount of data that is curated and made available by web services, it is crucial to deploy smart solutions for optimizing network bandwidth, reducing duplication of data and moving the processing closer to the data. In this context we have created a visualization application for analysis and cross-comparison of aerosol optical thickness datasets. The application aims to help researchers identify and visualize discrepancies between datasets coming from various sources, having different spatial and time resolutions. It also acts as a proof of concept for the integration of OGC Web Services under a user-friendly interface that provides beautiful visualizations of the explored data. The tool was built on top of the World Wind engine, a Java-based virtual globe built by NASA and the open source community. For data retrieval and processing we exploited the potential of the OGC Web Coverage Service, the most exciting aspect being its processing extension, the OGC Web Coverage Processing Service (WCPS) standard. A WCPS-compliant service allows a client to execute a processing query on any coverage offered by the server. By exploiting a full grammar, several different kinds of information can be retrieved from one or more datasets together: scalar condensers, cross-sectional profiles, comparison maps and plots, etc. This combination of technology made the application versatile and portable. As the processing is done on the server side, we ensured that the minimal amount of data is transferred and that the processing is done on a fully capable server, leaving the client hardware resources to be used for rendering the visualization.
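    A WCPS processing query is typically shipped to the server as a WCS ProcessCoverages request; the difference map then comes back already rendered, so only the result image crosses the network. The coverage names and endpoint below are invented for illustration:

```python
import urllib.parse

# Hypothetical WCPS request: subtract two aerosol-optical-thickness
# coverages on the server and encode the discrepancy map as PNG.
query = """
for $a in (AOT_MODIS), $b in (AOT_MISR)
return encode($a - $b, "png")
"""

# Assumed endpoint; a real deployment would expose its own OWS URL.
url = "https://example.org/rasdaman/ows?" + urllib.parse.urlencode({
    "service": "WCS",
    "version": "2.0.1",
    "request": "ProcessCoverages",
    "query": query.strip(),
})
```

    Shipping the expression instead of the rasters is what lets the client stay thin: the server evaluates the whole WCPS grammar and returns only the encoded result.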

  1. Sparse Group Penalized Integrative Analysis of Multiple Cancer Prognosis Datasets

    Science.gov (United States)

    Liu, Jin; Huang, Jian; Xie, Yang; Ma, Shuangge

    2014-01-01

    SUMMARY In cancer research, high-throughput profiling studies have been extensively conducted, searching for markers associated with prognosis. Because of the “large d, small n” characteristic, results generated from the analysis of a single dataset can be unsatisfactory. Recent studies have shown that integrative analysis, which simultaneously analyzes multiple datasets, can be more effective than single-dataset analysis and classic meta-analysis. In most existing integrative analyses, the homogeneity model has been assumed, which postulates that different datasets share the same set of markers. Several approaches have been designed to reinforce this assumption. In practice, different datasets may differ in terms of patient selection criteria, profiling techniques, and many other aspects. Such differences may make the homogeneity model too restrictive. In this study, we assume the heterogeneity model, under which different datasets are allowed to have different sets of markers. With multiple cancer prognosis datasets, we adopt the AFT (accelerated failure time) model to describe survival. This model may have the lowest computational cost among popular semiparametric survival models. For marker selection, we adopt a sparse group MCP (minimax concave penalty) approach. This approach has an intuitive formulation and can be computed using an effective group coordinate descent algorithm. Simulation studies show that it outperforms the existing approaches under both the homogeneity and heterogeneity models. Data analysis further demonstrates the merit of the heterogeneity model and the proposed approach. PMID:23938111
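    The scalar MCP penalty at the heart of the marker-selection step can be written down directly; the sparse group variant composes this penalty over groups of coefficients, so the sketch below shows only the scalar building block:

```python
import numpy as np

def mcp(t, lam, gamma):
    """Minimax concave penalty of a coefficient magnitude t.

    Matches the lasso penalty lam*|t| near zero, then flattens to a
    constant 0.5*gamma*lam^2, reducing bias on large coefficients.
    """
    t = np.abs(t)
    return np.where(
        t <= gamma * lam,
        lam * t - t ** 2 / (2 * gamma),   # concave ramp near zero
        0.5 * gamma * lam ** 2,           # flat beyond gamma*lam
    )
```

    Because the penalty saturates, large effects are estimated nearly without shrinkage, which is what makes MCP attractive for marker selection compared with the lasso.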

  2. A bias-corrected CMIP5 dataset for Africa using the CDF-t method - a contribution to agricultural impact studies

    Science.gov (United States)

    Moise Famien, Adjoua; Janicot, Serge; Delfin Ochou, Abe; Vrac, Mathieu; Defrance, Dimitri; Sultan, Benjamin; Noël, Thomas

    2018-03-01

    The objective of this paper is to present a new dataset of bias-corrected CMIP5 global climate model (GCM) daily data over Africa. This dataset was obtained using the cumulative distribution function transform (CDF-t) method, a method that has been applied to several regions and contexts but never to Africa. Here CDF-t has been applied over the period 1950-2099 combining Historical runs and climate change scenarios for six variables: precipitation, mean near-surface air temperature, near-surface maximum air temperature, near-surface minimum air temperature, surface downwelling shortwave radiation, and wind speed, which are critical variables for agricultural purposes. WFDEI has been used as the reference dataset to correct the GCMs. Evaluation of the results over West Africa has been carried out on a list of priority user-based metrics that were discussed and selected with stakeholders. It includes simulated maize yields obtained with a crop model. These bias-corrected GCM data have been compared with another available dataset of bias-corrected GCMs using WATCH Forcing Data as the reference dataset. The impact of the WFD, WFDEI, and EWEMBI reference datasets has also been examined in detail. It is shown that CDF-t is very effective at removing the biases and reducing the high inter-GCM scattering. Differences with other bias-corrected GCM data are mainly due to the differences among the reference datasets. This is particularly true for surface downwelling shortwave radiation, which has a significant impact in terms of simulated maize yields. Projections of future yields over West Africa are quite different, depending on the bias-correction method used. However all these projections show a similar relative decreasing trend over the 21st century.
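    CDF-t generalises empirical quantile mapping to account for the change of the model CDF over time; the simpler quantile-mapping core can be sketched as follows, with synthetic data standing in for the GCM and the WFDEI reference:

```python
import numpy as np

rng = np.random.default_rng(0)
reference = rng.gamma(2.0, 2.0, 1000)        # stand-in for the reference data
model = rng.gamma(2.0, 2.0, 1000) + 1.5      # stand-in GCM with a +1.5 bias

def quantile_map(x, model_hist, ref_hist):
    """Map each model value to the reference value at the same quantile."""
    q = np.searchsorted(np.sort(model_hist), x) / len(model_hist)
    q = np.clip(q, 0.0, 1.0 - 1e-9)
    return np.quantile(ref_hist, q)

corrected = quantile_map(model, model, reference)
bias_before = model.mean() - reference.mean()
bias_after = corrected.mean() - reference.mean()
```

    Replacing model values by reference values at matching quantiles removes the systematic offset while preserving the ranking of events; CDF-t additionally transports the correction into the future period where the model CDF has shifted.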

  3. Development of a distributed air pollutant dry deposition modeling framework

    International Nuclear Information System (INIS)

    Hirabayashi, Satoshi; Kroll, Charles N.; Nowak, David J.

    2012-01-01

    A distributed air pollutant dry deposition modeling system was developed with a geographic information system (GIS) to enhance the functionality of i-Tree Eco (i-Tree, 2011). With the developed system, temperature, leaf area index (LAI) and air pollutant concentration in a spatially distributed form can be estimated, and based on these and other input variables, dry deposition of carbon monoxide (CO), nitrogen dioxide (NO2), sulfur dioxide (SO2), and particulate matter less than 10 microns (PM10) to trees can be spatially quantified. Employing nationally available road network, traffic volume, air pollutant emission/measurement and meteorological data, the developed system provides a framework for U.S. city managers to identify spatial patterns of urban forest and locate potential areas for future urban forest planting and protection to improve air quality. To exhibit the usability of the framework, a case study was performed for July and August of 2005 in Baltimore, MD. - Highlights: ► A distributed air pollutant dry deposition modeling system was developed. ► The developed system enhances the functionality of i-Tree Eco. ► The developed system employs nationally available input datasets. ► The developed system is transferable to any U.S. city. ► Future planting and protection spots were visually identified in a case study. - Employing nationally available datasets and a GIS, this study will provide urban forest managers in U.S. cities a framework to quantify and visualize urban forest structure and its air pollution removal effect.
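    At its simplest, dry deposition to a canopy is a first-order flux. The function below is a simplified stand-in for illustration; the actual i-Tree Eco formulation derives the deposition velocity from hourly meteorology and resistance terms:

```python
def dry_deposition_flux(vd_m_per_s, conc_ug_per_m3, lai):
    """Hypothetical simplified flux in ug per m^2 of ground area per second.

    Flux = deposition velocity x ambient concentration, scaled here by
    leaf area index (LAI) to account for canopy surface area.
    """
    return vd_m_per_s * conc_ug_per_m3 * lai

# e.g. a deposition velocity of 0.005 m/s, 20 ug/m3 of pollutant, LAI of 3
flux = dry_deposition_flux(0.005, 20.0, 3.0)
```

    Spatially distributed LAI and concentration grids are what turn this per-point calculation into the city-wide maps the framework produces.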

  4. Use of country of birth as an indicator of refugee background in health datasets

    Science.gov (United States)

    2014-01-01

    Background Routine public health databases contain a wealth of data useful for research among vulnerable or isolated groups, who may be under-represented in traditional medical research. Identifying specific vulnerable populations, such as resettled refugees, can be particularly challenging; often country of birth is the sole indicator of whether an individual has a refugee background. The objective of this article was to review strengths and weaknesses of different methodological approaches to identifying resettled refugees and comparison groups from routine health datasets and to propose the application of additional methodological rigour in future research. Discussion Methodological approaches to selecting refugee and comparison groups from existing routine health datasets vary widely and are often explained in insufficient detail. Linked data systems or datasets from specialized refugee health services can accurately select resettled refugee and asylum seeker groups but have limited availability and can be selective. In contrast, country of birth is commonly collected in routine health datasets but a robust method for selecting humanitarian source countries based solely on this information is required. The authors recommend use of national immigration data to objectively identify countries of birth with high proportions of humanitarian entrants, matched by time period to the study dataset. When available, additional migration indicators may help to better understand migration as a health determinant. Methodologically, if multiple countries of birth are combined, the proportion of the sample represented by each country of birth should be included, with sub-analysis of individual countries of birth potentially providing further insights, if population size allows. United Nations-defined world regions provide an objective framework for combining countries of birth when necessary. 
A comparison group of economic migrants from the same world region may be appropriate

  5. The potential for health risks from intrusion of contaminants into the distribution system from pressure transients.

    Science.gov (United States)

    LeChevallier, Mark W; Gullick, Richard W; Karim, Mohammad R; Friedman, Melinda; Funk, James E

    2003-03-01

    The potential for public health risks associated with intrusion of contaminants into water supply distribution systems resulting from transient low or negative pressures is assessed. It is shown that transient pressure events occur in distribution systems; that during these negative pressure events pipeline leaks provide a potential portal for entry of groundwater into treated drinking water; and that faecal indicators and culturable human viruses are present in the soil and water exterior to the distribution system. To date, all observed negative pressure events have been related to power outages or other pump shutdowns. Although there are insufficient data to indicate whether pressure transients are a substantial source of risk to water quality in the distribution system, mitigation techniques can be implemented, principally the maintenance of an effective disinfectant residual throughout the distribution system, leak control, redesign of air relief venting, and more rigorous application of existing engineering standards. Use of high-speed pressure data loggers and surge modelling may have some merit, but more research is needed.

  6. Electromotive Potential Distribution and Electronic Leak Currents in Working YSZ Based SOCs

    DEFF Research Database (Denmark)

    Mogensen, Mogens Bjerg; Jacobsen, Torben

    2009-01-01

The size of the electronic leak currents through the YSZ electrolyte of solid oxide cells has been calculated using basic solid-state electrochemical relations and literature data. The distributions of the electromotive potential, the Galvani potential, and the concentrations of electrons, e, and electron holes, h, were also calculated, as these parameters are the basis for understanding the electronic conductivity that causes the electronic leak currents. The results are illustrated with examples. The effects of electrolyte thickness, temperature and cell voltage on the electronic leak current...

  7. Potential pitfalls of single phasing operation in a three phase distribution network

    Energy Technology Data Exchange (ETDEWEB)

    Narayanan, V S

    1986-07-01

Finding it difficult to cope with the increased demand for electric power, some electricity boards have resorted to single-phasing techniques in the distribution system. This practice is harmful to equipment in the power system. Some of the potential dangers associated with this undesirable practice are briefly discussed.

  8. Construction and Analysis of Long-Term Surface Temperature Dataset in Fujian Province

    Science.gov (United States)

    Li, W. E.; Wang, X. Q.; Su, H.

    2017-09-01

Land surface temperature (LST) is a key parameter of land surface physical processes on global and regional scales, linking the heat fluxes and interactions between the ground and atmosphere. Based on MODIS 8-day LST products (MOD11A2) from the split-window algorithms, we constructed the monthly and annual LST dataset of Fujian Province from 2000 to 2015. We then analyzed the monthly and yearly LST time series and further investigated the LST distribution and its evolution features. The average LST of Fujian Province reached its highest in July and its lowest in January. The monthly and annual LST time series present significantly periodic features (annual and interannual) from 2000 to 2015. The spatial distribution showed that the LST in the north and west of Fujian Province was lower than in the south and east. With the rapid development and urbanization of the coastal area in Fujian Province, the LST in the coastal urban region was significantly higher than in the mountainous rural region. The LST distributions might be affected by climate, topography and land cover types. The spatio-temporal distribution characteristics of LST could provide good references for agricultural layout and environmental monitoring in Fujian Province.

  9. CONSTRUCTION AND ANALYSIS OF LONG-TERM SURFACE TEMPERATURE DATASET IN FUJIAN PROVINCE

    Directory of Open Access Journals (Sweden)

    W. E. Li

    2017-09-01

Full Text Available Land surface temperature (LST) is a key parameter of land surface physical processes on global and regional scales, linking the heat fluxes and interactions between the ground and atmosphere. Based on MODIS 8-day LST products (MOD11A2) from the split-window algorithms, we constructed the monthly and annual LST dataset of Fujian Province from 2000 to 2015. We then analyzed the monthly and yearly LST time series and further investigated the LST distribution and its evolution features. The average LST of Fujian Province reached its highest in July and its lowest in January. The monthly and annual LST time series present significantly periodic features (annual and interannual) from 2000 to 2015. The spatial distribution showed that the LST in the north and west of Fujian Province was lower than in the south and east. With the rapid development and urbanization of the coastal area in Fujian Province, the LST in the coastal urban region was significantly higher than in the mountainous rural region. The LST distributions might be affected by climate, topography and land cover types. The spatio-temporal distribution characteristics of LST could provide good references for agricultural layout and environmental monitoring in Fujian Province.

  10. Potential Distribution Predicted for Rhynchophorus ferrugineus in China under Different Climate Warming Scenarios.

    Directory of Open Access Journals (Sweden)

    Xuezhen Ge

Full Text Available As the primary pest of palm trees, Rhynchophorus ferrugineus (Olivier) (Coleoptera: Curculionidae) has caused serious harm to palms since it first invaded China. The present study used CLIMEX 1.1 to predict the potential distribution of R. ferrugineus in China according to both current climate data (1981-2010) and future climate warming estimates based on simulated climate data for the 2020s (2011-2040) provided by the Tyndall Center for Climate Change Research (TYN SC 2.0). Additionally, the Ecoclimatic Index (EI) values calculated for different climatic conditions (current and future, as simulated by the B2 scenario) were compared. Areas with a suitable climate for R. ferrugineus distribution were located primarily in central China according to the current climate data, with the northern boundary of the distribution reaching 40.1°N and including Tibet, north Sichuan, central Shaanxi, south Shanxi, and east Hebei. There was little difference among the potential distributions predicted by the four emission scenarios for future climate warming. The primary prediction under the future climate warming models was that, compared with the current climate model, the number of highly favorable habitats would increase significantly and expand into northern China, whereas the number of both favorable and marginally favorable habitats would decrease. Contrast analysis of EI values suggested that climate change and the density of site distribution were the main effectors of the changes in EI values. These results will help to improve control measures, prevent the spread of this pest, and revise the targeted quarantine areas.

  11. Distributed and parallel approach for handle and perform huge datasets

    Science.gov (United States)

    Konopko, Joanna

    2015-12-01

Big Data refers to the dynamic, large and disparate volumes of data coming from many different sources (tools, machines, sensors, mobile devices), uncorrelated with each other. It requires new, innovative and scalable technology to collect, host and analytically process the vast amount of data. A proper architecture for a system that processes huge datasets is needed. In this paper, distributed and parallel system architectures are compared using the example of the MapReduce (MR) Hadoop platform and a parallel database platform (DBMS). This paper also analyzes the problem of extracting and handling valuable information from petabytes of data. Both paradigms, MapReduce and parallel DBMS, are described and compared. A hybrid architecture approach is also proposed, which could be used to solve the analyzed problem of storing and processing Big Data.
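The MapReduce paradigm compared in this abstract can be sketched in miniature. The following single-process word count is only illustrative (a real Hadoop job distributes the map, shuffle, and reduce phases across a cluster; the function and record names here are invented for the sketch):

```python
from collections import defaultdict

def map_phase(records):
    # Map: emit (key, 1) pairs for each word in each input record
    for record in records:
        for word in record.split():
            yield word, 1

def shuffle(pairs):
    # Shuffle: group intermediate values by key, as the framework would
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate the grouped values for each key
    return {key: sum(values) for key, values in groups.items()}

records = ["sensor data", "mobile sensor data", "machine data"]
counts = reduce_phase(shuffle(map_phase(records)))
print(counts)  # "data" appears once in every record
```

A parallel DBMS would express the same aggregation declaratively (e.g. `GROUP BY` with `COUNT`), which is the core trade-off the paper examines.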

  12. A geospatial database model for the management of remote sensing datasets at multiple spectral, spatial, and temporal scales

    Science.gov (United States)

    Ifimov, Gabriela; Pigeau, Grace; Arroyo-Mora, J. Pablo; Soffer, Raymond; Leblanc, George

    2017-10-01

In this study the development and implementation of a geospatial database model for the management of multiscale datasets encompassing airborne imagery and associated metadata is presented. To develop the multi-source geospatial database we used a Relational Database Management System (RDBMS) on a Structured Query Language (SQL) server, which was then integrated into ArcGIS and implemented as a geodatabase. The acquired datasets were compiled, standardized, and integrated into the RDBMS, where logical associations between different types of information (e.g. location, date, and instrument) were established. Airborne data, at different processing levels (digital numbers through geocorrected reflectance), were implemented in the geospatial database, where the datasets are linked spatially and temporally. An example dataset consisting of airborne hyperspectral imagery, collected for inter- and intra-annual vegetation characterization and detection of potential hydrocarbon seepage events over pipeline areas, is presented. Our work provides a model for the management of airborne imagery, which is a challenging aspect of data management in remote sensing, especially when large volumes of data are collected.
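The kind of relational linkage the abstract describes (imagery rows tied to a flight by date and to an instrument) can be sketched with an in-memory SQLite database. The table and column names are hypothetical stand-ins, not the paper's actual schema:

```python
import sqlite3

# In-memory stand-in for the SQL-server geodatabase: images are linked
# to flights, and flights to instruments, so location, date, and
# instrument can be queried together.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE instrument (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE flight (
    id INTEGER PRIMARY KEY,
    flight_date TEXT,
    instrument_id INTEGER REFERENCES instrument(id)
);
CREATE TABLE image (
    id INTEGER PRIMARY KEY,
    flight_id INTEGER REFERENCES flight(id),
    processing_level TEXT,  -- digital numbers ... geocorrected reflectance
    lat REAL, lon REAL
);
""")
con.execute("INSERT INTO instrument VALUES (1, 'hyperspectral sensor')")
con.execute("INSERT INTO flight VALUES (1, '2016-07-15', 1)")
con.execute("INSERT INTO image VALUES (1, 1, 'geocorrected reflectance', 45.5, -73.6)")

# Logical association across tables: processing level, date, instrument
row = con.execute("""
    SELECT i.processing_level, f.flight_date, s.name
    FROM image i JOIN flight f ON i.flight_id = f.id
                 JOIN instrument s ON f.instrument_id = s.id
""").fetchone()
print(row)
```

The joins are what make the datasets "linked spatially and temporally": any image can be traced back to the acquisition date and sensor that produced it.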

  13. Estimation of potential distribution of gas hydrate in the northern South China Sea

    Science.gov (United States)

    Wang, Chunjuan; Du, Dewen; Zhu, Zhiwei; Liu, Yonggang; Yan, Shijuan; Yang, Gang

    2010-05-01

    Gas hydrate research has significant importance for securing world energy resources, and has the potential to produce considerable economic benefits. Previous studies have shown that the South China Sea is an area that harbors gas hydrates. However, there is a lack of systematic investigations and understanding on the distribution of gas hydrate throughout the region. In this paper, we applied mineral resource quantitative assessment techniques to forecast and estimate the potential distribution of gas hydrate resources in the northern South China Sea. However, current hydrate samples from the South China Sea are too few to produce models of occurrences. Thus, according to similarity and contrast principles of mineral outputs, we can use a similar hydrate-mining environment with sufficient gas hydrate data as a testing ground for modeling northern South China Sea gas hydrate conditions. We selected the Gulf of Mexico, which has extensively studied gas hydrates, to develop predictive models of gas hydrate distributions, and to test errors in the model. Then, we compared the existing northern South China Sea hydrate-mining data with the Gulf of Mexico characteristics, and collated the relevant data into the model. Subsequently, we applied the model to the northern South China Sea to obtain the potential gas hydrate distribution of the area, and to identify significant exploration targets. Finally, we evaluated the reliability of the predicted results. The south seabed area of Taiwan Bank is recommended as a priority exploration target. The Zhujiang Mouth, Southeast Hainan, and Southwest Taiwan Basins, including the South Bijia Basin, also are recommended as exploration target areas. In addition, the method in this paper can provide a useful predictive approach for gas hydrate resource assessment, which gives a scientific basis for construction and implementation of long-term planning for gas hydrate exploration and general exploitation of the seabed of China.

  14. An Analysis on Better Testing than Training Performances on the Iris Dataset

    NARCIS (Netherlands)

    Schutten, Marten; Wiering, Marco

    2016-01-01

The Iris dataset is a well-known dataset containing information on three different types of Iris flowers. A typical and popular method for solving classification problems on datasets such as the Iris set is the support vector machine (SVM). In order to do so, the dataset is separated into a set used
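The train/test evaluation setup behind comparisons like this one can be sketched with stdlib Python. A nearest-centroid classifier on synthetic two-class data stands in here for an SVM on Iris (both the classifier and the data are simplifications, chosen so the sketch runs without external libraries):

```python
import random

def fit_centroids(train):
    # One mean point ("centroid") per class label
    sums = {}
    for (x1, x2), label in train:
        s = sums.setdefault(label, [0.0, 0.0, 0])
        s[0] += x1; s[1] += x2; s[2] += 1
    return {lab: (s[0] / s[2], s[1] / s[2]) for lab, s in sums.items()}

def predict(centroids, point):
    # Assign the label of the nearest class centroid
    return min(centroids, key=lambda lab: (point[0] - centroids[lab][0]) ** 2
                                        + (point[1] - centroids[lab][1]) ** 2)

def accuracy(centroids, data):
    return sum(predict(centroids, p) == lab for p, lab in data) / len(data)

random.seed(1)
# Two well-separated synthetic classes standing in for two Iris species
data = ([((random.gauss(0, 1), random.gauss(0, 1)), "setosa") for _ in range(50)]
      + [((random.gauss(4, 1), random.gauss(4, 1)), "versicolor") for _ in range(50)])
random.shuffle(data)
train, test = data[:70], data[70:]  # the split the abstract refers to

centroids = fit_centroids(train)
train_acc, test_acc = accuracy(centroids, train), accuracy(centroids, test)
print(train_acc, test_acc)
```

Because the held-out set is small, its accuracy can by chance exceed the training accuracy, which is the phenomenon the paper analyzes.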

  15. Technical note: Space-time analysis of rainfall extremes in Italy: clues from a reconciled dataset

    Science.gov (United States)

    Libertino, Andrea; Ganora, Daniele; Claps, Pierluigi

    2018-05-01

    Like other Mediterranean areas, Italy is prone to the development of events with significant rainfall intensity, lasting for several hours. The main triggering mechanisms of these events are quite well known, but the aim of developing rainstorm hazard maps compatible with their actual probability of occurrence is still far from being reached. A systematic frequency analysis of these occasional highly intense events would require a complete countrywide dataset of sub-daily rainfall records, but this kind of information was still lacking for the Italian territory. In this work several sources of data are gathered, for assembling the first comprehensive and updated dataset of extreme rainfall of short duration in Italy. The resulting dataset, referred to as the Italian Rainfall Extreme Dataset (I-RED), includes the annual maximum rainfalls recorded in 1 to 24 consecutive hours from more than 4500 stations across the country, spanning the period between 1916 and 2014. A detailed description of the spatial and temporal coverage of the I-RED is presented, together with an exploratory statistical analysis aimed at providing preliminary information on the climatology of extreme rainfall at the national scale. Due to some legal restrictions, the database can be provided only under certain conditions. Taking into account the potentialities emerging from the analysis, a description of the ongoing and planned future work activities on the database is provided.

  16. Interactive visualization and analysis of multimodal datasets for surgical applications.

    Science.gov (United States)

    Kirmizibayrak, Can; Yim, Yeny; Wakid, Mike; Hahn, James

    2012-12-01

    Surgeons use information from multiple sources when making surgical decisions. These include volumetric datasets (such as CT, PET, MRI, and their variants), 2D datasets (such as endoscopic videos), and vector-valued datasets (such as computer simulations). Presenting all the information to the user in an effective manner is a challenging problem. In this paper, we present a visualization approach that displays the information from various sources in a single coherent view. The system allows the user to explore and manipulate volumetric datasets, display analysis of dataset values in local regions, combine 2D and 3D imaging modalities and display results of vector-based computer simulations. Several interaction methods are discussed: in addition to traditional interfaces including mouse and trackers, gesture-based natural interaction methods are shown to control these visualizations with real-time performance. An example of a medical application (medialization laryngoplasty) is presented to demonstrate how the combination of different modalities can be used in a surgical setting with our approach.

  17. Distributed Parallel Endmember Extraction of Hyperspectral Data Based on Spark

    Directory of Open Access Journals (Sweden)

    Zebin Wu

    2016-01-01

Full Text Available Due to the increasing dimensionality and volume of remotely sensed hyperspectral data, the development of acceleration techniques for massive hyperspectral image analysis approaches is a very important challenge. Cloud computing offers many possibilities for distributed processing of hyperspectral datasets. This paper proposes a novel distributed parallel endmember extraction method based on iterative error analysis that utilizes cloud computing principles to efficiently process massive hyperspectral data. The proposed method takes advantage of technologies including the MapReduce programming model, the Hadoop Distributed File System (HDFS), and Apache Spark to realize a distributed parallel implementation of hyperspectral endmember extraction, which significantly accelerates hyperspectral processing and provides high-throughput access to large hyperspectral data. The experimental results, obtained by extracting endmembers of hyperspectral datasets on a cloud computing platform built on a cluster, demonstrate the effectiveness and computational efficiency of the proposed method.

  18. Stability of Spatial Distributions of Stink Bugs, Boll Injury, and NDVI in Cotton.

    Science.gov (United States)

    Reay-Jones, Francis P F; Greene, Jeremy K; Bauer, Philip J

    2016-10-01

A 3-yr study was conducted to determine the degree of aggregation of stink bugs and boll injury in cotton, Gossypium hirsutum L., and their spatial association with a multispectral vegetation index (normalized difference vegetation index [NDVI]). Using spatial analysis by distance indices (SADIE), stink bugs were less frequently aggregated (17% of analyses for adults and 4% for nymphs) than boll injury (36%). NDVI values were also significantly aggregated within fields in 19 of 48 analyses (40%), with the majority of significant indices occurring in July and August. Paired NDVI datasets from different sampling dates were frequently associated (86.5% for weekly intervals among datasets). Spatial distributions of both stink bugs and boll injury were less stable than those of NDVI, with positive associations varying from 12.5 to 25% for adult stink bugs at weekly intervals, depending on species. Spatial distributions of boll injury from stink bug feeding were more stable than those of stink bugs, with 46% positive associations among paired datasets at weekly intervals. NDVI values were positively associated with boll injury from stink bug feeding in 11 out of 22 analyses, with no significant negative associations. This indicates that NDVI has potential as a component of site-specific management. Future work should continue to examine the value of remote sensing for insect management in cotton, with an aim to develop tools such as risk assessment maps that will help growers to reduce insecticide inputs.

  19. Exploring the potential uptake of distributed energy generation

    International Nuclear Information System (INIS)

    Gardner, John; Ashworth, Peta; Carr-Cornish, Simone

    2007-01-01

Full text: Global warming has been identified as an energy problem (Klare 2007). With a predicted increase in fossil fuel use for many years to come (IEA 2004), there is a need to find a future energy path that will meet our basic requirements for energy but also help to mitigate climate change (CSIRO 2006). Currently there is a range of technological solutions available, each representing a different value proposition. Distributed Energy (DE) is one such solution, which involves the widespread use of small local power generators located close to the end user. Such generators can be powered by a range of low-emission and/or renewable sources. Until now, cheap electricity, existing infrastructure and reluctance to change at both the political and individual levels have meant there has been little prospect of DE being considered in Australia, except in some remote communities. However, with the majority of Australians now rating climate change as an issue of strategic importance to Australia (Ashworth, Pisarski and Littleboy 2006), it can be inferred that Australia's tolerance for generating greenhouse gas emissions has reduced, and that potential support for DE is increasing. It is therefore important to understand what factors might influence the potential adoption of DE. As part of a research project called the Intelligent Grid, CSIRO's Energy Transformed Flagship is aiming to identify the conditions under which Distributed Energy might be effectively implemented in Australia. One component of this project involves social research, which aims to understand the drivers of and barriers to the uptake of DE technology by the community. This paper presents findings from two large-scale surveys (one of householders and one of businesses), designed to assess beliefs and knowledge about environmental issues, and about traditional and renewable energy sources. The surveys also assess current energy use, and identify preferences regarding DE technology.
The

  20. Something From Nothing (There): Collecting Global IPv6 Datasets from DNS

    NARCIS (Netherlands)

    Fiebig, T.; Borgolte, Kevin; Hao, Shuang; Kruegel, Christopher; Vigna, Giovanny; Spring, Neil; Riley, George F.

    2017-01-01

Current large-scale IPv6 studies mostly rely on non-public datasets, as most public datasets are domain specific. For instance, traceroute-based datasets are biased toward network equipment. In this paper, we present a new methodology to collect IPv6 address datasets that does not require access to

  1. SPREAD: a high-resolution daily gridded precipitation dataset for Spain – an extreme events frequency and intensity overview

    Directory of Open Access Journals (Sweden)

    R. Serrano-Notivoli

    2017-09-01

Full Text Available A high-resolution daily gridded precipitation dataset was built from raw data of 12 858 observatories covering a period from 1950 to 2012 in peninsular Spain and 1971 to 2012 in the Balearic and Canary islands. The original data were quality-controlled and gaps were filled for each day and location independently. Using the serially complete dataset, a grid with a 5 × 5 km spatial resolution was constructed by estimating daily precipitation amounts and their corresponding uncertainty at each grid node. Daily precipitation estimations were compared to original observations to assess the quality of the gridded dataset. Four daily precipitation indices were computed to characterise the spatial distribution of daily precipitation, and nine extreme precipitation indices were used to describe the frequency and intensity of extreme precipitation events. The Mediterranean coast and the Central Range showed the highest frequency and intensity of extreme events, while the number of wet days and dry and wet spells followed a north-west to south-east gradient in peninsular Spain, from high to low values in the number of wet days and wet spells, and the reverse in dry spells. The use of the total available data in Spain, the independent estimation of precipitation for each day and the high spatial resolution of the grid allowed for a precise spatial and temporal assessment of daily precipitation that is difficult to achieve when using other methods, pre-selected long-term stations or global gridded datasets. The SPREAD dataset is publicly available at https://doi.org/10.20350/digitalCSIC/7393.
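Estimating a value at each grid node from surrounding station observations, as described above, can be illustrated with simple inverse-distance weighting. This is only a generic stand-in; SPREAD uses its own estimation and uncertainty method:

```python
def idw_estimate(obs, node, power=2):
    """Inverse-distance-weighted estimate at one grid node.

    obs: list of ((x, y), value) station observations for a single day.
    A generic sketch of station-to-grid interpolation, not SPREAD's method.
    """
    num = den = 0.0
    for (x, y), value in obs:
        d2 = (x - node[0]) ** 2 + (y - node[1]) ** 2
        if d2 == 0.0:
            return value  # node coincides with a station
        w = 1.0 / d2 ** (power / 2)  # closer stations weigh more
        num += w * value
        den += w
    return num / den

# Three stations equidistant from a node at the origin; daily totals in mm
stations = [((1.0, 0.0), 10.0), ((0.0, 1.0), 20.0), ((-1.0, 0.0), 30.0)]
print(idw_estimate(stations, (0.0, 0.0)))  # equal weights -> plain mean, 20.0
```

Repeating such an estimate independently for every day and node is what makes the serially complete, 5 × 5 km product possible.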

  2. Automatic processing of multimodal tomography datasets.

    Science.gov (United States)

    Parsons, Aaron D; Price, Stephen W T; Wadeson, Nicola; Basham, Mark; Beale, Andrew M; Ashton, Alun W; Mosselmans, J Frederick W; Quinn, Paul D

    2017-01-01

With the development of fourth-generation high-brightness synchrotrons on the horizon, the already large volume of data collected on imaging and mapping beamlines is set to increase by orders of magnitude. As such, an easy and accessible way of dealing with such large datasets as quickly as possible is required in order to be able to address the core scientific problems during experimental data collection. Savu is an accessible and flexible big data processing framework that is able to deal with both the variety and the volume of multimodal and multidimensional scientific datasets, such as those output by chemical tomography experiments on the I18 microfocus scanning beamline at Diamond Light Source.

  3. GUDM: Automatic Generation of Unified Datasets for Learning and Reasoning in Healthcare.

    Science.gov (United States)

    Ali, Rahman; Siddiqi, Muhammad Hameed; Idris, Muhammad; Ali, Taqdir; Hussain, Shujaat; Huh, Eui-Nam; Kang, Byeong Ho; Lee, Sungyoung

    2015-07-02

A wide array of biomedical data are generated and made available to healthcare experts. However, due to the diverse nature of the data, it is difficult to predict outcomes from them. It is therefore necessary to combine these diverse data sources into a single unified dataset. This paper proposes a global unified data model (GUDM) to provide a global unified data structure for all data sources and generate a unified dataset using a "data modeler" tool. The proposed tool implements a user-centric, priority-based approach which can easily resolve the problems of unified data modeling and overlapping attributes across multiple datasets. The tool is illustrated using sample diabetes mellitus data. The diverse data sources used to generate the unified dataset for diabetes mellitus include clinical trial information, a social media interaction dataset and physical activity data collected using different sensors. To demonstrate the significance of the unified dataset, we adopted a well-known rough set theory based rule creation process to create rules from the unified dataset. The evaluation of the tool on six different sets of locally created diverse datasets shows that the tool, on average, reduces the time efforts of the experts and knowledge engineer by 94.1% while creating unified datasets.
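A priority-based resolution of overlapping attributes, the core idea the abstract describes, can be sketched as a dictionary merge where higher-priority sources overwrite lower-priority ones. The source and attribute names below are hypothetical, not GUDM's actual schema:

```python
def unify(records, priority):
    """Merge per-patient records from multiple sources into one dataset.

    Overlapping attributes are resolved by source priority (earlier in
    `priority` wins), a simplified sketch of a priority-based approach.
    """
    unified = {}
    for source in reversed(priority):  # apply lowest priority first
        for patient_id, attrs in records.get(source, {}).items():
            # later (higher-priority) sources overwrite these values
            unified.setdefault(patient_id, {}).update(attrs)
    return unified

# Hypothetical diabetes-mellitus sources; note the conflicting "age"
records = {
    "clinical_trial": {"p1": {"hba1c": 7.2, "age": 54}},
    "wearable_sensor": {"p1": {"steps": 4200, "age": 55}},
}
dataset = unify(records, priority=["clinical_trial", "wearable_sensor"])
print(dataset["p1"])  # age comes from the higher-priority clinical trial
```

The merged record keeps every non-overlapping attribute while the conflicting one is taken from the trusted source, which is what makes downstream rule creation on a single table possible.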

  4. Testing the efficacy of downscaling in species distribution modelling: a comparison between MaxEnt and Favourability Function models

    Energy Technology Data Exchange (ETDEWEB)

    Olivero, J.; Toxopeus, A.G.; Skidmore, A.K.; Real, R.

    2016-07-01

Statistical downscaling is used to improve the knowledge of spatial distributions from broad-scale to fine-scale maps with higher potential for conservation planning. We assessed the effectiveness of downscaling in two commonly used species distribution models: Maximum Entropy (MaxEnt) and the Favourability Function (FF). We used atlas data (10 x 10 km) on the distribution of the fire salamander Salamandra salamandra in southern Spain to derive models at a 1 x 1 km resolution. The downscaled models were assessed using an independent dataset of the species' distribution at 1 x 1 km. The Favourability model showed better downscaling performance than the MaxEnt model, and models based on linear combinations of environmental variables performed better than models allowing higher flexibility. The Favourability model also minimized overfitting compared to the MaxEnt model. (Author)
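For context, the Favourability Function removes the effect of species prevalence from a logistic-regression probability P. As usually formulated (after Real and colleagues; stated here from memory, so check the original before relying on it), with n1 presences and n0 absences:

```latex
F \;=\; \frac{P / (1 - P)}{\,n_1 / n_0 \;+\; P / (1 - P)\,}
```

Because F = 0.5 wherever P equals the sample prevalence, favourability values are directly comparable across species and datasets with different presence/absence ratios, which is one reason the FF is attractive for downscaling.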

  5. A Research Graph dataset for connecting research data repositories using RD-Switchboard.

    Science.gov (United States)

    Aryani, Amir; Poblet, Marta; Unsworth, Kathryn; Wang, Jingbo; Evans, Ben; Devaraju, Anusuriya; Hausstein, Brigitte; Klas, Claus-Peter; Zapilko, Benjamin; Kaplun, Samuele

    2018-05-29

    This paper describes the open access graph dataset that shows the connections between Dryad, CERN, ANDS and other international data repositories to publications and grants across multiple research data infrastructures. The graph dataset was created using the Research Graph data model and the Research Data Switchboard (RD-Switchboard), a collaborative project by the Research Data Alliance DDRI Working Group (DDRI WG) with the aim to discover and connect the related research datasets based on publication co-authorship or jointly funded grants. The graph dataset allows researchers to trace and follow the paths to understanding a body of work. By mapping the links between research datasets and related resources, the graph dataset improves both their discovery and visibility, while avoiding duplicate efforts in data creation. Ultimately, the linked datasets may spur novel ideas, facilitate reproducibility and re-use in new applications, stimulate combinatorial creativity, and foster collaborations across institutions.

  6. Geographic conditions for distribution of agriculture and potentials for tourism development on Mokra mountain

    Directory of Open Access Journals (Sweden)

    Knežević Marko

    2009-01-01

Full Text Available This work considers the important natural conditions governing the distribution of agricultural production, cattle breeding in particular, and the potential for tourism development on Mokra mountain. Semi-nomadic cattle breeding in mountain settlements was highly developed in the recent past and represented the main source of existence for local highlanders. Today it is neglected and in a phase of dying out. The mountain possesses excellent natural potential for ecological and mountain tourism, but this potential is unused.

  7. The sound of migration: exploring data sonification as a means of interpreting multivariate salmon movement datasets

    Directory of Open Access Journals (Sweden)

    Jens C. Hegg

    2018-02-01

Full Text Available The migration of Pacific salmon is an important part of functioning freshwater ecosystems, but as populations have decreased and ecological conditions have changed, so have migration patterns. Understanding how the environment, and human impacts, change salmon migration behavior requires observing migration at small temporal and spatial scales across large geographic areas. Studying these detailed fish movements is particularly important for one threatened population of Chinook salmon in the Snake River of Idaho, whose juvenile behavior may be rapidly evolving in response to dams and anthropogenic impacts. However, exploring movement datasets of large numbers of salmon can present challenges due to the difficulty of visualizing the multivariate, time-series data. Previous research indicates that sonification, representing data using sound, has the potential to enhance exploration of multivariate, time-series datasets. We developed sonifications of individual fish movements using a large dataset of salmon otolith microchemistry from Snake River fall Chinook salmon. Otoliths, a balance and hearing organ in fish, provide a detailed chemical record of fish movements recorded in the tree-like rings they deposit each day the fish is alive. These data represent a scalable, multivariate dataset of salmon movement that is ideal for sonification. We tested independent listener responses to validate the effectiveness of the sonification tool and mapping methods. The sonifications were presented in a survey in which untrained listeners identified salmon movements for increasing numbers of fish, with and without visualizations. Our results showed that untrained listeners were most sensitive to transitions mapped to pitch and timbre. Accuracy results were non-intuitive; in aggregate, respondents clearly identified important transitions, but individual accuracy was low. This aggregate effect has potential implications for the use of sonification in the context of crowd

  8. Veterans Affairs Suicide Prevention Synthetic Dataset

    Data.gov (United States)

    Department of Veterans Affairs — The VA's Veteran Health Administration, in support of the Open Data Initiative, is providing the Veterans Affairs Suicide Prevention Synthetic Dataset (VASPSD). The...

  9. SAR image classification based on CNN in real and simulation datasets

    Science.gov (United States)

    Peng, Lijiang; Liu, Ming; Liu, Xiaohua; Dong, Liquan; Hui, Mei; Zhao, Yuejin

    2018-04-01

    Convolutional neural networks (CNN) have achieved great success in image classification tasks. Even in the field of synthetic aperture radar automatic target recognition (SAR-ATR), state-of-the-art results have been obtained by learning deep feature representations on the MSTAR benchmark. However, the raw MSTAR data have a shortcoming for training a SAR-ATR model: the backgrounds of the SAR images within each class are highly similar, so a CNN may learn feature hierarchies of the backgrounds as well as of the targets. To assess the influence of the background, additional SAR image datasets were constructed containing simulated SAR images of 10 manufactured targets, such as tanks and fighter aircraft, with backgrounds sampled from the original MSTAR data. These simulation datasets include one in which the backgrounds of each image class correspond to one class of MSTAR target or clutter backgrounds, and one in which each image receives a random background drawn from all MSTAR targets or clutters. In addition, mixed datasets combining MSTAR and the simulated images were created for the experiments. The CNN architecture proposed in this paper was trained on all of the datasets described above. The experimental results show that the architecture achieves high performance on all datasets, even when the image backgrounds are heterogeneous, which indicates that it learns a good representation of the targets despite drastic changes in background.
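
    The background-swapping construction described in this abstract can be sketched as follows. This is an illustrative toy, not the authors' code; the function and variable names are hypothetical, and the real pipeline works on complex SAR chips rather than plain arrays.

```python
import numpy as np

def composite_sar_image(target_chip, clutter_pool, rng=None):
    """Paste a simulated target chip onto a clutter background drawn at
    random from a pool of MSTAR-style clutter patches (illustrative only)."""
    rng = np.random.default_rng(rng)
    # pick a random background patch from the pool and copy it
    background = clutter_pool[rng.integers(len(clutter_pool))].copy()
    h, w = target_chip.shape
    top = (background.shape[0] - h) // 2
    left = (background.shape[1] - w) // 2
    # overwrite the centre of the clutter patch with the target signature
    background[top:top + h, left:left + w] = target_chip
    return background
```

    Restricting `clutter_pool` to one background class per target class, versus sampling from all classes, reproduces the two simulation-dataset variants the abstract describes.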

  10. Really big data: Processing and analysis of large datasets

    Science.gov (United States)

    Modern animal breeding datasets are large and getting larger, due in part to the recent availability of DNA data for many animals. Computational methods for efficiently storing and analyzing those data are under development. The amount of storage space required for such datasets is increasing rapidl...

  11. A robust dataset-agnostic heart disease classifier from Phonocardiogram.

    Science.gov (United States)

    Banerjee, Rohan; Dutta Choudhury, Anirban; Deshpande, Parijat; Bhattacharya, Sakyajit; Pal, Arpan; Mandana, K M

    2017-07-01

    Automatic classification of normal and abnormal heart sounds is a popular area of research. However, building a robust algorithm unaffected by signal quality and patient demography is a challenge. In this paper we analyse a wide set of phonocardiogram (PCG) features in the time and frequency domains, along with morphological and statistical features, to construct a robust and discriminative feature set for dataset-agnostic classification of normal subjects and cardiac patients. The large, open-access database made available in the PhysioNet 2016 challenge was used for feature selection, internal validation, and creation of training models. A second dataset of 41 PCG segments, collected with our in-house smartphone-based digital stethoscope at an Indian hospital, was used for performance evaluation. Our proposed methodology yielded sensitivity and specificity scores of 0.76 and 0.75, respectively, on the test dataset in classifying cardiovascular diseases. The methodology also outperformed three popular prior-art approaches when applied to the same dataset.
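
    Time- and frequency-domain PCG features of the general kind this abstract mentions can be sketched as follows. This is a toy illustration under assumed definitions (energy, zero-crossing rate, dominant frequency); the paper's actual feature set is far richer and is not reproduced here.

```python
import math

def pcg_basic_features(segment, fs):
    """Toy time- and frequency-domain features of a PCG segment:
    signal energy, zero-crossing rate, and dominant frequency (Hz)."""
    n = len(segment)
    mean = sum(segment) / n
    energy = sum((x - mean) ** 2 for x in segment) / n
    # zero-crossing rate: a crude time-domain morphology feature
    zcr = sum(1 for a, b in zip(segment, segment[1:])
              if (a - mean) * (b - mean) < 0) / (n - 1)
    # dominant frequency via a naive DFT magnitude scan (O(n^2), demo only)
    best_k, best_mag = 0, 0.0
    for k in range(1, n // 2):
        re = sum((segment[t] - mean) * math.cos(2 * math.pi * k * t / n)
                 for t in range(n))
        im = sum((segment[t] - mean) * math.sin(2 * math.pi * k * t / n)
                 for t in range(n))
        mag = re * re + im * im
        if mag > best_mag:
            best_k, best_mag = k, mag
    return {"energy": energy, "zcr": zcr, "dominant_hz": best_k * fs / n}
```

    In practice one would compute such features per heart-sound segment and feed them to a classifier alongside morphological and statistical descriptors.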

  12. A Comparative Analysis of Classification Algorithms on Diverse Datasets

    Directory of Open Access Journals (Sweden)

    M. Alghobiri

    2018-04-01

    Full Text Available Data mining involves the computational process of finding patterns in large data sets. Classification, one of the main domains of data mining, involves generalizing a known structure to apply it to a new dataset and predict its class. Various classification algorithms are used to classify data sets; they are based on different methods, such as probability, decision trees, neural networks, nearest neighbors, Boolean and fuzzy logic, and kernel-based techniques. In this paper, we apply three diverse classification algorithms to ten datasets. The datasets were selected based on their size and/or the number and nature of their attributes. Results are discussed using performance evaluation measures such as precision, accuracy, F-measure, Kappa statistic, mean absolute error, relative absolute error, and ROC area. Comparative analysis was carried out using the performance evaluation measures of accuracy, precision, and F-measure. We specify the features and limitations of the classification algorithms for datasets of diverse nature.
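
    Several of the evaluation measures named above are simple functions of the binary confusion counts. A minimal sketch, not tied to any particular toolkit (the function name is illustrative):

```python
def evaluation_measures(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F-measure from binary confusion
    counts (true/false positives and negatives)."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F-measure is the harmonic mean of precision and recall
    f_measure = (2 * precision * recall / (precision + recall)
                 if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_measure": f_measure}

print(evaluation_measures(tp=40, fp=10, fn=10, tn=40))
```

    Kappa, mean absolute error, and ROC area require per-class expected agreement, per-instance errors, and ranked scores respectively, so they are not reducible to these four counts alone.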

  13. Soil moisture datasets at five sites in the central Sierra Nevada and northern Coast Ranges, California

    Science.gov (United States)

    Stern, Michelle A.; Anderson, Frank A.; Flint, Lorraine E.; Flint, Alan L.

    2018-05-03

    In situ soil moisture datasets are important inputs used to calibrate and validate watershed, regional, or statewide modeled and satellite-based soil moisture estimates. The soil moisture dataset presented in this report includes hourly time series of the following: soil temperature, volumetric water content, water potential, and total soil water content. Data were collected by the U.S. Geological Survey at five locations in California: three sites in the central Sierra Nevada and two sites in the northern Coast Ranges. This report provides a description of each of the study areas, procedures and equipment used, processing steps, and time series data from each site in the form of comma-separated values (.csv) tables.

  14. The wildland-urban interface raster dataset of Catalonia

    Directory of Open Access Journals (Sweden)

    Fermín J. Alcasena

    2018-04-01

    Full Text Available We provide the wildland-urban interface (WUI) map of the autonomous community of Catalonia (Northeastern Spain). The map encompasses an area of some 3.21 million ha and is presented as a 150-m resolution raster dataset. Individual housing locations, structure density, and vegetation cover data were used to spatially assess in detail the interface, intermix, and dispersed rural WUI communities with a geographic information system. Most WUI areas are concentrated in the coastal belt, where suburban sprawl has occurred near or within unmanaged forests. This geospatial dataset provides an approximation of the potential for residential housing loss in a wildfire, and represents a valuable contribution to landscape and urban planning in the region. Keywords: Wildland-urban interface, Wildfire risk, Urban planning, Human communities, Catalonia

  15. Strontium removal jar test dataset for all figures and tables.

    Data.gov (United States)

    U.S. Environmental Protection Agency — The datasets where used to generate data to demonstrate strontium removal under various water quality and treatment conditions. This dataset is associated with the...

  16. Development of a SPARK Training Dataset

    Energy Technology Data Exchange (ETDEWEB)

    Sayre, Amanda M. [Pacific Northwest National Lab. (PNNL), Richland, WA (United States); Olson, Jarrod R. [Pacific Northwest National Lab. (PNNL), Richland, WA (United States)

    2015-03-01

    In its first five years, the National Nuclear Security Administration’s (NNSA) Next Generation Safeguards Initiative (NGSI) sponsored more than 400 undergraduate, graduate, and post-doctoral students in internships and research positions (Wyse 2012). Over the past seven years, the NGSI program has produced, and continues to produce, a large body of scientific, technical, and policy work in targeted core safeguards capabilities and human capital development activities. The NGSI program carries out activities not only across multiple disciplines but also across all U.S. Department of Energy (DOE)/NNSA locations in the United States. However, products are not readily shared among disciplines and across locations, nor are they archived in a comprehensive library. Rather, knowledge of NGSI-produced literature is localized to the researchers, clients, and internal laboratory/facility publication systems such as the Electronic Records and Information Capture Architecture (ERICA) at the Pacific Northwest National Laboratory (PNNL). There is also no integrated way of analyzing existing NGSI literature to determine whether the larger NGSI program is achieving its core safeguards capabilities and activities. A complete library of NGSI literature could prove beneficial to a cohesive, sustainable, and more economical NGSI program. The Safeguards Platform for Automated Retrieval of Knowledge (SPARK) has been developed as a knowledge storage, retrieval, and analysis capability to capture safeguards knowledge so that it exists beyond the lifespan of NGSI. During the development process, it was necessary to build a SPARK training dataset (a corpus of documents) for initial entry into the system and for demonstration purposes. We manipulated these data to gain new information about the breadth of NGSI publications, and evaluated the science-policy interface at PNNL as a practical demonstration of SPARK’s intended analysis capability. The analysis demonstration sought to answer the

  17. A survey and experimental comparison of distributed SPARQL engines for very large RDF data

    KAUST Repository

    Abdelaziz, Ibrahim; Harbi, Razen; Khayyat, Zuhair; Kalnis, Panos

    2017-01-01

    Distributed SPARQL engines promise to support very large RDF datasets by utilizing shared-nothing computer clusters. Some are based on distributed frameworks such as MapReduce; others implement proprietary distributed processing; and some rely on expensive preprocessing for data partitioning. These systems exhibit a variety of trade-offs that are not well-understood, due to the lack of any comprehensive quantitative and qualitative evaluation. In this paper, we present a survey of 22 state-of-the-art systems that cover the entire spectrum of distributed RDF data processing and categorize them by several characteristics. Then, we select 12 representative systems and perform extensive experimental evaluation with respect to preprocessing cost, query performance, scalability and workload adaptability, using a variety of synthetic and real large datasets with up to 4.3 billion triples. Our results provide valuable insights for practitioners to understand the trade-offs for their usage scenarios. Finally, we publish online our evaluation framework, including all datasets and workloads, for researchers to compare their novel systems against the existing ones.

  19. SIAM 2007 Text Mining Competition dataset

    Data.gov (United States)

    National Aeronautics and Space Administration — Subject Area: Text Mining Description: This is the dataset used for the SIAM 2007 Text Mining competition. This competition focused on developing text mining...

  20. Environmental Dataset Gateway (EDG) REST Interface

    Data.gov (United States)

    U.S. Environmental Protection Agency — Use the Environmental Dataset Gateway (EDG) to find and access EPA's environmental resources. Many options are available for easily reusing EDG content in other...

  1. Redband Trout Distribution, Pacific Northwest (updated March, 2006)

    Data.gov (United States)

    Pacific States Marine Fisheries Commission — This dataset is a record of fish distribution and activity for REDBAND TROUT contained in the StreamNet database. This feature class was created based on linear...

  2. White Sturgeon Distribution, Pacific Northwest (updated March, 2006)

    Data.gov (United States)

    Pacific States Marine Fisheries Commission — This dataset is a record of fish distribution and activity for WHITE STURGEON contained in the StreamNet database. This feature class was created based on linear...

  3. Rainbow Trout Distribution, Pacific Northwest (updated March, 2006)

    Data.gov (United States)

    Pacific States Marine Fisheries Commission — This dataset is a record of fish distribution and activity for RAINBOW TROUT contained in the StreamNet database. This feature class was created based on linear...

  4. Winter Steelhead Distribution, Pacific Northwest (updated March, 2006)

    Data.gov (United States)

    Pacific States Marine Fisheries Commission — This dataset is a record of fish distribution and activity for WINTER STEELHEAD contained in the StreamNet database. This feature class was created based on linear...

  5. Potential distribution of Podocnemis lewyana (Reptilia: Podocnemididae) and its possible fluctuation under different global climate change scenarios

    International Nuclear Information System (INIS)

    Ortiz Yusty, Carlos; Restrepo, Adriana; Paez, Vivian P

    2014-01-01

    We implemented a species distribution modelling approach to establish the potential distribution of Podocnemis lewyana, to explore the climatic factors that may influence the species' distribution and to evaluate possible changes in distribution under future climate scenarios. The distribution models predicted a continuous distribution from south to north along the Magdalena River, from Rivera and Palermo in the Department of Huila to the departments of Atlantico and Magdalena in the north. Temperature was the variable most influential in the distribution of P. lewyana; this species tends to be present in warm regions with low temperature variability. The distribution model predicted an increase in the geographic range of P. lewyana under climate change scenarios. However, taking into account the habitat preferences of this species and its strong association with water, this result should be treated with caution since the model considered only terrestrial climatic variables. Given the life history characteristics of this species (temperature dependent sex determination, high pivotal temperature and a very narrow transition range) and the negative effect of changes in hydrological regimes on embryo survival, expansion of the potential distribution of P. lewyana in the future does not mean that the species will not be affected by global climate change.

  6. Predicting the potential distribution of the amphibian pathogen Batrachochytrium dendrobatidis in East and Southeast Asia.

    Science.gov (United States)

    Moriguchi, Sachiko; Tominaga, Atsushi; Irwin, Kelly J; Freake, Michael J; Suzuki, Kazutaka; Goka, Koichi

    2015-04-08

    Batrachochytrium dendrobatidis (Bd) is the pathogen responsible for chytridiomycosis, a disease associated with a worldwide amphibian population decline. In this study, we predicted the potential distribution of Bd in East and Southeast Asia based on limited occurrence data. Our goal was to design an effective survey area on which efforts to detect the pathogen can be focused. We generated ecological niche models using the maximum-entropy approach, with alleviation of multicollinearity and spatial autocorrelation. We applied eigenvector-based spatial filters as independent variables, in addition to environmental variables, to resolve spatial autocorrelation, and compared the model's accuracy and degree of spatial autocorrelation with those of a model estimated using only environmental variables. We were able to accurately identify areas of high suitability for Bd. Among the environmental variables, factors related to temperature and precipitation were more effective in predicting the potential distribution of Bd than factors related to land use and cover type. Our study successfully predicted the potential distribution of Bd in East and Southeast Asia. This information should now be used to prioritize survey areas and generate a surveillance program to detect the pathogen.

  7. The impacts of racial group membership on people's distributive justice: an event-related potential study.

    Science.gov (United States)

    Wang, Yan; Tang, Yi-Yuan; Deng, Yuqin

    2014-04-16

    How individuals and societies distribute benefits has long been studied by psychologists and sociologists. Previous work has highlighted the importance of social identity for people's justice concerns. However, it is not entirely clear how racial in-group/out-group relationships affect brain activity during distributive justice decisions. In this study, event-related potentials were recorded while participants made decisions about donation allocation. Behavioral results showed that the racial in-group factor affected participants' justice considerations. Participants were more likely to make relatively equitable decisions when the racial in-group factor was congruent with equity than in the corresponding incongruent condition; moreover, the incongruent condition yielded longer response times than the congruent condition. Meanwhile, fewer equitable decisions were made when efficiency favored the option opposed to equity than when it was equal between the two options. Scalp event-related potential analyses revealed that greater P300 and late positive potential amplitudes were elicited by the incongruent condition compared with the congruent condition. These findings suggest that decision-making in distributive justice can be modulated by racial group membership, and that greater attentional resources or cognitive effort are required when the racial in-group factor and equity conflict with each other.

  8. Age, Gender, and Fine-Grained Ethnicity Prediction using Convolutional Neural Networks for the East Asian Face Dataset

    Energy Technology Data Exchange (ETDEWEB)

    Srinivas, Nisha [ORNL; Rose, Derek C [ORNL; Bolme, David S [ORNL; Mahalingam, Gayathri [ORNL; Atwal, Harleen [ORNL; Ricanek, Karl [ORNL

    2017-01-01

    This paper examines the difficulty associated with performing machine-based automatic demographic prediction on a sub-population of Asian faces. We introduce the Wild East Asian Face Dataset (WEAFD), a new and unique dataset, to the research community. This dataset consists primarily of labeled face images of individuals from East Asian countries, including Vietnam, Burma, Thailand, China, Korea, Japan, Indonesia, and Malaysia. East Asian Turk annotators were exclusively used to judge the age and fine-grained ethnicity attributes, to reduce the impact of the other-race effect and improve the quality of annotations. We focus on predicting the age, gender, and fine-grained ethnicity of an individual by providing baseline results with a convolutional neural network (CNN). Fine-grained ethnicity prediction refers to predicting the ethnicity of an individual by country or sub-region (Chinese, Japanese, Korean, etc.) of the East Asian continent. Performance for two CNN architectures is presented, highlighting the difficulty of these tasks and showcasing potential design considerations that ease network optimization by promoting region-based feature extraction.

  9. STAMMEX high resolution gridded daily precipitation dataset over Germany: a new potential for regional precipitation climate research

    Science.gov (United States)

    Zolina, Olga; Simmer, Clemens; Kapala, Alice; Mächel, Hermann; Gulev, Sergey; Groisman, Pavel

    2014-05-01

    We present new high-resolution daily precipitation grids developed at the Meteorological Institute of the University of Bonn and the German Weather Service (DWD) under the STAMMEX project (Spatial and Temporal Scales and Mechanisms of Extreme Precipitation Events over Central Europe). The daily precipitation grids were developed from the daily-observing precipitation network of the DWD, which runs one of the world's densest rain gauge networks, comprising more than 7500 stations. Several quality-controlled daily gridded products with homogenized sampling were developed, covering the periods from 1931 onwards (at 0.5 degree resolution), from 1951 onwards (0.25 and 0.5 degree), and 1971-2000 (0.1 degree). Different methods were tested to select the gridding methodology that minimizes the errors of integral grid estimates over hilly terrain. Besides daily precipitation values with uncertainty estimates (which include standard kriging uncertainty estimates as well as error estimates derived by a bootstrapping algorithm), the STAMMEX datasets include a variety of statistics that characterize the temporal and spatial dynamics of the precipitation distribution (quantiles, extremes, wet/dry spells, etc.). Comparisons with existing continental-scale daily precipitation grids (e.g., CRU, ECA E-OBS, GCOS), which include considerably fewer observations than STAMMEX, demonstrate the added value of high-resolution grids for extreme rainfall analyses. These data exhibit spatial variability patterns and trends in precipitation extremes that are missed or incorrectly reproduced over Central Europe by coarser-resolution grids based on sparser networks. The STAMMEX dataset can be used for high-quality climate diagnostics of precipitation variability, as a reference for reanalyses and remotely-sensed precipitation products (including the upcoming Global Precipitation Mission products), and as input for regional climate and operational weather forecast models. 
We will present
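
    A bootstrap error estimate of the kind this abstract attributes to the gridded products can be sketched as follows. This is a deliberately simplified toy, assuming plain resampling of the gauge values contributing to one grid cell; the actual STAMMEX products combine such estimates with kriging-based uncertainty, which is not shown here.

```python
import random
import statistics

def bootstrap_grid_uncertainty(gauge_values, n_boot=1000, seed=42):
    """Bootstrap the areal-mean precipitation of one grid cell by resampling
    the contributing rain-gauge values with replacement. Returns the mean of
    the bootstrap means and their standard deviation (the error estimate)."""
    rng = random.Random(seed)
    means = [statistics.mean(rng.choices(gauge_values, k=len(gauge_values)))
             for _ in range(n_boot)]
    return statistics.mean(means), statistics.stdev(means)
```

    The spread of the resampled means shrinks as the number of contributing gauges grows, which is why error estimates matter most for sparsely instrumented cells.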

  10. Distributed Data Management Service for VPH Applications

    NARCIS (Netherlands)

    Koulouzis, S.; Belloum, A.; Bubak, M.; Lamata, P.; Nolte, D.; Vasyunin, D.; de Laat, C.

    2016-01-01

    For many medical applications, it's challenging to access large datasets, which are often hosted across different domains on heterogeneous infrastructures. Homogenizing the infrastructure to simplify data access is unrealistic; therefore, it's important to develop distributed storage that doesn't

  11. The case for developing publicly-accessible datasets for health services research in the Middle East and North Africa (MENA) region

    Directory of Open Access Journals (Sweden)

    El-Jardali Fadi

    2009-10-01

    Full Text Available Abstract Background The existence of publicly-accessible datasets represents a significant opportunity for health services research to evolve into a science that supports health policy making and evaluation, proper inter- and intra-organizational decisions, and optimal clinical interventions. This paper investigated the role of publicly-accessible datasets in the enhancement of health care systems in the developed world and highlighted the importance of their wide availability and use in the Middle East and North Africa (MENA) region. Discussion A search was conducted to explore the availability of publicly-accessible datasets in the MENA region. Although datasets were found in most countries of the region, they were limited in relevance, quality, and public accessibility. With rare exceptions, publicly-accessible datasets as present in the developed world were absent. Based on this, we proposed a gradual approach and a set of recommendations to promote the development and use of publicly-accessible datasets in the region. These recommendations target potential actions by governments, researchers, policy makers, and international organizations. Summary We argue that the limited number of publicly-accessible datasets in the MENA region represents a lost opportunity for the evidence-based advancement of health systems in the region. The availability and use of publicly-accessible datasets would encourage policy makers in the region to base their decisions on solid representative data rather than on estimates or small-scale studies; researchers would be able to exercise their expertise in a manner meaningful to both policy makers and the public. The populations of the MENA countries would exercise the right to benefit from locally- or regionally-based studies, rather than imported and, at best, customized ones. Furthermore, on a macro scale, the availability of regionally comparable publicly-accessible datasets would allow for the

  12. Brown Trout Distribution, Pacific Northwest (updated March, 2006)

    Data.gov (United States)

    Pacific States Marine Fisheries Commission — This dataset is a record of fish distribution and activity for BROWN TROUT contained in the StreamNet database. This feature class was created based on linear event...

  13. Brook Trout Distribution, Pacific Northwest (updated March, 2006)

    Data.gov (United States)

    Pacific States Marine Fisheries Commission — This dataset is a record of fish distribution and activity for BROOK TROUT contained in the StreamNet database. This feature class was created based on linear event...

  14. Chum Salmon Distribution, Pacific Northwest (updated March, 2006)

    Data.gov (United States)

    Pacific States Marine Fisheries Commission — This dataset is a record of fish distribution and activity for CHUM SALMON contained in the StreamNet database. This feature class was created based on linear event...

  15. Coho Salmon Distribution, Pacific Northwest (updated March, 2006)

    Data.gov (United States)

    Pacific States Marine Fisheries Commission — This dataset is a record of fish distribution and activity for COHO SALMON contained in the StreamNet database. This feature class was created based on linear event...

  16. Pink Salmon Distribution, Pacific Northwest (updated March, 2006)

    Data.gov (United States)

    Pacific States Marine Fisheries Commission — This dataset is a record of fish distribution and activity for PINK SALMON contained in the StreamNet database. This feature class was created based on linear event...

  17. Fall Chinook Distribution, Pacific Northwest (updated March, 2006)

    Data.gov (United States)

    Pacific States Marine Fisheries Commission — This dataset is a record of fish distribution and activity for FALL CHINOOK contained in the StreamNet database. This feature class was created based on linear event...

  18. A curated compendium of monocyte transcriptome datasets of relevance to human monocyte immunobiology research [version 2; referees: 2 approved]

    Directory of Open Access Journals (Sweden)

    Darawan Rinchai

    2016-04-01

    Full Text Available Systems-scale profiling approaches have become widely used in translational research settings. The resulting accumulation of large-scale datasets in public repositories represents a critical opportunity to promote insight and foster knowledge discovery. However, resources that can serve as an interface between biomedical researchers and such vast and heterogeneous dataset collections are needed in order to fulfill this potential. Recently, we developed an interactive data browsing and visualization web application, the Gene Expression Browser (GXB). This tool can be used to overlay deep molecular phenotyping data with rich contextual information about analytes, samples, and studies, along with ancillary clinical or immunological profiling data. In this note, we describe a curated compendium of 93 public datasets generated in the context of human monocyte immunological studies, representing a total of 4,516 transcriptome profiles. Datasets were uploaded to an instance of GXB along with study descriptions and sample annotations. Study samples were arranged in different groups. Ranked gene lists were generated based on relevant group comparisons. This resource is publicly available online at http://monocyte.gxbsidra.org/dm3/landing.gsp.

  19. A collection of annotated and harmonized human breast cancer transcriptome datasets, including immunologic classification [version 2; referees: 2 approved]

    Directory of Open Access Journals (Sweden)

    Jessica Roelands

    2018-02-01

    Full Text Available The increased application of high-throughput approaches in translational research has expanded the number of publicly available data repositories. Gathering the additional valuable information contained in these datasets represents a crucial opportunity in the biomedical field. To facilitate and stimulate utilization of these datasets, we recently developed an interactive data browsing and visualization web application, the Gene Expression Browser (GXB). In this note, we describe a curated compendium of 13 public datasets on human breast cancer, representing a total of 2142 transcriptome profiles. We classified the samples according to different immune-based classification systems and integrated this information into the datasets. Annotated and harmonized datasets were uploaded to GXB. Study samples were categorized into different groups based on their immunologic tumor response profiles, intrinsic molecular subtypes, and multiple clinical parameters. Ranked gene lists were generated based on relevant group comparisons. In this data note, we demonstrate the utility of GXB to evaluate the expression of a gene of interest, find differential gene expression between groups, and investigate potential associations between variables, with a specific focus on immunologic classification in breast cancer. This interactive resource is publicly available online at: http://breastcancer.gxbsidra.org/dm3/geneBrowser/list.

  20. Distribution functions to estimate radionuclide solid-liquid distribution coefficients in soils: the case of Cs

    Energy Technology Data Exchange (ETDEWEB)

    Ramirez-Guinart, Oriol; Rigol, Anna; Vidal, Miquel [Analytical Chemistry department, Faculty of Chemistry, University of Barcelona, Mart i Franques 1-11, 08028, Barcelona (Spain)

    2014-07-01

    In the frame of the revision of IAEA TRS 364 (Handbook of parameter values for the prediction of radionuclide transfer in temperate environments), a database of radionuclide solid-liquid distribution coefficients (K{sub d}) in soils was compiled with data from field and laboratory experiments, drawn from references mostly from 1990 onwards, including reports, peer-reviewed papers, and grey literature. The K{sub d} values were grouped for each radionuclide according to two criteria. The first was based on the sand and clay mineral percentages referred to the mineral matter, and on the organic matter (OM) content in the soil; this defined the 'texture/OM' criterion. The second was to group soils by the specific soil factors governing the radionuclide-soil interaction (the 'cofactor' criterion); the cofactors depend on the radionuclide considered. An advantage of using cofactors was that the variability of the K{sub d} ranges for a given soil group decreased considerably compared with that observed when the classification was based solely on sand, clay, and organic matter contents. The K{sub d} best estimates were defined as the calculated geometric mean (GM) values, assuming that K{sub d} values are always log-normally distributed. Risk assessment models may require as input for a given parameter either a single value (a best estimate) or a continuous function from which not only individual best estimates but also confidence ranges and data variability can be derived. In the case of the K{sub d} parameter, a suitable continuous function containing the statistical parameters (e.g. arithmetic/geometric mean, arithmetic/geometric standard deviation, mode, etc.) that best describe the distribution of the K{sub d} values of a dataset is the cumulative distribution function (CDF). To our knowledge, appropriate CDFs have not yet been proposed for radionuclide K{sub d} in soils. Therefore, the aim of this work is to create CDFs for
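
    Under the log-normal assumption stated above, the best estimate is the geometric mean and the CDF is fully determined by the geometric mean and the geometric standard deviation. A minimal sketch of that relationship (illustrative only, not the authors' implementation; function names are hypothetical):

```python
import math

def lognormal_kd_summary(kd_values):
    """Geometric mean and geometric standard deviation of a Kd dataset,
    assuming the values are log-normally distributed (best estimate = GM)."""
    logs = [math.log(v) for v in kd_values]
    mu = sum(logs) / len(logs)
    sigma = math.sqrt(sum((x - mu) ** 2 for x in logs) / (len(logs) - 1))
    return math.exp(mu), math.exp(sigma)  # (GM, GSD)

def lognormal_cdf(x, gm, gsd):
    """P(Kd <= x) for a log-normal distribution with geometric mean gm
    and geometric standard deviation gsd."""
    z = (math.log(x) - math.log(gm)) / math.log(gsd)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
```

    By construction the CDF evaluates to 0.5 at the geometric mean, and confidence ranges follow directly, e.g. GM/GSD**2 to GM*GSD**2 covers roughly 95% of the distribution.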

  1. Harvard Aging Brain Study: Dataset and accessibility.

    Science.gov (United States)

    Dagley, Alexander; LaPoint, Molly; Huijbers, Willem; Hedden, Trey; McLaren, Donald G; Chatwal, Jasmeer P; Papp, Kathryn V; Amariglio, Rebecca E; Blacker, Deborah; Rentz, Dorene M; Johnson, Keith A; Sperling, Reisa A; Schultz, Aaron P

    2017-01-01

    The Harvard Aging Brain Study is sharing its data with the global research community. The longitudinal dataset consists of a 284-subject cohort with the following modalities acquired: demographics, clinical assessment, comprehensive neuropsychological testing, clinical biomarkers, and neuroimaging. To promote more extensive analyses, imaging data was designed to be compatible with other publicly available datasets. A cloud-based system enables access to interested researchers with blinded data available contingent upon completion of a data usage agreement and administrative approval. Data collection is ongoing and currently in its fifth year. Copyright © 2015 Elsevier Inc. All rights reserved.

  2. Querying Large Biological Network Datasets

    Science.gov (United States)

    Gulsoy, Gunhan

    2013-01-01

    New experimental methods have resulted in an increasing amount of genetic interaction data being generated every day. Biological networks are used to store the genetic interaction data gathered. The increasing amount of available data requires fast, large-scale analysis methods. We therefore address the problem of querying large biological network datasets.…

  3. BanglaLekha-Isolated: A multi-purpose comprehensive dataset of Handwritten Bangla Isolated characters

    Directory of Open Access Journals (Sweden)

    Mithun Biswas

    2017-06-01

    Full Text Available BanglaLekha-Isolated, a Bangla handwritten isolated character dataset, is presented in this article. This dataset contains 84 different characters comprising 50 Bangla basic characters, 10 Bangla numerals and 24 selected compound characters. 2000 handwriting samples for each of the 84 characters were collected, digitized and pre-processed. After discarding mistakes and scribbles, 166,105 handwritten character images were included in the final dataset. The dataset also includes labels indicating the age and the gender of the subjects from whom the samples were collected. This dataset could be used not only for optical handwriting recognition research but also to explore the influence of gender and age on handwriting. The dataset is publicly available at https://data.mendeley.com/datasets/hf6sf8zrkc/2.

  4. An open, multi-vendor, multi-field-strength brain MR dataset and analysis of publicly available skull stripping methods agreement.

    Science.gov (United States)

    Souza, Roberto; Lucena, Oeslle; Garrafa, Julia; Gobbi, David; Saluzzi, Marina; Appenzeller, Simone; Rittner, Letícia; Frayne, Richard; Lotufo, Roberto

    2018-04-15

    This paper presents an open, multi-vendor, multi-field strength magnetic resonance (MR) T1-weighted volumetric brain imaging dataset, named Calgary-Campinas-359 (CC-359). The dataset is composed of images of older healthy adults (29-80 years) acquired on scanners from three vendors (Siemens, Philips and General Electric) at both 1.5 T and 3 T. CC-359 is comprised of 359 datasets, approximately 60 subjects per vendor and magnetic field strength. The dataset is approximately age and gender balanced, subject to the constraints of the available images. It provides consensus brain extraction masks for all volumes generated using supervised classification. Manual segmentation results for twelve randomly selected subjects performed by an expert are also provided. The CC-359 dataset allows investigation of 1) the influences of both vendor and magnetic field strength on quantitative analysis of brain MR; 2) parameter optimization for automatic segmentation methods; and potentially 3) machine learning classifiers with big data, specifically those based on deep learning methods, as these approaches require a large amount of data. To illustrate the utility of this dataset, we compared the results of eight publicly available skull stripping methods and one publicly available consensus algorithm against the results of a supervised classifier. A linear mixed effects model analysis indicated that both vendor and field strength (p-value < 0.001) have statistically significant impacts on skull stripping results. Copyright © 2017 Elsevier Inc. All rights reserved.

  5. Efficient and Flexible Climate Analysis with Python in a Cloud-Based Distributed Computing Framework

    Science.gov (United States)

    Gannon, C.

    2017-12-01

    As climate models become progressively more advanced, and spatial resolution further improved through various downscaling projects, climate projections at a local level are increasingly insightful and valuable. However, the raw size of climate datasets presents numerous hurdles for analysts wishing to develop customized climate risk metrics or perform site-specific statistical analysis. Four Twenty Seven, a climate risk consultancy, has implemented a Python-based distributed framework to analyze large climate datasets in the cloud. With the freedom afforded by efficiently processing these datasets, we are able to customize and continually develop new climate risk metrics using the most up-to-date data. Here we outline our process for using Python packages such as XArray and Dask to evaluate netCDF files in a distributed framework, StarCluster to operate in a cluster-computing environment, cloud computing services to access publicly hosted datasets, and how this setup is particularly valuable for generating climate change indicators and performing localized statistical analysis.
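
    The workflow described (XArray plus Dask over netCDF archives) can be sketched as follows. The synthetic DataArray stands in for files that would normally be opened lazily with `xr.open_mfdataset(..., chunks=...)`, and the hot-day count is an illustrative indicator, not one of the consultancy's actual metrics.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Build a small synthetic daily maximum-temperature field (a stand-in for
# netCDF files that would normally be opened with xr.open_mfdataset).
time = pd.date_range("2000-01-01", periods=365, freq="D")
lat = np.linspace(30.0, 40.0, 5)
lon = np.linspace(-120.0, -110.0, 5)
rng = np.random.default_rng(0)
tmax = xr.DataArray(
    15.0 + 15.0 * np.sin(2 * np.pi * np.arange(365) / 365.0)[:, None, None]
    + rng.normal(0, 3, size=(365, 5, 5)),
    coords={"time": time, "lat": lat, "lon": lon},
    dims=("time", "lat", "lon"),
    name="tmax",
)

# Chunk along time so the computation is lazy and parallelised by dask.
tmax = tmax.chunk({"time": 90})

# Example climate-risk indicator: annual count of days above 30 degC per cell.
hot_days = (tmax > 30.0).groupby("time.year").sum("time").compute()
```

    The same pattern scales to multi-decade downscaled archives: only the open step changes (many files, bigger chunks), while the indicator logic stays identical.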

  6. Glycogen distribution in adult and geriatric mice brains

    KAUST Repository

    Alrabeh, Rana

    2017-05-01

    Astrocytes, the most abundant glial cell type in the brain, perform a number of roles in brain physiology; among them, the energetic support of neurons is the best characterized. Contained within astrocytes is the brain’s obligate energy store, glycogen. Through glycogenolysis, glycogen, a storage form of glucose, is converted to pyruvate that is further reduced to lactate and transferred to neurons as an energy source via MCTs. Glycogen is a multi-branched polysaccharide synthesized from the glucose taken up by astrocytes. It has been shown that glycogen accumulates with age and contributes to the physiological ageing process in the brain. In this study, we compared glycogen distribution between young adult and geriatric mice to understand the energy consumption of synaptic terminals during ageing using computational tools. We segmented and densely reconstructed neuropil and glycogen granules within six volumes (three from 4-month-old and three from 24-month-old mice) of Layer 1 somatosensory cortex from FIB-SEM stacks, using a combination of semi-automated and manual tools, ilastik and TrakEM2. Finally, the 3D visualization software, Blender, was used to analyze the dataset using the DBSCAN and KDTree nearest-neighbor algorithms to study the distribution of glycogen granules relative to synapses, using a plugin that was developed for this purpose. The nearest-neighbor and clustering results of the 6 datasets show that glycogen clusters around excitatory synapses more than inhibitory synapses and that, in general, glycogen is found around axonal boutons more than dendritic spines. There was no significant accumulation of glycogen with ageing within our admittedly small dataset. However, there was a homogenization of glycogen distribution with age, which is consistent with published literature. We conclude that glycogen distribution in the brain is not a random process but follows a functional distribution.
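
    The nearest-neighbour part of such an analysis can be sketched with a KD-tree. The coordinates below are synthetic stand-ins for granule and synapse positions; this is a generic illustration, not the Blender plugin developed for the study.

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(42)

# Hypothetical 3D coordinates (nm) -- stand-ins for points extracted from
# reconstructed FIB-SEM volumes.
glycogen = rng.uniform(0, 1000, size=(500, 3))   # glycogen granule centroids
excitatory = rng.uniform(0, 1000, size=(40, 3))  # excitatory synapse positions
inhibitory = rng.uniform(0, 1000, size=(10, 3))  # inhibitory synapse positions

# For each granule, distance to the nearest synapse of each type.
d_exc, _ = cKDTree(excitatory).query(glycogen)
d_inh, _ = cKDTree(inhibitory).query(glycogen)

# Smaller median nearest-neighbour distance suggests glycogen tends to
# cluster nearer that synapse type (on real data this would be tested
# against a randomized null distribution).
median_exc = float(np.median(d_exc))
median_inh = float(np.median(d_inh))
```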

  7. A dataset of human decision-making in teamwork management

    Science.gov (United States)

    Yu, Han; Shen, Zhiqi; Miao, Chunyan; Leung, Cyril; Chen, Yiqiang; Fauvel, Simon; Lin, Jun; Cui, Lizhen; Pan, Zhengxiang; Yang, Qiang

    2017-01-01

    Today, most endeavours require teamwork by people with diverse skills and characteristics. In managing teamwork, decisions are often made under uncertainty and resource constraints. The strategies and the effectiveness of the strategies different people adopt to manage teamwork under different situations have not yet been fully explored, partially due to a lack of detailed large-scale data. In this paper, we describe a multi-faceted large-scale dataset to bridge this gap. It is derived from a game simulating complex project management processes. It presents the participants with different conditions in terms of team members' capabilities and task characteristics for them to exhibit their decision-making strategies. The dataset contains detailed data reflecting the decision situations, decision strategies, decision outcomes, and the emotional responses of 1,144 participants from diverse backgrounds. To our knowledge, this is the first dataset simultaneously covering these four facets of decision-making. With repeated measurements, the dataset may help establish baseline variability of decision-making in teamwork management, leading to more realistic decision theoretic models and more effective decision support approaches.

  8. Segmental distribution of some common molecular markers for colorectal cancer (CRC): influencing factors and potential implications.

    Science.gov (United States)

    Papagiorgis, Petros Christakis

    2016-05-01

    Proximal and distal colorectal cancers (CRCs) are regarded as distinct disease entities, evolving through different genetic pathways and showing multiple clinicopathological and molecular differences. Segmental distribution of some common markers (e.g., KRAS, EGFR, Ki-67, Bcl-2, COX-2) is clinically important, potentially affecting their prognostic or predictive value. However, this distribution is influenced by a variety of factors such as the anatomical overlap of tumorigenic molecular events, associations of some markers with other clinicopathological features (stage and/or grade), and wide methodological variability in markers' assessment. All these factors represent principal influences followed by intratumoral heterogeneity and geographic variation in the frequency of detection of particular markers, whereas the role of other potential influences (e.g., pre-adjuvant treatment, interaction between markers) remains rather unclear. Better understanding and elucidation of the various influences may provide a more accurate picture of the segmental distribution of molecular markers in CRC, potentially allowing the application of a novel patient stratification for treatment, based on particular molecular profiles in combination with tumor location.

  9. EVALUATION OF LAND USE/LAND COVER DATASETS FOR URBAN WATERSHED MODELING

    International Nuclear Information System (INIS)

    S.J. BURIAN; M.J. BROWN; T.N. MCPHERSON

    2001-01-01

    Land use/land cover (LULC) data are a vital component for nonpoint source pollution modeling. Most watershed hydrology and pollutant loading models use, in some capacity, LULC information to generate runoff and pollutant loading estimates. Simple equation methods predict runoff and pollutant loads using runoff coefficients or pollutant export coefficients that are often correlated to LULC type. Complex models use input variables and parameters to represent watershed characteristics and pollutant buildup and washoff rates as a function of LULC type. Whether using simple or complex models, an accurate LULC dataset with an appropriate spatial resolution and level of detail is paramount for reliable predictions. The study presented in this paper compared and evaluated several LULC dataset sources for application in urban environmental modeling. The commonly used USGS LULC datasets have coarser spatial resolution and lower levels of classification than other LULC datasets. In addition, the USGS datasets do not accurately represent the land use in areas that have undergone significant land use change during the past two decades. We performed a watershed modeling analysis of three urban catchments in Los Angeles, California, USA to investigate the relative difference in average annual runoff volumes and total suspended solids (TSS) loads when using the USGS LULC dataset versus using a more detailed and current LULC dataset. When the two LULC datasets were aggregated to the same land use categories, the relative differences in predicted average annual runoff volumes and TSS loads from the three catchments were 8 to 14% and 13 to 40%, respectively. The relative differences did not have a predictable relationship with catchment size
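
    The "simple equation method" mentioned above amounts to a runoff-coefficient calculation summed over LULC classes. The coefficients, areas, and rainfall below are hypothetical values for illustration, not figures from the Los Angeles study.

```python
# Illustrative runoff coefficients by LULC class (dimensionless) -- hypothetical.
runoff_coeff = {"residential": 0.40, "commercial": 0.85, "open_space": 0.15}

# Catchment areas by LULC class (hectares) and annual rainfall depth (mm).
areas_ha = {"residential": 120.0, "commercial": 35.0, "open_space": 45.0}
rainfall_mm = 380.0  # e.g. a semi-arid annual total

def annual_runoff_m3(coeffs, areas, rain_mm):
    """Simple-method runoff volume: V = sum_i C_i * P * A_i."""
    total = 0.0
    for lulc, area in areas.items():
        # mm depth over ha -> m3: 1 mm over 1 ha = 10 m3
        total += coeffs[lulc] * rain_mm * area * 10.0
    return total

volume = annual_runoff_m3(runoff_coeff, areas_ha, rainfall_mm)
```

    Because the coefficients are keyed by LULC class, swapping in a coarser or finer LULC dataset changes the class areas and hence the predicted volume, which is exactly the sensitivity the study quantifies.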

  10. Sharing Video Datasets in Design Research

    DEFF Research Database (Denmark)

    Christensen, Bo; Abildgaard, Sille Julie Jøhnk

    2017-01-01

    This paper examines how design researchers, design practitioners and design education can benefit from sharing a dataset. We present the Design Thinking Research Symposium 11 (DTRS11) as an exemplary project that involved sharing video data of design processes and design activity in natural settings...... with a large group of fellow academics from the international community of Design Thinking Research, for the purpose of facilitating research collaboration and communication within the field of Design and Design Thinking. This approach emphasizes the social and collaborative aspects of design research, where...... a multitude of appropriate perspectives and methods may be utilized in analyzing and discussing the singular dataset. The shared data is, from this perspective, understood as a design object in itself, which facilitates new ways of working, collaborating, studying, learning and educating within the expanding...

  11. Using NASA Satellite Aerosol Optical Depth to Enhance PM2.5 Concentration Datasets for Use in Human Health and Epidemiology Studies

    Science.gov (United States)

    Huff, A. K.; Weber, S.; Braggio, J.; Talbot, T.; Hall, E.

    2012-12-01

    Fine particulate matter (PM2.5) is a criteria air pollutant, and its adverse impacts on human health are well established. Traditionally, studies that analyze the health effects of human exposure to PM2.5 use concentration measurements from ground-based monitors and predicted PM2.5 concentrations from air quality models, such as the U.S. EPA's Community Multi-scale Air Quality (CMAQ) model. There are shortcomings associated with these datasets, however. Monitors are not distributed uniformly across the U.S., which causes spatially inhomogeneous measurements of pollutant concentrations. There are often temporal variations as well, since not all monitors make daily measurements. Air quality model output, while spatially and temporally uniform, represents predictions of PM2.5 concentrations, not actual measurements. This study is exploring the potential of combining Aerosol Optical Depth (AOD) data from the MODIS instrument on NASA's Terra and Aqua satellites with PM2.5 monitor data and CMAQ predictions to create PM2.5 datasets that more accurately reflect the spatial and temporal variations in ambient PM2.5 concentrations on the metropolitan scale, with the overall goal of enhancing capabilities for environmental public health decision-making. AOD data provide regional information about particulate concentrations that can fill in the spatial and temporal gaps in the national PM2.5 monitor network. Furthermore, AOD is a measurement, so it reflects actual concentrations of particulates in the atmosphere, in contrast to PM2.5 predictions from air quality models. Results will be presented from the Battelle/U.S. EPA statistical Hierarchical Bayesian Model (HBM), which was used to combine three PM2.5 concentration datasets: monitor measurements, AOD data, and CMAQ model predictions. The study is focusing on the Baltimore, MD and New York City, NY metropolitan regions for the period 2004-2006. For each region, combined monitor/AOD/CMAQ PM2.5 datasets generated by the HBM
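
    The Hierarchical Bayesian Model itself is considerably more involved, but the core idea of fusing a sparse, accurate dataset with a complete, noisier one can be sketched as an inverse-variance weighted average. All numbers below are made up for illustration.

```python
import numpy as np

# Hypothetical PM2.5 estimates (ug/m3) for a row of grid cells, with assumed
# error variances: monitors are accurate but sparse (NaN where absent),
# the model/AOD-based field is complete but noisier.
monitor = np.array([12.0, np.nan, np.nan, 15.0, np.nan])
model = np.array([10.5, 11.0, 13.2, 16.5, 9.8])
var_monitor, var_model = 1.0, 9.0

# Precision-weighted (inverse-variance) combination where both are present;
# fall back to the model field where no monitor exists.
w_mon = 1.0 / var_monitor
w_mod = 1.0 / var_model
combined = np.where(
    np.isnan(monitor),
    model,
    (w_mon * np.nan_to_num(monitor) + w_mod * model) / (w_mon + w_mod),
)
```

    Where a monitor exists, the combined value is pulled strongly toward the monitor (its variance is assumed nine times smaller here); elsewhere the complete model/AOD field fills the gap, which mirrors the gap-filling role the abstract describes for AOD.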

  12. Development of a SPARK Training Dataset

    International Nuclear Information System (INIS)

    Sayre, Amanda M.; Olson, Jarrod R.

    2015-01-01

    In its first five years, the National Nuclear Security Administration's (NNSA) Next Generation Safeguards Initiative (NGSI) sponsored more than 400 undergraduate, graduate, and post-doctoral students in internships and research positions (Wyse 2012). In the past seven years, the NGSI program has produced, and continues to produce, a large body of scientific, technical, and policy work in targeted core safeguards capabilities and human capital development activities. Not only does the NGSI program carry out activities across multiple disciplines, but also across all U.S. Department of Energy (DOE)/NNSA locations in the United States. However, products are not readily shared among disciplines and across locations, nor are they archived in a comprehensive library. Rather, knowledge of NGSI-produced literature is localized to the researchers, clients, and internal laboratory/facility publication systems such as the Electronic Records and Information Capture Architecture (ERICA) at the Pacific Northwest National Laboratory (PNNL). There is also no incorporated way of analyzing existing NGSI literature to determine whether the larger NGSI program is achieving its core safeguards capabilities and activities. A complete library of NGSI literature could prove beneficial to a cohesive, sustainable, and more economical NGSI program. The Safeguards Platform for Automated Retrieval of Knowledge (SPARK) has been developed to be a knowledge storage, retrieval, and analysis capability to capture safeguards knowledge to exist beyond the lifespan of NGSI. During the development process, it was necessary to build a SPARK training dataset (a corpus of documents) for initial entry into the system and for demonstration purposes. We manipulated these data to gain new information about the breadth of NGSI publications, and evaluated the science-policy interface at PNNL as a practical demonstration of SPARK's intended analysis capability. The analysis demonstration sought to answer

  13. An application of Random Forests to a genome-wide association dataset: Methodological considerations & new findings

    Directory of Open Access Journals (Sweden)

    Hubbard Alan E

    2010-06-01

    Full Text Available Abstract Background As computational power improves, the application of more advanced machine learning techniques to the analysis of large genome-wide association (GWA) datasets becomes possible. While most traditional statistical methods can only elucidate main effects of genetic variants on risk for disease, certain machine learning approaches are particularly suited to discover higher order and non-linear effects. One such approach is the Random Forests (RF) algorithm. The use of RF for SNP discovery related to human disease has grown in recent years; however, most work has focused on small datasets or simulation studies which are limited. Results Using a multiple sclerosis (MS) case-control dataset comprised of 300 K SNP genotypes across the genome, we outline an approach and some considerations for optimally tuning the RF algorithm based on the empirical dataset. Importantly, results show that typical default parameter values are not appropriate for large GWA datasets. Furthermore, gains can be made by sub-sampling the data, pruning based on linkage disequilibrium (LD), and removing strong effects from RF analyses. The new RF results are compared to findings from the original MS GWA study and demonstrate overlap. In addition, four new interesting candidate MS genes are identified, MPHOSPH9, CTNNA3, PHACTR2 and IL7, by RF analysis and warrant further follow-up in independent studies. Conclusions This study presents one of the first illustrations of successfully analyzing GWA data with a machine learning algorithm. It is shown that RF is computationally feasible for GWA data and the results obtained make biologic sense based on previous studies. More importantly, new genes were identified as potentially being associated with MS, suggesting new avenues of investigation for this complex disease.
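
    The tuning points the abstract raises (larger-than-default mtry and sub-sampling for wide GWA matrices) can be sketched with scikit-learn. The genotype matrix below is synthetic and the parameter values are illustrative choices, not the study's settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Toy stand-in for a GWA matrix: n subjects x p SNPs coded 0/1/2, with two
# weakly informative SNPs (purely synthetic data).
n, p = 200, 500
X = rng.integers(0, 3, size=(n, p)).astype(float)
y = (X[:, 0] + X[:, 1] + rng.normal(0, 1.5, n) > 2.0).astype(int)

# For wide GWA data the default max_features='sqrt' is often too small to
# pick up weak signals; trying larger mtry values is one tuning step.
rf = RandomForestClassifier(
    n_estimators=500,
    max_features=0.1,      # consider 10% of SNPs at each split (tuned, not default)
    max_samples=0.632,     # sub-sample subjects for each tree
    n_jobs=-1,
    random_state=0,
)
rf.fit(X, y)

# Rank SNPs by importance; top-ranked ones become follow-up candidates.
top_snps = np.argsort(rf.feature_importances_)[::-1][:10]
```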

  14. ClimateNet: A Machine Learning dataset for Climate Science Research

    Science.gov (United States)

    Prabhat, M.; Biard, J.; Ganguly, S.; Ames, S.; Kashinath, K.; Kim, S. K.; Kahou, S.; Maharaj, T.; Beckham, C.; O'Brien, T. A.; Wehner, M. F.; Williams, D. N.; Kunkel, K.; Collins, W. D.

    2017-12-01

    Deep Learning techniques have revolutionized commercial applications in computer vision, speech recognition and control systems. The key for all of these developments was the creation of a curated, labeled dataset, ImageNet, which enabled multiple research groups around the world to develop methods, benchmark performance and compete with each other. The success of Deep Learning can be largely attributed to the broad availability of this dataset. Our empirical investigations have revealed that Deep Learning is similarly poised to benefit the task of pattern detection in climate science. Unfortunately, labeled datasets, a key pre-requisite for training, are hard to find. Individual research groups are typically interested in specialized weather patterns, making it hard to unify and share datasets across groups and institutions. In this work, we propose ClimateNet: a labeled dataset that provides labeled instances of extreme weather patterns, as well as associated raw fields in model and observational output. We develop a schema in NetCDF to enumerate weather pattern classes/types, store bounding boxes, and pixel-masks. We are also working on a TensorFlow implementation to natively import such NetCDF datasets, and are providing a reference convolutional architecture for binary classification tasks. Our hope is that researchers in Climate Science, as well as ML/DL, will be able to use (and extend) ClimateNet to make rapid progress in the application of Deep Learning for Climate Science research.

  15. Resampling Methods Improve the Predictive Power of Modeling in Class-Imbalanced Datasets

    Directory of Open Access Journals (Sweden)

    Paul H. Lee

    2014-09-01

    Full Text Available In the medical field, many outcome variables are dichotomized, and the two possible values of a dichotomized variable are referred to as classes. A dichotomized dataset is class-imbalanced if it consists mostly of one class, and the performance of common classification models on this type of dataset tends to be suboptimal. To tackle such a problem, resampling methods, including oversampling and undersampling, can be used. This paper aims at illustrating the effect of resampling methods using the National Health and Nutrition Examination Survey (NHANES) wave 2009–2010 dataset. A total of 4677 participants aged ≥20 without self-reported diabetes and with valid blood test results were analyzed. The Classification and Regression Tree (CART) procedure was used to build a classification model on undiagnosed diabetes. A participant was considered to have undiagnosed diabetes if they demonstrated evidence of diabetes according to WHO diabetes criteria. Exposure variables included demographics and socio-economic status. CART models were fitted using a randomly selected 70% of the data (training dataset), and the area under the receiver operating characteristic curve (AUC) was computed using the remaining 30% of the sample for evaluation (testing dataset). CART models were fitted using the training dataset, the oversampled training dataset, the weighted training dataset, and the undersampled training dataset. In addition, resampling case-to-control ratios of 1:1, 1:2, and 1:4 were examined. The effects of resampling methods on the performance of other extensions of CART (random forests and generalized boosted trees) were also examined. CARTs fitted on the oversampled (AUC = 0.70) and undersampled training data (AUC = 0.74) yielded a better classification power than that on the training data (AUC = 0.65). Resampling could also improve the classification power of random forests and generalized boosted trees. To conclude, applying resampling methods in a class-imbalanced dataset improved the classification power of CART, random forests
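
    The undersampling step the paper applies can be sketched in a few lines: keep all cases and randomly keep a fixed number of controls per case. The labels below are synthetic, with a prevalence loosely resembling the paper's setting.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic class-imbalanced labels: ~5% positives (e.g. undiagnosed diabetes).
y = (rng.random(4677) < 0.05).astype(int)
X = rng.normal(size=(4677, 6))  # stand-in exposure variables

def undersample(X, y, ratio=1, rng=rng):
    """Keep all cases and randomly keep `ratio` controls per case (1:ratio)."""
    cases = np.flatnonzero(y == 1)
    controls = np.flatnonzero(y == 0)
    keep = rng.choice(controls, size=ratio * len(cases), replace=False)
    idx = np.concatenate([cases, keep])
    rng.shuffle(idx)
    return X[idx], y[idx]

X_bal, y_bal = undersample(X, y, ratio=1)   # 1:1 case-to-control ratio
```

    The 1:2 and 1:4 ratios examined in the paper correspond to `ratio=2` and `ratio=4`; oversampling is the mirror image (sample cases with replacement until the classes balance).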

  16. BASE MAP DATASET, INYO COUNTY, OKLAHOMA

    Data.gov (United States)

    Federal Emergency Management Agency, Department of Homeland Security — FEMA Framework Basemap datasets comprise six of the seven FGDC themes of geospatial data that are used by most GIS applications (Note: the seventh framework theme,...

  17. BASE MAP DATASET, JACKSON COUNTY, OKLAHOMA

    Data.gov (United States)

    Federal Emergency Management Agency, Department of Homeland Security — FEMA Framework Basemap datasets comprise six of the seven FGDC themes of geospatial data that are used by most GIS applications (Note: the seventh framework theme,...

  18. BASE MAP DATASET, KINGFISHER COUNTY, OKLAHOMA

    Data.gov (United States)

    Federal Emergency Management Agency, Department of Homeland Security — FEMA Framework Basemap datasets comprise six of the seven FGDC themes of geospatial data that are used by most GIS applications (Note: the seventh framework theme,...

  19. Testing the efficacy of downscaling in species distribution modelling: a comparison between MaxEnt and Favourability Function models

    Directory of Open Access Journals (Sweden)

    Olivero, J.

    2016-03-01

    Full Text Available Statistical downscaling is used to improve the knowledge of spatial distributions from broad-scale to fine-scale maps with higher potential for conservation planning. We assessed the effectiveness of downscaling in two commonly used species distribution models: Maximum Entropy (MaxEnt) and the Favourability Function (FF). We used atlas data (10 x 10 km) of the fire salamander Salamandra salamandra distribution in southern Spain to derive models at a 1 x 1 km resolution. Downscaled models were assessed using an independent dataset of the species’ distribution at 1 x 1 km. The Favourability model showed better downscaling performance than the MaxEnt model, and the models that were based on linear combinations of environmental variables performed better than models allowing higher flexibility. The Favourability model minimized model overfitting compared to the MaxEnt model.
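
    For context, the Favourability Function (as published by Real, Barbosa & Vargas, 2006) rescales a logistic model's output so it no longer depends on the sample prevalence: F = (P/(1−P)) / (n1/n0 + P/(1−P)), where n1 and n0 are the numbers of presences and absences in the training data. A minimal sketch:

```python
def favourability(p, n1, n0):
    """Favourability (Real et al. 2006): prevalence-independent suitability.

    p  -- probability from a logistic model,
    n1 -- number of presences, n0 -- number of absences in the training data.
    """
    odds = p / (1.0 - p)
    return odds / (n1 / n0 + odds)

# When p equals the prevalence n1/(n1+n0), favourability is 0.5 by construction,
# so F > 0.5 always means "better than expected by chance" regardless of prevalence.
n1, n0 = 120, 380   # hypothetical presence/absence counts
prevalence = n1 / (n1 + n0)
f_at_prev = favourability(prevalence, n1, n0)
```

    This prevalence-independence is what makes favourability values comparable across models and resolutions, which matters when assessing downscaling.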

  20. Apache Flume distributed log collection for Hadoop

    CERN Document Server

    D'Souza, Subas

    2013-01-01

    A starter guide that covers Apache Flume in detail. Apache Flume: Distributed Log Collection for Hadoop is intended for people who are responsible for moving datasets into Hadoop in a timely and reliable manner, such as software engineers, database administrators, and data warehouse administrators

  1. The potential distribution of bioenergy crops in Europe under present and future climate

    International Nuclear Information System (INIS)

    Tuck, Gill; Glendining, Margaret J.; Smith, Pete; Wattenbach, Martin; House, Jo I.

    2006-01-01

    We have derived maps of the potential distribution of 26 promising bioenergy crops in Europe, based on simple rules for suitable climatic conditions and elevation. Crops suitable for temperate and Mediterranean climates were selected from four groups: oilseeds (e.g. oilseed rape, sunflower), starch crops (e.g. potatoes), cereals (e.g. barley) and solid biofuel crops (e.g. sorghum, Miscanthus). The impact of climate change under different scenarios and GCMs on the potential future distribution of these crops was determined, based on predicted future climatic conditions. Climate scenarios based on four IPCC SRES emission scenarios, A1FI, A2, B1 and B2, implemented by four global climate models, HadCM3, CSIRO2, PCM and CGCM2, were used. The potential distribution of temperate oilseeds, cereals, starch crops and solid biofuels is predicted to increase in northern Europe by the 2080s, due to increasing temperatures, and decrease in southern Europe (e.g. Spain, Portugal, southern France, Italy, and Greece) due to increased drought. Mediterranean oil and solid biofuel crops, currently restricted to southern Europe, are predicted to extend further north due to higher summer temperatures. Effects become more pronounced with time and are greatest under the A1FI scenario and for models predicting the greatest climate forcing. Different climate models produce different regional patterns. All models predict that bioenergy crop production in Spain is especially vulnerable to climate change, with many temperate crops predicted to decline dramatically by the 2080s. The choice of bioenergy crops in southern Europe will be severely reduced in future unless measures are taken to adapt to climate change. (author)

  2. A dataset from bottom trawl survey around Taiwan

    Directory of Open Access Journals (Sweden)

    Kwang-tsao Shao

    2012-05-01

    Full Text Available Bottom trawl fishery is one of the most important coastal fisheries in Taiwan, both in production and economic value. However, its annual production started to decline due to overfishing in the 1980s. Its bycatch problem also seriously damages the fishery resource. Thus, the government banned bottom trawling within 3 nautical miles of the shoreline in 1989. To evaluate the effectiveness of this policy, a four-year survey was conducted from 2000–2003 in the waters around Taiwan and Penghu (Pescadores Islands), one region each year. All fish specimens collected from trawling were brought back to the lab for identification, individual counts and body weight measurement. These raw data have been integrated into the Taiwan Fish Database (http://fishdb.sinica.edu.tw). They have also been published through TaiBIF (http://taibif.tw), FishBase and GBIF (websites see below). This dataset contains 631 fish species and 3,529 records, making it the most complete dataset of demersal fish fauna and their temporal and spatial distribution on soft marine habitats in Taiwan.

  3. Image segmentation evaluation for very-large datasets

    Science.gov (United States)

    Reeves, Anthony P.; Liu, Shuang; Xie, Yiting

    2016-03-01

    With the advent of modern machine learning methods and fully automated image analysis there is a need for very large image datasets having documented segmentations for both computer algorithm training and evaluation. Current approaches of visual inspection and manual markings do not scale well to big data. We present a new approach that depends on fully automated algorithm outcomes for segmentation documentation, requires no manual marking, and provides quantitative evaluation for computer algorithms. The documentation of new image segmentations and new algorithm outcomes are achieved by visual inspection. The burden of visual inspection on large datasets is minimized by (a) customized visualizations for rapid review and (b) reducing the number of cases to be reviewed through analysis of quantitative segmentation evaluation. This method has been applied to a dataset of 7,440 whole-lung CT images for 6 different segmentation algorithms designed to fully automatically facilitate the measurement of a number of very important quantitative image biomarkers. The results indicate that we could achieve 93% to 99% successful segmentation for these algorithms on this relatively large image database. The presented evaluation method may be scaled to much larger image databases.

  4. Sea Surface Temperature for Climate Applications: A New Dataset from the European Space Agency Climate Change Initiative

    Science.gov (United States)

    Merchant, C. J.; Hulley, G. C.

    2013-12-01

    There are many datasets describing the evolution of global sea surface temperature (SST) over recent decades -- so why make another one? Answer: to provide observations of SST that have particular qualities relevant to climate applications: independence, accuracy and stability. This has been done within the European Space Agency (ESA) Climate Change Initiative (CCI) project on SST. Independence refers to the fact that the new SST CCI dataset is not derived from or tuned to in situ observations. This matters for climate because the in situ observing network used to assess marine climate change (1) was not designed to monitor small changes over decadal timescales, and (2) has evolved significantly in its technology and mix of types of observation, even during the past 40 years. The potential for significant artefacts in our picture of global ocean surface warming is clear. Only by having an independent record can we confirm (or refute) that the work done to remove biases/trend artefacts in in-situ datasets has been successful. Accuracy is the degree to which SSTs are unbiased. For climate applications, a common accuracy target is 0.1 K for all regions of the ocean. Stability is the degree to which the bias, if any, in a dataset is constant over time. Long-term instability introduces trend artefacts. To observe trends of the magnitude of 'global warming', SST datasets need to be stable to <5 mK/year. The SST CCI project has produced a satellite-based dataset that addresses these characteristics relevant to climate applications. Satellite radiances (brightness temperatures) have been harmonised exploiting periods of overlapping observations between sensors. Less well-characterised sensors have had their calibration tuned to that of better characterised sensors (at radiance level). Non-conventional retrieval methods (optimal estimation) have been employed to reduce regional biases to the 0.1 K level, a target violated in most satellite SST datasets. Models for

  5. Measuring Geographic Distribution of Economic Activity in Nigeria ...

    African Journals Online (AJOL)

    USER

    For example, the outcome of this study could help further development of ... selection of the years was informed by the availability of gridded population data. The dataset .... slight difference in the directional distribution of the economic activity.

  6. A New Dataset Size Reduction Approach for PCA-Based Classification in OCR Application

    Directory of Open Access Journals (Sweden)

    Mohammad Amin Shayegan

    2014-01-01

    Full Text Available A major problem for pattern recognition systems is the large volume of training datasets, which often include duplicate and similar training samples. To overcome this problem, several dataset size reduction and dimensionality reduction techniques have been introduced. The algorithms presently used for dataset size reduction usually remove samples near the centers of classes or support vector samples between different classes. However, samples near a class center carry valuable information about the class characteristics, and the support vector samples are important for evaluating system efficiency. This paper reports on the use of a Modified Frequency Diagram technique for dataset size reduction. In the proposed technique, a training dataset is rearranged and then sieved. The sieved training dataset, along with automatic feature extraction/selection using Principal Component Analysis, is used in an OCR application. Experimental results obtained when using the proposed system on one of the largest handwritten Farsi/Arabic numeral standard OCR datasets, Hoda, show a recognition rate of about 97%. The recognition speed increased by 2.28 times, while the accuracy decreased by only 0.7%, when a sieved version of the dataset, only half the size of the initial training dataset, was used.
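    The sieving idea can be illustrated with a minimal Python sketch. This is not the authors' Modified Frequency Diagram algorithm; the quantisation rule and the toy data are hypothetical, and serve only to show how near-duplicate training samples can be collapsed to shrink a dataset.

```python
# Illustrative sketch of near-duplicate removal for dataset size reduction.
# NOTE: this is NOT the paper's Modified Frequency Diagram algorithm; the
# quantisation step and the toy data below are hypothetical.

def sieve(samples, step=0.25):
    """Keep one sample per quantisation cell, dropping near-duplicates."""
    seen = set()
    kept = []
    for vec in samples:
        key = tuple(round(x / step) for x in vec)  # quantise each feature
        if key not in seen:
            seen.add(key)
            kept.append(vec)
    return kept

train = [(0.10, 0.90), (0.12, 0.91), (0.80, 0.20), (0.11, 0.88)]
print(len(train), "->", len(sieve(train)))  # 4 -> 2
```

    A coarser `step` removes more samples; in the paper's setting the trade-off is between recognition speed and the small accuracy loss reported above.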

  7. [Projection of potential geographic distribution of Apocynum venetum under climate change in northern China].

    Science.gov (United States)

    Yang, Hui-Feng; Zheng, Jiang-Hua; Jia, Xiao-Guang; Li, Xiao-Jin

    2017-03-01

    Apocynum venetum, a perennial medicinal plant of the family Apocynaceae, has stems that are an important textile raw material. Projecting the potential geographic distribution of A. venetum is important for the protection and sustainable utilization of the plant. This study was conducted to determine the potential geographic distribution of A. venetum and to project how climate change would affect it. The potential geographic distribution of A. venetum under current bioclimatic conditions in northern China was simulated using MaxEnt software, based on species presence data at 44 locations and 19 bioclimatic parameters. Future distributions of A. venetum were also projected for 2050 and 2070 under the climate change scenarios RCP2.6 and RCP8.5 described in the 5th Assessment Report of the Intergovernmental Panel on Climate Change (IPCC). The results showed that minimum air temperature of the coldest month, annual mean air temperature, precipitation of the coldest quarter and mean air temperature of the wettest quarter dominated the geographic distribution of A. venetum. Under the current climate, suitable habitat for A. venetum covers 11.94% of China, located mainly in the middle of Xinjiang, the northern part of Gansu, the southern part of Neimeng, the northern part of Ningxia, the middle and northern parts of Shaanxi, the southern part of Shanxi, the middle and northern parts of Henan, the middle and southern parts of Hebei, Shandong, Tianjin, the southern part of Liaoning and part of Beijing. For 2050 to 2070, the model outputs indicated that the suitable habitat of A. venetum would decrease under both the RCP2.6 and RCP8.5 climate change scenarios. Copyright© by the Chinese Pharmaceutical Association.

  8. The CMS dataset bookkeeping service

    Science.gov (United States)

    Afaq, A.; Dolgert, A.; Guo, Y.; Jones, C.; Kosyakov, S.; Kuznetsov, V.; Lueking, L.; Riley, D.; Sekhri, V.

    2008-07-01

    The CMS Dataset Bookkeeping Service (DBS) has been developed to catalog all CMS event data from Monte Carlo and Detector sources. It provides the ability to identify MC or trigger source, track data provenance, construct datasets for analysis, and discover interesting data. CMS requires processing and analysis activities at various service levels and the DBS system provides support for localized processing or private analysis, as well as global access for CMS users at large. Catalog entries can be moved among the various service levels with a simple set of migration tools, thus forming a loose federation of databases. DBS is available to CMS users via Python API, command-line, and Discovery web page interfaces. The system is built as a multi-tier web application with Java servlets running under Tomcat, with connections via JDBC to Oracle or MySQL database backends. Clients connect to the service through HTTP or HTTPS with authentication provided by GRID certificates and authorization through VOMS. DBS is an integral part of the overall CMS Data Management and Workflow Management systems.

  9. The CMS dataset bookkeeping service

    Energy Technology Data Exchange (ETDEWEB)

    Afaq, A; Guo, Y; Kosyakov, S; Lueking, L; Sekhri, V [Fermilab, Batavia, Illinois 60510 (United States); Dolgert, A; Jones, C; Kuznetsov, V; Riley, D [Cornell University, Ithaca, New York 14850 (United States)

    2008-07-15

    The CMS Dataset Bookkeeping Service (DBS) has been developed to catalog all CMS event data from Monte Carlo and Detector sources. It provides the ability to identify MC or trigger source, track data provenance, construct datasets for analysis, and discover interesting data. CMS requires processing and analysis activities at various service levels and the DBS system provides support for localized processing or private analysis, as well as global access for CMS users at large. Catalog entries can be moved among the various service levels with a simple set of migration tools, thus forming a loose federation of databases. DBS is available to CMS users via Python API, command-line, and Discovery web page interfaces. The system is built as a multi-tier web application with Java servlets running under Tomcat, with connections via JDBC to Oracle or MySQL database backends. Clients connect to the service through HTTP or HTTPS with authentication provided by GRID certificates and authorization through VOMS. DBS is an integral part of the overall CMS Data Management and Workflow Management systems.

  10. The CMS dataset bookkeeping service

    International Nuclear Information System (INIS)

    Afaq, A; Guo, Y; Kosyakov, S; Lueking, L; Sekhri, V; Dolgert, A; Jones, C; Kuznetsov, V; Riley, D

    2008-01-01

    The CMS Dataset Bookkeeping Service (DBS) has been developed to catalog all CMS event data from Monte Carlo and Detector sources. It provides the ability to identify MC or trigger source, track data provenance, construct datasets for analysis, and discover interesting data. CMS requires processing and analysis activities at various service levels and the DBS system provides support for localized processing or private analysis, as well as global access for CMS users at large. Catalog entries can be moved among the various service levels with a simple set of migration tools, thus forming a loose federation of databases. DBS is available to CMS users via Python API, command-line, and Discovery web page interfaces. The system is built as a multi-tier web application with Java servlets running under Tomcat, with connections via JDBC to Oracle or MySQL database backends. Clients connect to the service through HTTP or HTTPS with authentication provided by GRID certificates and authorization through VOMS. DBS is an integral part of the overall CMS Data Management and Workflow Management systems.

  11. The CMS dataset bookkeeping service

    International Nuclear Information System (INIS)

    Afaq, Anzar; Dolgert, Andrew; Guo, Yuyi; Jones, Chris; Kosyakov, Sergey; Kuznetsov, Valentin; Lueking, Lee; Riley, Dan; Sekhri, Vijay

    2007-01-01

    The CMS Dataset Bookkeeping Service (DBS) has been developed to catalog all CMS event data from Monte Carlo and Detector sources. It provides the ability to identify MC or trigger source, track data provenance, construct datasets for analysis, and discover interesting data. CMS requires processing and analysis activities at various service levels and the DBS system provides support for localized processing or private analysis, as well as global access for CMS users at large. Catalog entries can be moved among the various service levels with a simple set of migration tools, thus forming a loose federation of databases. DBS is available to CMS users via Python API, command-line, and Discovery web page interfaces. The system is built as a multi-tier web application with Java servlets running under Tomcat, with connections via JDBC to Oracle or MySQL database backends. Clients connect to the service through HTTP or HTTPS with authentication provided by GRID certificates and authorization through VOMS. DBS is an integral part of the overall CMS Data Management and Workflow Management systems.

  12. Potential fluctuations due to randomly distributed charges at the semiconductor-insulator interface in MIS-structures

    International Nuclear Information System (INIS)

    Yanchev, I.

    2003-01-01

    A new expression for the Fourier transform of the binary correlation function of the random potential near the semiconductor-insulator interface is derived. Screening by the metal electrode in the MIS structure is taken into account by introducing an effective insulator thickness. An essential advantage of this correlation function is the finite dispersion of the random potential to which it leads, in contrast to previously known correlation functions, which lead to a divergent dispersion. The dispersion, an important characteristic of the random potential distribution that determines the amplitude of the potential fluctuations, is calculated.

  13. Potential fluctuations due to randomly distributed charges at the semiconductor-insulator interface in MIS-structures

    CERN Document Server

    Yanchev, I

    2003-01-01

    A new expression for the Fourier transform of the binary correlation function of the random potential near the semiconductor-insulator interface is derived. Screening by the metal electrode in the MIS structure is taken into account by introducing an effective insulator thickness. An essential advantage of this correlation function is the finite dispersion of the random potential to which it leads, in contrast to previously known correlation functions, which lead to a divergent dispersion. The dispersion, an important characteristic of the random potential distribution that determines the amplitude of the potential fluctuations, is calculated.

  14. Potential fluctuations due to randomly distributed charges at the semiconductor-insulator interface in MIS-structures

    Energy Technology Data Exchange (ETDEWEB)

    Yanchev, I

    2003-07-01

    A new expression for the Fourier transform of the binary correlation function of the random potential near the semiconductor-insulator interface is derived. Screening by the metal electrode in the MIS structure is taken into account by introducing an effective insulator thickness. An essential advantage of this correlation function is the finite dispersion of the random potential to which it leads, in contrast to previously known correlation functions, which lead to a divergent dispersion. The dispersion, an important characteristic of the random potential distribution that determines the amplitude of the potential fluctuations, is calculated.

  15. Historical gridded reconstruction of potential evapotranspiration for the UK

    Science.gov (United States)

    Tanguy, Maliko; Prudhomme, Christel; Smith, Katie; Hannaford, Jamie

    2018-06-01

    Potential evapotranspiration (PET) is a necessary input for most hydrological models and is often needed at a daily time step. Accurate estimation of PET requires many input climate variables which are, in most cases, not available prior to the 1960s for the UK, nor indeed for most parts of the world. Therefore, when applying hydrological models to earlier periods, modellers have to rely on PET estimations derived from simplified methods. Given that only monthly observed temperature data are readily available at a national scale for the UK for the late 19th and early 20th centuries, the objective of this work was to derive the best possible UK-wide gridded PET dataset from the limited data available. To that end, a combination of (i) seven temperature-based PET equations, (ii) four different calibration approaches and (iii) seven input temperature datasets was evaluated. For this evaluation, a gridded daily PET product based on the physically based Penman-Monteith equation (the CHESS PET dataset) was used, the rationale being that it provides a reliable ground-truth PET dataset for evaluation purposes, given that no directly observed, distributed PET datasets exist. The performance of the models was also compared to a naïve method, defined as the simplest possible estimation of PET in the absence of any available climate data. The naïve method used in this study is the CHESS PET daily long-term average (for the period 1961 to 1990), or CHESS-PET daily climatology. The analysis revealed that the type of calibration and the input temperature dataset had only a minor effect on the accuracy of the PET estimations at catchment scale. Of the seven equations tested, only the calibrated version of the McGuinness-Bordne equation was able to outperform the naïve method and was therefore used to derive the gridded, reconstructed dataset. The equation was calibrated using 43 catchments across Great Britain. The dataset produced is a 5 km gridded
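    The naïve benchmark described above can be sketched in a few lines of Python. This is a reading of the method, not the CHESS-PET code, and the data values are hypothetical: PET for a given calendar day is estimated by the long-term mean of all observations on that day.

```python
# Sketch of a daily-climatology "naive" PET estimator: each calendar day's
# PET is the long-term mean of observations on that day. Toy data only.

from statistics import mean

def daily_climatology(records):
    """records: iterable of ((month, day), pet) pairs spanning many years."""
    by_day = {}
    for day, pet in records:
        by_day.setdefault(day, []).append(pet)
    return {day: mean(values) for day, values in by_day.items()}

obs = [((6, 1), 3.2), ((6, 1), 3.6), ((6, 2), 3.0)]
clim = daily_climatology(obs)
print(round(clim[(6, 1)], 1))  # 3.4
```

    Any candidate PET equation then has to beat this lookup table to justify its extra complexity, which is the comparison the study performs.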

  16. Impact of climate change on potential distribution of Chinese caterpillar fungus (Ophiocordyceps sinensis) in Nepal Himalaya.

    Science.gov (United States)

    Shrestha, Uttam Babu; Bawa, Kamaljit S

    2014-01-01

    Climate change has already impacted ecosystems and species, and substantial impacts of climate change are expected in the future. Species distribution modeling is widely used to map the current potential distribution of species as well as to model the impact of future climate change on species distributions. Mapping current distribution is useful for conservation planning, and understanding changes in distribution driven by climate change is important for mitigating future biodiversity losses. However, the current distribution of Chinese caterpillar fungus, a flagship species of the Himalaya with very high economic value, is unknown, nor do we know the potential changes in its suitable habitat caused by future climate change. We used MaxEnt modeling to predict the current distribution and future changes in the distribution of Chinese caterpillar fungus under three climate change trajectories based on representative concentration pathways (RCP 2.6, RCP 4.5, and RCP 6.0) in three different time periods (2030, 2050, and 2070), using species occurrence points, bioclimatic variables, and altitude. About 6.02% (8989 km²) of the Nepal Himalaya is suitable habitat for Chinese caterpillar fungus. Our model showed that across all future climate change trajectories and time periods, the area of predicted suitable habitat would expand, with 0.11-4.87% expansion over current suitable habitat. Depending on the representative concentration pathway, we observed both increases and decreases in the average elevation of the species' suitable habitat range.
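    Habitat-area figures like the 6.02% quoted above come from thresholding a continuous suitability surface into suitable/unsuitable cells. A minimal Python sketch of that post-processing step (illustrative only; MaxEnt itself is a separate tool, and the threshold, grid values and cell size here are hypothetical):

```python
# Illustrative post-processing sketch: turn a continuous habitat-suitability
# grid (e.g. MaxEnt output) into a suitable-area fraction and total area.
# The threshold, grid and cell size are hypothetical, not from the study.

def suitable_fraction(grid, threshold=0.5, cell_km2=1.0):
    cells = [v for row in grid for v in row]
    n_suitable = sum(1 for v in cells if v >= threshold)
    return n_suitable / len(cells), n_suitable * cell_km2

grid = [[0.9, 0.2, 0.1],
        [0.6, 0.7, 0.0],
        [0.3, 0.55, 0.4]]
frac, area = suitable_fraction(grid)
print(f"{100 * frac:.1f}% suitable, {area:.0f} km2")  # 44.4% suitable, 4 km2
```

    Comparing this quantity between current and future climate layers gives the expansion or contraction percentages reported in such studies.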

  17. A cross-country Exchange Market Pressure (EMP dataset

    Directory of Open Access Journals (Sweden)

    Mohit Desai

    2017-06-01

    Full Text Available The data presented in this article are related to the research article titled "An exchange market pressure measure for cross country analysis" (Patnaik et al. [1]). In this article, we present the dataset of Exchange Market Pressure (EMP) values for 139 countries along with their conversion factors, ρ (rho). Exchange Market Pressure, expressed as a percentage change in the exchange rate, measures the change in the exchange rate that would have taken place had the central bank not intervened. The conversion factor ρ can be interpreted as the change in the exchange rate associated with $1 billion of intervention. Estimates of the conversion factor ρ allow us to calculate a monthly time series of EMP for 139 countries. Additionally, the dataset contains the 68% confidence interval (high and low values) for the point estimates of ρ's. Using the standard errors of the estimates of ρ's, we obtain one-sigma intervals around mean estimates of EMP values. These values are also reported in the dataset.

  18. A cross-country Exchange Market Pressure (EMP) dataset.

    Science.gov (United States)

    Desai, Mohit; Patnaik, Ila; Felman, Joshua; Shah, Ajay

    2017-06-01

    The data presented in this article are related to the research article titled "An exchange market pressure measure for cross country analysis" (Patnaik et al. [1]). In this article, we present the dataset of Exchange Market Pressure (EMP) values for 139 countries along with their conversion factors, ρ (rho). Exchange Market Pressure, expressed as a percentage change in the exchange rate, measures the change in the exchange rate that would have taken place had the central bank not intervened. The conversion factor ρ can be interpreted as the change in the exchange rate associated with $1 billion of intervention. Estimates of the conversion factor ρ allow us to calculate a monthly time series of EMP for 139 countries. Additionally, the dataset contains the 68% confidence interval (high and low values) for the point estimates of ρ's. Using the standard errors of the estimates of ρ's, we obtain one-sigma intervals around mean estimates of EMP values. These values are also reported in the dataset.
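    The EMP construction described above can be sketched as follows. This is an illustrative reading of the text, not code from the dataset; the sign convention and example values are assumptions:

```python
# Sketch of the EMP construction described above (assumed conventions):
# EMP_t = %change in exchange rate + rho * intervention (in $ billion),
# i.e. the %change that would have occurred absent intervention.

def emp(pct_change_fx, intervention_bn, rho):
    """rho: %change in exchange rate associated with $1bn of intervention."""
    return pct_change_fx + rho * intervention_bn

# A month where the currency fell 0.5% despite $2bn of intervention,
# with a (hypothetical) rho of -0.8 percent per $1bn:
print(emp(-0.5, 2.0, -0.8))  # -2.1
```

    Repeating this for each month and country, with that country's estimated ρ, yields the monthly EMP time series the dataset reports.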

  19. Westslope Cutthroat Trout Distribution, Pacific Northwest (updated March, 2006)

    Data.gov (United States)

    Pacific States Marine Fisheries Commission — This dataset is a record of fish distribution and activity for WESTSLOPE CUTTHROAT TROUT contained in the StreamNet database. This feature class was created based on...

  20. Spatial datasets of radionuclide contamination in the Ukrainian Chernobyl Exclusion Zone

    Science.gov (United States)

    Kashparov, Valery; Levchuk, Sviatoslav; Zhurba, Marina; Protsak, Valentyn; Khomutinin, Yuri; Beresford, Nicholas A.; Chaplow, Jacqueline S.

    2018-02-01

    The dataset Spatial datasets of radionuclide contamination in the Ukrainian Chernobyl Exclusion Zone was developed to enable data collected between May 1986 (immediately after Chernobyl) and 2014 by the Ukrainian Institute of Agricultural Radiology (UIAR) after the Chernobyl accident to be made publicly available. The dataset includes results from comprehensive soil sampling across the Chernobyl Exclusion Zone (CEZ). Analyses include radiocaesium (134Cs and 137Cs), 90Sr, 154Eu and soil property data; plutonium isotope activity concentrations in soil (including distribution in the soil profile); analyses of hot (or fuel) particles from the CEZ (data from Poland and across Europe are also included); and results of monitoring in the Ivankov district, a region adjacent to the exclusion zone. The purpose of this paper is to describe the available data and methodology used to obtain them. The data will be valuable to those conducting studies within the CEZ in a number of ways, for instance (i) for helping to perform robust exposure estimates to wildlife, (ii) for predicting comparative activity concentrations of different key radionuclides, (iii) for providing a baseline against which future surveys in the CEZ can be compared, (iv) as a source of information on the behaviour of fuel particles (FPs), (v) for performing retrospective dose assessments and (vi) for assessing natural background dose rates in the CEZ. The CEZ has been proposed as a radioecological observatory (i.e. a radioactively contaminated site that will provide a focus for long-term, radioecological collaborative international research). Key to the future success of this concept is open access to data for the CEZ. The data presented here are a first step in this process.
The data and supporting documentation are freely available from the Environmental Information Data Centre (EIDC) under the terms and conditions of the Open Government Licence: https://doi.org/10.5285/782ec845-2135-4698-8881-b38823e533bf.

  1. Spatial datasets of radionuclide contamination in the Ukrainian Chernobyl Exclusion Zone

    Directory of Open Access Journals (Sweden)

    V. Kashparov

    2018-02-01

    Full Text Available The dataset Spatial datasets of radionuclide contamination in the Ukrainian Chernobyl Exclusion Zone was developed to enable data collected between May 1986 (immediately after Chernobyl) and 2014 by the Ukrainian Institute of Agricultural Radiology (UIAR) after the Chernobyl accident to be made publicly available. The dataset includes results from comprehensive soil sampling across the Chernobyl Exclusion Zone (CEZ). Analyses include radiocaesium (134Cs and 137Cs), 90Sr, 154Eu and soil property data; plutonium isotope activity concentrations in soil (including distribution in the soil profile); analyses of hot (or fuel) particles from the CEZ (data from Poland and across Europe are also included); and results of monitoring in the Ivankov district, a region adjacent to the exclusion zone. The purpose of this paper is to describe the available data and methodology used to obtain them. The data will be valuable to those conducting studies within the CEZ in a number of ways, for instance (i) for helping to perform robust exposure estimates to wildlife, (ii) for predicting comparative activity concentrations of different key radionuclides, (iii) for providing a baseline against which future surveys in the CEZ can be compared, (iv) as a source of information on the behaviour of fuel particles (FPs), (v) for performing retrospective dose assessments and (vi) for assessing natural background dose rates in the CEZ. The CEZ has been proposed as a radioecological observatory (i.e. a radioactively contaminated site that will provide a focus for long-term, radioecological collaborative international research). Key to the future success of this concept is open access to data for the CEZ. The data presented here are a first step in this process. The data and supporting documentation are freely available from the Environmental Information Data Centre (EIDC) under the terms and conditions of the Open Government Licence: https://doi.org/10.5285/782ec845-2135-4698-8881-b38823e533bf.

  2. Privacy-Preserving Matching of Spatial Datasets with Protection against Background Knowledge

    DEFF Research Database (Denmark)

    Ghinita, Gabriel; Vicente, Carmen Ruiz; Shang, Ning

    2010-01-01

    should be disclosed. Previous research efforts focused on private matching for relational data, and rely either on space-embedding or on SMC techniques. Space-embedding transforms data points to hide their exact attribute values before matching is performed, whereas SMC protocols simulate complex digital circuits that evaluate the matching condition without revealing anything else other than the matching outcome. However, existing solutions have at least one of the following drawbacks: (i) they fail to protect against adversaries with background knowledge on data distribution, (ii) they compromise privacy by returning large amounts of false positives and (iii) they rely on complex and expensive SMC protocols. In this paper, we introduce a novel geometric transformation to perform private matching on spatial datasets. Our method is efficient and it is not vulnerable to background knowledge attacks. We consider...

  3. Comparisons of Supergranule Properties from SDO/HMI with Other Datasets

    Science.gov (United States)

    Pesnell, William Dean; Williams, Peter E.

    2010-01-01

    While supergranules, a component of solar convection, have been well studied through the use of Dopplergrams, other datasets also exhibit these features. Quiet-Sun magnetograms show local magnetic field elements distributed around the boundaries of supergranule cells, notably clustering at the common apex points of adjacent cells, while more solid cellular features are seen near active regions. Ca II K images are notable for exhibiting the chromospheric network, a cellular distribution of local magnetic field lines across the solar disk that coincides with supergranulation boundaries. Measurements at 304 A further above the solar surface also show a similar pattern to the chromospheric network, but the boundaries are more nebulous in nature. While previous observations of these different solar features were obtained with a variety of instruments, SDO provides a single platform from which the relevant data products are delivered at high cadence and high-definition image quality. The images may also be cross-referenced thanks to their coincident times of observation. We present images of these different solar features from HMI & AIA and use them to make composite images of supergranules at the different atmospheric layers in which they manifest. We also compare each data product to equivalent data from previous observations, for example HMI magnetograms with those from MDI.

  4. How will climate change affect the potential distribution of Eurasian Tree Sparrows Passer montanus in North America?

    Science.gov (United States)

    Graham, Jim; Jarnevich, Catherine; Young, Nick; Newman, Greg; Stohlgren, Thomas

    2011-01-01

    Habitat suitability models have been used to predict the present and future potential distributions of a variety of species. Eurasian tree sparrows Passer montanus, native to Eurasia, have established populations in other parts of the world. In North America, their current distribution is limited to a relatively small region around the site of their original introduction in St. Louis, Missouri. We combined data from the Global Biodiversity Information Facility with current and future climate data to create habitat suitability models for this species using Maxent. Under projected climate change scenarios, our models show that the distribution and range of the Eurasian tree sparrow could expand as far as the Pacific Northwest and Newfoundland. This is potentially important information for prioritizing the management and control of this non-native species.

  5. Knowledge Mining from Clinical Datasets Using Rough Sets and Backpropagation Neural Network

    Directory of Open Access Journals (Sweden)

    Kindie Biredagn Nahato

    2015-01-01

    Full Text Available The availability of clinical datasets and knowledge mining methodologies encourages researchers to pursue research in extracting knowledge from clinical datasets. Different data mining techniques have been used for mining rules, and mathematical models have been developed to assist the clinician in decision making. The objective of this research is to build a classifier that will predict the presence or absence of a disease by learning from a minimal set of attributes extracted from the clinical dataset. In this work a rough set indiscernibility relation method with a backpropagation neural network (RS-BPNN) is used. This work has two stages. The first stage is handling of missing values to obtain a smooth dataset and selection of appropriate attributes from the clinical dataset by the indiscernibility relation method. The second stage is classification using a backpropagation neural network on the selected reducts of the dataset. The classifier has been tested with hepatitis, Wisconsin breast cancer, and Statlog heart disease datasets obtained from the University of California at Irvine (UCI) machine learning repository. The accuracy obtained from the proposed method is 97.3%, 98.6%, and 90.4% for hepatitis, breast cancer, and heart disease, respectively. The proposed system provides an effective classification model for clinical datasets.
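    The indiscernibility relation at the heart of the rough-set stage can be sketched compactly in Python (illustrative only; the paper's full RS-BPNN pipeline also handles missing values, reduct selection and the neural-network classifier, and the toy table is hypothetical): objects are indiscernible with respect to a set of attributes when they agree on all of them.

```python
# Minimal sketch of a rough-set indiscernibility relation: partition the
# rows of a decision table into equivalence classes over some attributes.
# Toy data; not the paper's RS-BPNN implementation.

from collections import defaultdict

def indiscernibility_classes(table, attributes):
    """Partition row indices into equivalence classes over `attributes`."""
    classes = defaultdict(list)
    for i, row in enumerate(table):
        classes[tuple(row[a] for a in attributes)].append(i)
    return list(classes.values())

rows = [
    {"fever": "yes", "cough": "no",  "flu": "yes"},
    {"fever": "yes", "cough": "no",  "flu": "no"},
    {"fever": "no",  "cough": "yes", "flu": "no"},
]
print(indiscernibility_classes(rows, ["fever", "cough"]))  # [[0, 1], [2]]
```

    A reduct is a minimal attribute subset whose equivalence classes still separate the decision labels; those reduct attributes are what feed the BPNN classifier.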

  6. EnviroAtlas - Potential Wetland Areas - Contiguous United States

    Data.gov (United States)

    U.S. Environmental Protection Agency — The EnviroAtlas Potential Wetland Areas (PWA) dataset shows potential wetland areas at 30-meter resolution. Beginning two centuries ago, many wetlands were turned...

  7. Benchmarking undedicated cloud computing providers for analysis of genomic datasets.

    Science.gov (United States)

    Yazar, Seyhan; Gooden, George E C; Mackey, David A; Hewitt, Alex W

    2014-01-01

    A major bottleneck in biological discovery is now emerging at the computational level. Cloud computing offers a dynamic means whereby small and medium-sized laboratories can rapidly adjust their computational capacity. We benchmarked two established cloud computing services, Amazon Web Services Elastic MapReduce (EMR) on Amazon EC2 instances and Google Compute Engine (GCE), using publicly available genomic datasets (E.coli CC102 strain and a Han Chinese male genome) and a standard bioinformatic pipeline on a Hadoop-based platform. Wall-clock time for complete assembly differed by 52.9% (95% CI: 27.5-78.2) for E.coli and 53.5% (95% CI: 34.4-72.6) for human genome, with GCE being more efficient than EMR. The cost of running this experiment on EMR and GCE differed significantly, with the costs on EMR being 257.3% (95% CI: 211.5-303.1) and 173.9% (95% CI: 134.6-213.1) more expensive for E.coli and human assemblies respectively. Thus, GCE was found to outperform EMR both in terms of cost and wall-clock time. Our findings confirm that cloud computing is an efficient and potentially cost-effective alternative for analysis of large genomic datasets. In addition to releasing our cost-effectiveness comparison, we present available ready-to-use scripts for establishing Hadoop instances with Ganglia monitoring on EC2 or GCE.
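    The "X% more expensive" comparisons above are relative cost differences. A one-line sketch (the dollar figures below are hypothetical, not the study's measured costs):

```python
# Sketch of the relative-cost comparison quoted above: "X% more expensive"
# means 100 * (cost_a - cost_b) / cost_b. Hypothetical figures only.

def pct_more_expensive(cost_a, cost_b):
    return 100.0 * (cost_a - cost_b) / cost_b

print(round(pct_more_expensive(35.7, 10.0), 1))  # 257.0
```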

  8. Benchmarking undedicated cloud computing providers for analysis of genomic datasets.

    Directory of Open Access Journals (Sweden)

    Seyhan Yazar

    Full Text Available A major bottleneck in biological discovery is now emerging at the computational level. Cloud computing offers a dynamic means whereby small and medium-sized laboratories can rapidly adjust their computational capacity. We benchmarked two established cloud computing services, Amazon Web Services Elastic MapReduce (EMR) on Amazon EC2 instances and Google Compute Engine (GCE), using publicly available genomic datasets (E.coli CC102 strain and a Han Chinese male genome) and a standard bioinformatic pipeline on a Hadoop-based platform. Wall-clock time for complete assembly differed by 52.9% (95% CI: 27.5-78.2) for E.coli and 53.5% (95% CI: 34.4-72.6) for the human genome, with GCE being more efficient than EMR. The cost of running this experiment on EMR and GCE differed significantly, with the costs on EMR being 257.3% (95% CI: 211.5-303.1) and 173.9% (95% CI: 134.6-213.1) more expensive for the E.coli and human assemblies, respectively. Thus, GCE was found to outperform EMR both in terms of cost and wall-clock time. Our findings confirm that cloud computing is an efficient and potentially cost-effective alternative for analysis of large genomic datasets. In addition to releasing our cost-effectiveness comparison, we present available ready-to-use scripts for establishing Hadoop instances with Ganglia monitoring on EC2 or GCE.

  9. The Global Precipitation Climatology Project (GPCP) Combined Precipitation Dataset

    Science.gov (United States)

    Huffman, George J.; Adler, Robert F.; Arkin, Philip; Chang, Alfred; Ferraro, Ralph; Gruber, Arnold; Janowiak, John; McNab, Alan; Rudolf, Bruno; Schneider, Udo

    1997-01-01

    The Global Precipitation Climatology Project (GPCP) has released the GPCP Version 1 Combined Precipitation Data Set, a global, monthly precipitation dataset covering the period July 1987 through December 1995. The primary product in the dataset is a merged analysis incorporating precipitation estimates from low-orbit-satellite microwave data, geosynchronous-orbit-satellite infrared data, and rain gauge observations. The dataset also contains the individual input fields, a combination of the microwave and infrared satellite estimates, and error estimates for each field. The data are provided on 2.5 deg x 2.5 deg latitude-longitude global grids. Preliminary analyses show general agreement with prior studies of global precipitation and extend prior studies of El Nino-Southern Oscillation precipitation patterns. At the regional scale there are systematic differences with standard climatologies.
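    The merging step can be illustrated with inverse-error-variance weighting, a standard way to combine estimates whose error variances are known. This is a sketch of the idea only; the GPCP combination algorithm itself is more elaborate:

```python
def merge_cells(values, error_vars):
    """Inverse-error-variance weighted mean for one grid cell:
    estimates with smaller error variance get larger weight."""
    weights = [1.0 / v for v in error_vars]
    total = sum(weights)
    return sum(w * x for w, x in zip(weights, values)) / total

# hypothetical cell: microwave estimate 2.0 mm/day (var 1.0),
# gauge estimate 4.0 mm/day (var 3.0) -> merged value nearer the gauge-poor side
merged = merge_cells([2.0, 4.0], [1.0, 3.0])  # 2.5
```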

  10. Future potential distribution of the emerging amphibian chytrid fungus under anthropogenic climate change.

    Science.gov (United States)

    Rödder, Dennis; Kielgast, Jos; Lötters, Stefan

    2010-11-01

    Anthropogenic climate change poses a major threat to global biodiversity with a potential to alter biological interactions at all spatial scales. Amphibians are the most threatened vertebrates and have been subject to increasing conservation attention over the past decade. A particular concern is the pandemic emergence of the parasitic chytrid fungus Batrachochytrium dendrobatidis, which has been identified as the cause of extremely rapid large-scale declines and species extinctions. Experimental and observational studies have demonstrated that the host-pathogen system is strongly influenced by climatic parameters and thereby potentially affected by climate change. Herein we project a species distribution model of the pathogen onto future climatic scenarios generated by the IPCC to examine their potential implications on the pandemic. Results suggest that predicted anthropogenic climate change may reduce the geographic range of B. dendrobatidis and its potential influence on amphibian biodiversity.

  11. A new dataset and algorithm evaluation for mood estimation in music

    OpenAIRE

    Godec, Primož

    2014-01-01

    This thesis presents a new dataset of perceived and induced emotions for 200 audio clips. The gathered dataset provides users' perceived and induced emotions for each clip, the association of color, along with demographic and personal data, such as user's emotion state and emotion ratings, genre preference, music experience, among others. With an online survey we collected more than 7000 responses for a dataset of 200 audio excerpts, thus providing about 37 user responses per clip. The foc...

  12. A GIS model predicting potential distributions of a lineage: a test case on hermit spiders (Nephilidae: Nephilengys).

    Science.gov (United States)

    Năpăruş, Magdalena; Kuntner, Matjaž

    2012-01-01

    Although numerous studies model species distributions, these models are almost exclusively on single species, while studies of evolutionary lineages are preferred as they by definition study closely related species with shared history and ecology. Hermit spiders, genus Nephilengys, represent an ecologically important but relatively species-poor lineage with a globally allopatric distribution. Here, we model Nephilengys global habitat suitability based on known localities and four ecological parameters. We geo-referenced 751 localities for the four most studied Nephilengys species: N. cruentata (Africa, New World), N. livida (Madagascar), N. malabarensis (S-SE Asia), and N. papuana (Australasia). For each locality we overlaid four ecological parameters: elevation, annual mean temperature, annual mean precipitation, and land cover. We used linear backward regression within ArcGIS to select two best fit parameters per species model, and ModelBuilder to map areas of high, moderate and low habitat suitability for each species within its directional distribution. For Nephilengys cruentata suitable habitats are mid elevation tropics within Africa (natural range), a large part of Brazil and the Guianas (area of synanthropic spread), and even North Africa, Mediterranean, and Arabia. Nephilengys livida is confined to its known range with suitable habitats being mid-elevation natural and cultivated lands. Nephilengys malabarensis, however, ranges across the Equator throughout Asia where the model predicts many areas of high ecological suitability in the wet tropics. Its directional distribution suggests the species may potentially spread eastwards to New Guinea where the suitable areas of N. malabarensis largely surpass those of the native N. papuana, a species that prefers dry forests of Australian (sub)tropics. Our model is a customizable GIS tool intended to predict current and future potential distributions of globally distributed terrestrial lineages. Its predictive

  13. A GIS model predicting potential distributions of a lineage: a test case on hermit spiders (Nephilidae: Nephilengys).

    Directory of Open Access Journals (Sweden)

    Magdalena Năpăruş

    Full Text Available BACKGROUND: Although numerous studies model species distributions, these models are almost exclusively on single species, while studies of evolutionary lineages are preferred as they by definition study closely related species with shared history and ecology. Hermit spiders, genus Nephilengys, represent an ecologically important but relatively species-poor lineage with a globally allopatric distribution. Here, we model Nephilengys global habitat suitability based on known localities and four ecological parameters. METHODOLOGY/PRINCIPAL FINDINGS: We geo-referenced 751 localities for the four most studied Nephilengys species: N. cruentata (Africa, New World), N. livida (Madagascar), N. malabarensis (S-SE Asia), and N. papuana (Australasia). For each locality we overlaid four ecological parameters: elevation, annual mean temperature, annual mean precipitation, and land cover. We used linear backward regression within ArcGIS to select two best fit parameters per species model, and ModelBuilder to map areas of high, moderate and low habitat suitability for each species within its directional distribution. For Nephilengys cruentata suitable habitats are mid elevation tropics within Africa (natural range), a large part of Brazil and the Guianas (area of synanthropic spread), and even North Africa, Mediterranean, and Arabia. Nephilengys livida is confined to its known range with suitable habitats being mid-elevation natural and cultivated lands. Nephilengys malabarensis, however, ranges across the Equator throughout Asia where the model predicts many areas of high ecological suitability in the wet tropics. Its directional distribution suggests the species may potentially spread eastwards to New Guinea where the suitable areas of N. malabarensis largely surpass those of the native N. papuana, a species that prefers dry forests of Australian (sub)tropics. CONCLUSIONS: Our model is a customizable GIS tool intended to predict current and future potential

  14. Distributed Power Allocation for Wireless Sensor Network Localization: A Potential Game Approach.

    Science.gov (United States)

    Ke, Mingxing; Li, Ding; Tian, Shiwei; Zhang, Yuli; Tong, Kaixiang; Xu, Yuhua

    2018-05-08

    The problem of distributed power allocation in wireless sensor network (WSN) localization systems is investigated in this paper, using the game theoretic approach. Existing research focuses on the minimization of the localization errors of individual agent nodes over all anchor nodes subject to power budgets. When the service area and the distribution of target nodes are considered, finding the optimal trade-off between localization accuracy and power consumption is a new critical task. To cope with this issue, we propose a power allocation game where each anchor node minimizes the squared position error bound (SPEB) of the service area penalized by its individual power. Meanwhile, it is proven that the power allocation game is an exact potential game, which has at least one pure Nash equilibrium (NE). In addition, we prove the existence of an ϵ-equilibrium point, a refinement of NE, and show that the better-response dynamic approach can reach it. Analytical and simulation results demonstrate that: (i) when prior distribution information is available, the proposed strategies have better localization accuracy than the uniform strategies; (ii) when prior distribution information is unknown, the proposed strategies outperform power management strategies based on the second-order cone program (SOCP) for particular agent nodes after obtaining the estimated distribution of agent nodes. The proposed strategies also provide an instructive trade-off between power consumption and localization accuracy.
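    Better-response dynamics in an exact potential game can be sketched with a toy identical-interest game, where each player's cost equals a shared potential (every identical-interest game is trivially an exact potential game). The paper's actual SPEB-based objective is more involved, and all constants below are hypothetical:

```python
LEVELS = [0.0, 0.5, 1.0, 1.5, 2.0]  # discrete power levels per anchor
LAMBDA = 0.4                        # power penalty weight (hypothetical)
TARGET = 2.0                        # accuracy proxy: total power near TARGET

def potential(p):
    """Shared potential: localization-error proxy plus power penalty."""
    total = sum(p)
    return (TARGET - total) ** 2 + LAMBDA * total

def better_response(p):
    """Iterate: each player in turn switches to its best level if that
    strictly lowers the potential; stop when no player can improve
    (a pure Nash equilibrium of the potential game)."""
    p = list(p)
    improved = True
    while improved:
        improved = False
        for i in range(len(p)):
            best = min(LEVELS, key=lambda x: potential(p[:i] + [x] + p[i+1:]))
            if potential(p[:i] + [best] + p[i+1:]) < potential(p) - 1e-12:
                p[i] = best
                improved = True
    return p
```

    Because every improvement step strictly decreases the (bounded-below) potential, the dynamics must terminate at an equilibrium, which is the convergence argument exact potential games provide.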

  15. A Large-Scale 3D Object Recognition dataset

    DEFF Research Database (Denmark)

    Sølund, Thomas; Glent Buch, Anders; Krüger, Norbert

    2016-01-01

    geometric groups; concave, convex, cylindrical and flat 3D object models. The object models have varying amounts of local geometric features to challenge existing local shape feature descriptors in terms of descriptiveness and robustness. The dataset is validated in a benchmark which evaluates the matching...... performance of 7 different state-of-the-art local shape descriptors. Further, we validate the dataset in a 3D object recognition pipeline. Our benchmark shows, as expected, that local shape feature descriptors without any global point relation across the surface have a poor matching performance with flat

  16. The Wind Integration National Dataset (WIND) toolkit (Presentation)

    Energy Technology Data Exchange (ETDEWEB)

    Caroline Draxl: NREL

    2014-01-01

    Regional wind integration studies require detailed wind power output data at many locations to perform simulations of how the power system will operate under high penetration scenarios. The wind datasets that serve as inputs into the study must realistically reflect the ramping characteristics, spatial and temporal correlations, and capacity factors of the simulated wind plants, as well as being time synchronized with available load profiles.As described in this presentation, the WIND Toolkit fulfills these requirements by providing a state-of-the-art national (US) wind resource, power production and forecast dataset.

  17. seNorge2 daily precipitation, an observational gridded dataset over Norway from 1957 to the present day

    Science.gov (United States)

    Lussana, Cristian; Saloranta, Tuomo; Skaugen, Thomas; Magnusson, Jan; Tveito, Ole Einar; Andersen, Jess

    2018-02-01

    The conventional climate gridded datasets based on observations only are widely used in atmospheric sciences; our focus in this paper is on climate and hydrology. On the Norwegian mainland, seNorge2 provides high-resolution fields of daily total precipitation for applications requiring long-term datasets at the regional or national level, where the challenge is to simulate small-scale processes, often taking place in complex terrain. The dataset constitutes a valuable meteorological input for snow and hydrological simulations; it is updated daily and presented on a high-resolution grid (1 km grid spacing). The climate archive goes back to 1957. The spatial interpolation scheme builds upon classical methods, such as optimal interpolation and successive-correction schemes. An original approach based on (spatial) scale-separation concepts has been implemented which uses geographical coordinates and elevation as complementary information in the interpolation. seNorge2 daily precipitation fields represent local precipitation features at spatial scales of a few kilometers, depending on the station network density. In the surroundings of a station or in dense station areas, the predictions are quite accurate even for intense precipitation. For most of the grid points, the performances are comparable to or better than a state-of-the-art pan-European dataset (E-OBS), because of the higher effective resolution of seNorge2. However, in very data-sparse areas, such as in the mountainous region of southern Norway, seNorge2 underestimates precipitation because it does not make use of enough geographical information to compensate for the lack of observations. The evaluation of seNorge2 as the meteorological forcing for the seNorge snow model and the DDD (Distance Distribution Dynamics) rainfall-runoff model shows that both models have been able to make profitable use of seNorge2, partly because of the automatic calibration procedure they incorporate for precipitation. The seNorge2
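    The successive-correction idea behind such interpolation schemes can be sketched with a single Cressman-style pass, where station increments (observation minus background) are spread onto grid points with distance-dependent weights. This is a textbook simplification; the actual scheme also uses optimal interpolation and elevation information:

```python
def cressman_pass(grid_pts, background, obs_pts, obs_increments, radius):
    """One Cressman successive-correction pass on a constant background:
    each grid point receives a weighted mean of nearby observation
    increments, with weight (r^2 - d^2) / (r^2 + d^2) inside radius r."""
    out = []
    r2 = radius * radius
    for gx, gy in grid_pts:
        num = den = 0.0
        for (ox, oy), inc in zip(obs_pts, obs_increments):
            d2 = (gx - ox) ** 2 + (gy - oy) ** 2
            if d2 < r2:
                w = (r2 - d2) / (r2 + d2)
                num += w * inc
                den += w
        out.append(background + (num / den if den > 0 else 0.0))
    return out

# hypothetical: one station at (0, 0) observes 1.0 more than the background;
# a co-located grid point is fully corrected, a distant one is untouched
analysis = cressman_pass([(0.0, 0.0), (10.0, 10.0)], 5.0,
                         [(0.0, 0.0)], [1.0], radius=2.0)  # [6.0, 5.0]
```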

  18. An integrated pan-tropical biomass map using multiple reference datasets

    NARCIS (Netherlands)

    Avitabile, V.; Herold, M.; Heuvelink, G.B.M.; Lewis, S.L.; Phillips, O.L.; Asner, G.P.; Armston, J.; Asthon, P.; Banin, L.F.; Bayol, N.; Berry, N.; Boeckx, P.; Jong, De B.; Devries, B.; Girardin, C.; Kearsley, E.; Lindsell, J.A.; Lopez-gonzalez, G.; Lucas, R.; Malhi, Y.; Morel, A.; Mitchard, E.; Nagy, L.; Qie, L.; Quinones, M.; Ryan, C.M.; Slik, F.; Sunderland, T.; Vaglio Laurin, G.; Valentini, R.; Verbeeck, H.; Wijaya, A.; Willcock, S.

    2016-01-01

    We combined two existing datasets of vegetation aboveground biomass (AGB) (Proceedings of the National Academy of Sciences of the United States of America, 108, 2011, 9899; Nature Climate Change, 2, 2012, 182) into a pan-tropical AGB map at 1-km resolution using an independent reference dataset of

  19. Development and Application of Improved Long-Term Datasets of Surface Hydrology for Texas

    Directory of Open Access Journals (Sweden)

    Kyungtae Lee

    2017-01-01

    Full Text Available Freshwater availability and agricultural production are key factors for sustaining the fast growing population and economy in the state of Texas, which is the third largest state in terms of agricultural production in the United States. This paper describes a long-term (1918–2011 grid-based (1/8° surface hydrological dataset for Texas at a daily time step based on simulations from the Variable Infiltration Capacity (VIC hydrological model. The model was calibrated and validated against observed streamflow over 10 Texas river basins. The simulated soil moisture was also evaluated using in situ observations. Results suggest that there is a decreasing trend in precipitation and an increasing trend in temperature in most of the basins. Droughts and floods were reconstructed and analyzed. In particular, the spatially distributed severity and duration of major Texas droughts were compared to identify new characteristics. The modeled flood recurrence interval and the return period were also compared with observations. Results suggest the performance of extreme flood simulations needs further improvement. This dataset is expected to serve as a benchmark which may contribute to water resources management and to mitigating agricultural drought, especially in the context of understanding the effects of climate change on crop yield in Texas.

  20. Potential fluctuations due to randomly distributed charges at the semiconductor-insulator interface in mis-structures

    International Nuclear Information System (INIS)

    Yanchev, I; Slavcheva, G.

    1993-01-01

    A new expression for the Fourier transform of the binary correlation function of the random potential near the semiconductor-insulator interface is derived. The screening by the metal electrode in the MIS structure is taken into account by introducing an effective insulator thickness. An essential advantage of this correlation function is that it leads to a finite dispersion of the random potential, Γ², in contrast to previously known correlation functions, which lead to a divergent dispersion. Γ², the key characteristic of the random potential distribution that determines the amplitude of the potential fluctuations, is calculated. 7 refs. (orig.)

  1. Modeling the Potential Distribution of Picea chihuahuana Martínez, an Endangered Species at the Sierra Madre Occidental, Mexico

    Directory of Open Access Journals (Sweden)

    Victor Aguilar-Soto

    2015-03-01

    Full Text Available Species distribution models (SDMs) help identify areas for the development of populations or communities to prevent extinctions, especially in the face of global environmental change. This study modeled the potential distribution of the tree Picea chihuahuana Martínez, a species in danger of extinction, using the maximum entropy modeling method (MaxEnt) at three scales: local, state and national. We used a total of 38 presence data from the Sierra Madre Occidental. At the local scale, we compared MaxEnt with the reclassification and overlay method integrated in a geographic information system. MaxEnt generated maps with a high predictive capability (AUC > 0.97). The distribution of P. chihuahuana is defined by vegetation type and minimum temperature at national and state scales. At the local scale, both models calculated similar areas for the potential distribution of the species; the variables that better defined the species distribution were vegetation type, aspect and distance to water flows. Populations of P. chihuahuana have always been small, but our results show potential habitat greater than the area of the actual distribution. These results provide an insight into the availability of areas suitable for the species’ regeneration, possibly through assisted colonization.
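    The AUC figures quoted above have a simple probabilistic reading: the probability that a randomly chosen presence site receives a higher model score than a randomly chosen background site. A minimal sketch via the Mann-Whitney form:

```python
def auc(scores_pos, scores_neg):
    """Area under the ROC curve as the fraction of (presence, background)
    score pairs won by the presence site; ties count one half."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# perfectly separated scores give AUC = 1.0; identical scores give 0.5
```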

  2. Charged patchy particle models in explicit salt: Ion distributions, electrostatic potentials, and effective interactions.

    Science.gov (United States)

    Yigit, Cemil; Heyda, Jan; Dzubiella, Joachim

    2015-08-14

    We introduce a set of charged patchy particle models (CPPMs) in order to systematically study the influence of electrostatic charge patchiness and multipolarity on macromolecular interactions by means of implicit-solvent, explicit-ion Langevin dynamics simulations employing the Gromacs software. We consider well-defined zero-, one-, and two-patched spherical globules each of the same net charge and (nanometer) size which are composed of discrete atoms. The studied mono- and multipole moments of the CPPMs are comparable to those of globular proteins with similar size. We first characterize ion distributions and electrostatic potentials around a single CPPM. Although angle-resolved radial distribution functions reveal the expected local accumulation and depletion of counter- and co-ions around the patches, respectively, the orientation-averaged electrostatic potential shows only a small variation among the various CPPMs due to space charge cancellations. Furthermore, we study the orientation-averaged potential of mean force (PMF), the number of accumulated ions on the patches, as well as the CPPM orientations along the center-to-center distance of a pair of CPPMs. We compare the PMFs to the classical Derjaguin-Verwey-Landau-Overbeek theory and previously introduced orientation-averaged Debye-Hückel pair potentials including dipolar interactions. Our simulations confirm the adequacy of the theories in their respective regimes of validity, while low salt concentrations and large multipolar interactions remain a challenge for tractable theoretical descriptions.
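    The monopole (zero-patch) limit of the orientation-averaged pair potentials compared above is the screened Coulomb (Debye-Hückel) form. A sketch in units of kT at 298 K, with hypothetical charge and distance values; the paper's potentials additionally include dipolar terms:

```python
import math

def debye_hueckel(r_nm, q1, q2, debye_len_nm):
    """Screened Coulomb interaction energy, in kT at 298 K, between two
    point charges (in elementary charges) a distance r_nm apart in salt
    solution with the given Debye screening length (nm)."""
    BJERRUM_NM = 0.714  # Bjerrum length of water at 298 K
    return BJERRUM_NM * q1 * q2 * math.exp(-r_nm / debye_len_nm) / r_nm

# like charges repel (positive energy), opposite charges attract (negative);
# raising the salt concentration shortens the Debye length and weakens both
```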

  3. Present and potential distribution of Snub-nosed Monkey

    DEFF Research Database (Denmark)

    Nüchel, Jonas; Bøcher, Peder Klith; Svenning, Jens-Christian

    are the Snub-nosed Monkeys (Rhinopithecus), a temperate-subtropical East Asian genus. We use species distribution modeling to assess the following question of key relevancy for conservation management of Rhinopithecus; 1. Which climatic factors determine the present distribution of Rhinopithecus within...... distribution of Rhinopithecus within the region, considering climate, habitat availability and the locations of nature reserves. Keywords: biodiversity, biogeography, conservation, China, snub-nosed monkey, rhinopithecus, primates, species distribution modeling...

  4. Global Human Built-up And Settlement Extent (HBASE) Dataset From Landsat

    Data.gov (United States)

    National Aeronautics and Space Administration — The Global Human Built-up And Settlement Extent (HBASE) Dataset from Landsat is a global map of HBASE derived from the Global Land Survey (GLS) Landsat dataset for...

  5. Performance of the CORDEX regional climate models in simulating offshore wind and wind potential

    Science.gov (United States)

    Kulkarni, Sumeet; Deo, M. C.; Ghosh, Subimal

    2018-03-01

    This study quantifies the skill added by regional climate models (RCMs) over their parent general circulation models (GCMs) in simulating wind speed and wind potential, with particular reference to the Indian offshore region. To arrive at a suitable reference dataset, the performance of wind outputs from three different reanalysis datasets is evaluated. The comparison across the RCMs and their corresponding parent GCMs is done on the basis of annual/seasonal wind statistics, intermodel bias, wind climatology, and classes of wind potential. It was observed that while the RCMs could simulate the spatial variability of winds well for certain subregions, they generally failed to replicate the overall spatial pattern, especially in monsoon and winter. Various causes of biases in RCMs were determined by assessing corresponding maps of wind vectors, surface temperature, and sea-level pressure. The results highlight the necessity to carefully assess RCM-yielded winds before using them for sensitive applications such as coastal vulnerability and hazard assessment. A supplementary outcome of this study is a wind potential atlas based on the spatial distribution of wind classes, which could help identify viable subregions for developing offshore wind farms by intercomparing the RCM and GCM outcomes. It is encouraging that most of the RCMs and GCMs indicate that around 70% of the Indian offshore locations in monsoon would experience mean wind potential greater than 200 W/m2.
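    Mean wind power density, the quantity behind the 200 W/m2 threshold, follows from the cube of the wind speed. A minimal sketch; the "viable" label below is illustrative, not the study's classification scheme:

```python
def wind_power_density(speeds_ms, rho=1.225):
    """Mean wind power density (W/m^2) from a series of wind speeds (m/s):
    the average of 0.5 * rho * v^3, with rho the air density (kg/m^3)."""
    return 0.5 * rho * sum(v ** 3 for v in speeds_ms) / len(speeds_ms)

def wind_class(wpd):
    """Toy two-class threshold at the 200 W/m^2 level mentioned above."""
    return "viable" if wpd > 200.0 else "marginal"

# a steady 8 m/s wind: 0.5 * 1.225 * 8^3 = 313.6 W/m^2
```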

  6. Overview and Meteorological Validation of the Wind Integration National Dataset toolkit

    Energy Technology Data Exchange (ETDEWEB)

    Draxl, C. [National Renewable Energy Laboratory (NREL), Golden, CO (United States); Hodge, B. M. [National Renewable Energy Laboratory (NREL), Golden, CO (United States); Clifton, A. [National Renewable Energy Laboratory (NREL), Golden, CO (United States); McCaa, J. [3TIER by VAisala, Seattle, WA (United States)

    2015-04-13

    The Wind Integration National Dataset (WIND) Toolkit described in this report fulfills these requirements, and constitutes a state-of-the-art national wind resource data set covering the contiguous United States from 2007 to 2013 for use in a variety of next-generation wind integration analyses and wind power planning. The toolkit is a wind resource data set, wind forecast data set, and wind power production and forecast data set derived from the Weather Research and Forecasting (WRF) numerical weather prediction model. WIND Toolkit data are available online for over 116,000 land-based and 10,000 offshore sites representing existing and potential wind facilities.

  7. A global gas flaring black carbon emission rate dataset from 1994 to 2012

    Science.gov (United States)

    Huang, Kan; Fu, Joshua S.

    2016-11-01

    Global flaring of associated petroleum gas is a potential emission source of particulate matters (PM) and could be notable in some specific regions that are in urgent need of mitigation. PM emitted from gas flaring is mainly in the form of black carbon (BC), which is a strong short-lived climate forcer. However, BC from gas flaring has been neglected in most global/regional emission inventories and is rarely considered in climate modeling. Here we present a global gas flaring BC emission rate dataset for the period 1994-2012 in a machine-readable format. We develop a region-dependent gas flaring BC emission factor database based on the chemical compositions of associated petroleum gas at various oil fields. Gas flaring BC emission rates are estimated using this emission factor database and flaring volumes retrieved from satellite imagery. Evaluation using a chemical transport model suggests that consideration of gas flaring emissions can improve model performance. This dataset will benefit and inform a broad range of research topics, e.g., carbon budget, air quality/climate modeling, and environmental/human exposure.
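    The emission-rate calculation itself is a unit-consistent product of flared volume and emission factor. A sketch with hypothetical values; the paper derives its region-dependent factors from the gas composition at each oil field:

```python
def flaring_bc_gg(volume_bcm, ef_g_per_m3):
    """Black-carbon emissions in gigagrams from an annual flared gas
    volume (billion cubic meters, i.e. 10^9 m^3) and an emission factor
    (g BC per m^3 flared): the 10^9 factors for bcm and Gg cancel."""
    return volume_bcm * ef_g_per_m3

# hypothetical region: 2.0 bcm flared at 1.6 g/m^3 -> 3.2 Gg BC per year
```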

  8. The Lunar Source Disk: Old Lunar Datasets on a New CD-ROM

    Science.gov (United States)

    Hiesinger, H.

    1998-01-01

    A compilation of previously published datasets on CD-ROM is presented. This Lunar Source Disk is intended to be a first step in the improvement/expansion of the Lunar Consortium Disk, in order to create an "image-cube"-like data pool that can be easily accessed and might be useful for a variety of future lunar investigations. All datasets were transformed to a standard map projection that allows direct comparison of different types of information on a pixel-by-pixel basis. Lunar observations have a long history and have been important to mankind for centuries, notably since the work of Plutarch and Galileo. As a consequence of centuries of lunar investigations, knowledge of the characteristics and properties of the Moon has accumulated over time. However, a side effect of this accumulation is that it has become more and more complicated for scientists to review all the datasets obtained through different techniques, to interpret them properly, to recognize their weaknesses and strengths in detail, and to combine them synoptically in geologic interpretations. Such synoptic geologic interpretations are crucial for the study of planetary bodies through remote-sensing data in order to avoid misinterpretation. In addition, many of the modern datasets, derived from Earth-based telescopes as well as from spacecraft missions, are acquired at different geometric and radiometric conditions. These differences make it challenging to compare or combine datasets directly or to extract information from different datasets on a pixel-by-pixel basis. Also, as there is no convention for the presentation of lunar datasets, different authors choose different map projections, depending on the location of the investigated areas and their personal interests. Insufficient or incomplete information on the map parameters used by different authors further complicates the reprojection of these datasets to a standard geometry. The goal of our efforts was to transfer previously published lunar

  9. Operational Prediction of the Habitat Suitability Index (HSI) Distribution for Neon Flying Squid in Central North Pacific by Using FORA Dataset and a New Data Assimilation System SKUIDS

    Science.gov (United States)

    Igarashi, H.; Ishikawa, Y.; Wakamatsu, T.; Tanaka, Y.; Nishikawa, S.; Nishikawa, H.; Kamachi, M.; Kuragano, T.; Takatsuki, Y.; Fujii, Y.; Usui, N.; Toyoda, T.; Hirose, N.; Sakai, M.; Saitoh, S. I.; Imamura, Y.

    2016-02-01

    The neon flying squid (Ommastrephes bartramii) has a wide-spread distribution in subtropical and temperate waters in the North Pacific, plays an important role in the pelagic ecosystem, and is one of the major targets of Japanese squid fisheries. The main fishing areas for Japanese commercial vessels are located in the central North Pacific (35-45N, around the date line) in summer. In this study, we have developed several kinds of habitat suitability index (HSI) models of the neon flying squid for investigating the relationship between its potential habitat and the ocean state variations in the target area. For developing HSI models, we have used a new ocean reanalysis dataset, FORA (4-dimensional variational Ocean Re-Analysis), produced by JAMSTEC/CEIST and MRI-JMA. The horizontal resolution is 0.1 x 0.1 degrees in latitude and longitude, with 54 vertical levels, which can provide realistic fields of 3-dimensional ocean circulation and environmental structures including meso-scale eddies. In addition, we have developed a new 4D-VAR (4-dimensional variational) ocean data assimilation system for predicting ocean environmental changes in the main fishing grounds. We call this system "SKUIDS" (Scalable Kit of Under-sea Information Delivery System). Using these prediction fields of temperature, salinity, sea surface height, and horizontal current velocity, we produced daily HSI maps of the neon flying squid and provided them to the Japanese commercial vessels in operation. Squid fishermen on board can access the web site, which delivers ocean-environment information for the fishing grounds via Inmarsat satellite communication, and view the predicted fields of subsurface temperature and HSI. Here, we present the details of SKUIDS and the web-delivery system for the squid fishery, and some preliminary results of the operational prediction.
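    HSI models typically rescale each environmental variable (temperature, SSH, and so on) to a per-variable suitability index in [0, 1] and then combine the indices. A geometric-mean sketch, which is one common formulation and not necessarily the one used in this study:

```python
import math

def hsi(suitability_indices):
    """Habitat suitability index as the geometric mean of per-variable
    suitability indices, each already rescaled to [0, 1]; any single
    fully unsuitable variable (0.0) drives the HSI to zero."""
    if any(not 0.0 <= s <= 1.0 for s in suitability_indices):
        raise ValueError("suitability indices must lie in [0, 1]")
    product = math.prod(suitability_indices)
    return product ** (1.0 / len(suitability_indices))

# hypothetical cell: temperature suitability 0.25, SSH suitability 1.0
value = hsi([0.25, 1.0])  # 0.5
```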

  10. Etalon (standard) for surface potential distribution produced by electric activity of the heart.

    Science.gov (United States)

    Szathmáry, V; Ruttkay-Nedecký, I

    1981-01-01

    The authors submit etalon (standard) equipotential maps as an aid in the evaluation of maps of surface potential distributions in living subjects. They were obtained by measuring potentials on the surface of an electrolytic tank shaped like the thorax. The individual etalon maps were determined in such a way that the parameters of the physical dipole forming the source of the electric field in the tank corresponded to the mean vectorcardiographic parameters measured in a healthy population sample. The technique also allows a quantitative estimate of the degree of non-dipolarity of the heart as the source of the electric field.
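    The physical-dipole field underlying such etalon maps is, in an unbounded homogeneous conductor, phi = p.r / (4 pi sigma |r|^3). A sketch of that core formula (the tank measurement adds torso-shaped boundary effects that this omits; the conductivity value below is hypothetical):

```python
import math

def dipole_potential(obs_xyz, dip_xyz, moment_xyz, sigma=0.2):
    """Potential (V) of a current dipole (moment in A*m) at an observation
    point, both in meters, in an infinite homogeneous conductor of
    conductivity sigma (S/m): phi = p.r / (4 pi sigma |r|^3)."""
    rx, ry, rz = (o - d for o, d in zip(obs_xyz, dip_xyz))
    r = math.sqrt(rx * rx + ry * ry + rz * rz)
    p_dot_r = moment_xyz[0] * rx + moment_xyz[1] * ry + moment_xyz[2] * rz
    return p_dot_r / (4.0 * math.pi * sigma * r ** 3)

# the field is antisymmetric along the dipole axis and vanishes in the
# plane perpendicular to the moment through the dipole location
```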

  11. Gridded 5km GHCN-Daily Temperature and Precipitation Dataset, Version 1

    Data.gov (United States)

    National Oceanic and Atmospheric Administration, Department of Commerce — The Gridded 5km GHCN-Daily Temperature and Precipitation Dataset (nClimGrid) consists of four climate variables derived from the GHCN-D dataset: maximum temperature,...

  12. ENHANCED DATA DISCOVERABILITY FOR IN SITU HYPERSPECTRAL DATASETS

    Directory of Open Access Journals (Sweden)

    B. Rasaiah

    2016-06-01

    Full Text Available Field spectroscopic metadata is a central component in the quality assurance, reliability, and discoverability of hyperspectral data and the products derived from it. Cataloguing, mining, and interoperability of these datasets rely upon the robustness of metadata protocols for field spectroscopy, and on the software architecture to support the exchange of these datasets. Currently no standard for in situ spectroscopy data or metadata protocols exists. This inhibits the effective sharing of growing volumes of in situ spectroscopy datasets and the exploitation of the benefits of integrating with the evolving range of data sharing platforms. A core metadataset for field spectroscopy was introduced by Rasaiah et al. (2011-2015) with extended support for specific applications. This paper presents a prototype model for an OGC- and ISO-compliant platform-independent metadata discovery service aligned to the specific requirements of field spectroscopy. In this study, a proof-of-concept metadata catalogue has been described and deployed in a cloud-based architecture as a demonstration of an operationalized field spectroscopy metadata standard and web-based discovery service.

  13. A dataset on the species composition of amphipods (Crustacea) in a Mexican marine national park: Alacranes Reef, Yucatan.

    Science.gov (United States)

    Paz-Ríos, Carlos E; Simões, Nuno; Pech, Daniel

    2018-01-01

    Alacranes Reef was declared as a National Marine Park in 1994. Since then, many efforts have been made to inventory its biodiversity. However, groups such as amphipods have been underestimated or not considered when benthic invertebrates were inventoried. Here we present a dataset that contributes to the knowledge of benthic amphipods (Crustacea, Peracarida) from the inner lagoon habitats from the Alacranes Reef National Park, the largest coral reef ecosystem in the Gulf of Mexico. The dataset contains information on records collected from 2009 to 2011. Data are available through Global Biodiversity Information Facility (GBIF). A total of 110 amphipod species distributed in 93 nominal species and 17 generic species, belonging to 71 genera, 33 families and three suborders are presented here. This information represents the first online dataset of amphipods from the Alacranes Reef National Park. The biological material is currently deposited in the crustacean collection from the regional unit of the National Autonomous University of Mexico located at Sisal, Yucatan, Mexico (UAS-Sisal). The biological material includes 588 data records with a total abundance of 6,551 organisms. The species inventory represents, until now, the richest fauna of benthic amphipods registered from any discrete coral reef ecosystem in Mexico.

  14. Environmental Dataset Gateway (EDG) CS-W Interface

    Data.gov (United States)

    U.S. Environmental Protection Agency — Use the Environmental Dataset Gateway (EDG) to find and access EPA's environmental resources. Many options are available for easily reusing EDG content in other...

  15. Relevance of octanol-water distribution measurements to the potential ecological uptake of multi-walled carbon nanotubes.

    Science.gov (United States)

    Petersen, Elijah J; Huang, Qingguo; Weber, Walter J

    2010-05-01

    Many potential applications of carbon nanotubes (CNTs) require various physicochemical modifications prior to use, suggesting that nanotubes with varied properties may pose risks in ecosystems. A means for estimating the bioaccumulation potentials of variously modified CNTs, for incorporation in predictive fate models, would be highly valuable. An approach commonly used for sparingly soluble organic contaminants, and previously suggested for carbonaceous nanomaterials as well, involves measurement of their octanol-water partitioning coefficient (KOW) values. To test the applicability of this approach, a methodology was developed to measure apparent octanol-water distribution behaviors for purified multi-walled carbon nanotubes and their acid-treated counterparts. Substantial differences in apparent distribution coefficients between the two types of CNTs were observed, but these differences did not influence accumulation by either earthworms (Eisenia foetida) or oligochaetes (Lumbriculus variegatus), both of which showed minimal uptake of both types of nanotubes. The results suggest that traditional distribution behavior-based KOW approaches are likely not appropriate for predicting CNT bioaccumulation. Copyright (c) 2010 SETAC.

  16. Annotating spatio-temporal datasets for meaningful analysis in the Web

    Science.gov (United States)

    Stasch, Christoph; Pebesma, Edzer; Scheider, Simon

    2014-05-01

    More and more environmental datasets that vary in space and time are available on the Web. This brings the advantage that data can be used for purposes other than those originally foreseen, but also the danger that users may apply inappropriate analysis procedures because they lack important assumptions made during the data collection process. In order to guide users towards a meaningful (statistical) analysis of spatio-temporal datasets available on the Web, we developed a Higher-Order-Logic formalism that captures some relevant assumptions in our previous work [1]. It allows proving, in a semi-automated fashion, whether spatial prediction and aggregation are meaningful. In this poster presentation, we present a concept for annotating spatio-temporal datasets available on the Web with concepts defined in our formalism. To this end, we have defined a subset of the formalism as a Web Ontology Language (OWL) pattern. It captures the distinction between the different spatio-temporal variable types, i.e. point patterns, fields, lattices and trajectories, which in turn determine whether a particular dataset can be interpolated or aggregated in a meaningful way using a certain procedure. The actual annotations that link spatio-temporal datasets with the concepts in the ontology pattern are provided as Linked Data. To allow data producers to add the annotations to their datasets, we have implemented a Web portal that uses a triple store at the backend to store the annotations and to make them available in the Linked Data cloud. Furthermore, we have implemented functions in the statistical environment R to retrieve the RDF annotations and, based on these annotations, to support a stronger typing of spatio-temporal datatypes, guiding towards a meaningful analysis in R. [1] Stasch, C., Scheider, S., Pebesma, E., Kuhn, W. (2014): "Meaningful spatial prediction and aggregation", Environmental Modelling & Software, 51, 149-165.
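    The type-driven rules described above can be sketched in a few lines. The following is a minimal illustration with made-up dataset URIs and a simplified rule table; it is not the actual OWL pattern or the R functions of Stasch et al., only the idea that the annotated variable type determines which analyses are meaningful:

```python
# Hypothetical annotations: dataset URI -> spatio-temporal variable type.
annotations = {
    "http://example.org/data/pm10": "field",              # continuous in space
    "http://example.org/data/districts_pop": "lattice",   # values tied to areal units
    "http://example.org/data/bird_sightings": "point_pattern",
    "http://example.org/data/gps_tracks": "trajectory",
}

# Illustrative rules (assumed, not from the paper): fields may be interpolated,
# lattices and point patterns may be summed/counted over regions, etc.
MEANINGFUL = {
    "interpolate": {"field"},
    "aggregate_sum": {"lattice", "point_pattern"},
    "aggregate_mean": {"field", "lattice"},
}

def is_meaningful(dataset_uri: str, operation: str) -> bool:
    """Return True if `operation` is meaningful for the dataset's annotated type."""
    var_type = annotations[dataset_uri]
    return var_type in MEANINGFUL.get(operation, set())

print(is_meaningful("http://example.org/data/pm10", "interpolate"))           # True
print(is_meaningful("http://example.org/data/districts_pop", "interpolate"))  # False
```

    In the real system the annotations live in a triple store and are queried over Linked Data; the lookup table here stands in for that query step.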

  17. Mapping current and potential distribution of non-native Prosopis juliflora in the Afar region of Ethiopia.

    Science.gov (United States)

    Wakie, Tewodros T; Evangelista, Paul H; Jarnevich, Catherine S; Laituri, Melinda

    2014-01-01

    We used correlative models with species occurrence points, Moderate Resolution Imaging Spectroradiometer (MODIS) vegetation indices, and topo-climatic predictors to map the current distribution and potential habitat of invasive Prosopis juliflora in Afar, Ethiopia. Time-series of MODIS Enhanced Vegetation Indices (EVI) and Normalized Difference Vegetation Indices (NDVI) with 250 m spatial resolution were selected as remote sensing predictors for mapping distributions, while WorldClim bioclimatic products and topographic variables generated from the Shuttle Radar Topography Mission product (SRTM) were used to predict potential infestations. We ran Maxent models using non-correlated variables and the 143 species-occurrence points. Maxent-generated probability surfaces were converted into binary maps using the 10-percentile logistic threshold values. Model performance was evaluated using the area under the receiver-operating characteristic (ROC) curve (AUC). Our results indicate that the extent of P. juliflora invasion is approximately 3,605 km2 in the Afar region (AUC = 0.94), while the potential habitat for future infestations is 5,024 km2 (AUC = 0.95). Our analyses demonstrate that time-series of MODIS vegetation indices and species occurrence points can be used with the Maxent modeling software to map the current distribution of P. juliflora, while topo-climatic variables are good predictors of potential habitat in Ethiopia. Our results can quantify current and future infestations, and inform management and policy decisions for containing P. juliflora. Our methods can also be replicated for managing invasive species in other East African countries.
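    The two post-processing steps mentioned above, thresholding a probability surface at the 10-percentile training-presence value and scoring with AUC, can be sketched as follows. All numbers are invented; only the threshold rule and the rank-based AUC formulation are standard:

```python
def percentile(values, q):
    """q-th percentile via linear interpolation (numpy's default convention)."""
    s = sorted(values)
    k = (len(s) - 1) * q / 100.0
    f = int(k)
    c = min(f + 1, len(s) - 1)
    return s[f] + (s[c] - s[f]) * (k - f)

def auc(pos_scores, neg_scores):
    """Mann-Whitney formulation: P(score_pos > score_neg), ties count half."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

# Invented model probabilities at presence points and background points
presence = [0.91, 0.84, 0.77, 0.70, 0.66, 0.52, 0.49, 0.45, 0.40, 0.31]
background = [0.62, 0.40, 0.35, 0.28, 0.22, 0.15, 0.12, 0.09, 0.05, 0.02]

threshold = percentile(presence, 10)        # 10-percentile presence threshold
surface = [0.80, 0.55, 0.33, 0.10, 0.72]    # toy probability surface (5 cells)
binary_map = [1 if p >= threshold else 0 for p in surface]

print(threshold, binary_map, auc(presence, background))
```

    With these toy numbers the threshold is about 0.39, the binary map keeps three of the five cells, and the AUC is 0.925.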

  18. Evolving hard problems: Generating human genetics datasets with a complex etiology

    Directory of Open Access Journals (Sweden)

    Himmelstein Daniel S

    2011-07-01

    Full Text Available Abstract Background A goal of human genetics is to discover genetic factors that influence individuals' susceptibility to common diseases. Most common diseases are thought to result from the joint failure of two or more interacting components instead of single component failures. This greatly complicates both the task of selecting informative genetic variants and the task of modeling interactions between them. We and others have previously developed algorithms to detect and model the relationships between these genetic factors and disease. Previously these methods have been evaluated with datasets simulated according to pre-defined genetic models. Results Here we develop and evaluate a model-free evolution strategy to generate datasets which display a complex relationship between individual genotype and disease susceptibility. We show that this model-free approach is capable of generating a diverse array of datasets with distinct gene-disease relationships for an arbitrary interaction order and sample size. We specifically generate eight hundred Pareto fronts; one for each independent run of our algorithm. In each run the predictiveness of single genetic variants and pairs of genetic variants has been minimized, while the predictiveness of third-, fourth-, or fifth-order combinations is maximized. Two hundred runs of the algorithm are further dedicated to creating datasets with predictive fourth- or fifth-order interactions and minimized lower-level effects. Conclusions This method and the resulting datasets will allow the capabilities of novel methods to be tested without pre-specified genetic models. This allows researchers to evaluate which methods will succeed on human genetics problems where the model is not known in advance. We further make freely available to the community the entire Pareto-optimal front of datasets from each run so that novel methods may be rigorously evaluated.
These 76,600 datasets are available from http://discovery.dartmouth.edu/model_free_data/.
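    The notion of "predictiveness" that is minimised for low-order effects and maximised for higher-order combinations can be illustrated with a lookup-table classifier: the best accuracy achievable by mapping each genotype combination to a class. This sketch uses an invented parity-style toy model, not the published datasets:

```python
from collections import Counter
from itertools import combinations

# Toy case/control data: each row is (genotype tuple over 3 SNPs, status).
# Status is the parity of the three bits, a classic 3-way interaction.
samples = [
    ((0, 0, 1), 1), ((0, 1, 0), 1), ((1, 0, 0), 1), ((1, 1, 1), 1),
    ((0, 0, 0), 0), ((0, 1, 1), 0), ((1, 0, 1), 0), ((1, 1, 0), 0),
]

def predictiveness(snp_indices, data):
    """Accuracy of the best rule mapping each genotype combination to a class."""
    table = {}
    for genotype, status in data:
        key = tuple(genotype[i] for i in snp_indices)
        table.setdefault(key, Counter())[status] += 1
    correct = sum(max(c.values()) for c in table.values())
    return correct / len(data)

# Single SNPs and pairs are uninformative (accuracy 0.5), but the full
# triple predicts status perfectly, mirroring the paper's design goal.
for order in (1, 2, 3):
    scores = [predictiveness(c, samples) for c in combinations(range(3), order)]
    print(order, scores)
```

    The evolution strategy searches for datasets with exactly this profile at larger orders and realistic sample sizes.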

  19. A Dataset from TIMSS to Examine the Relationship between Computer Use and Mathematics Achievement

    Science.gov (United States)

    Kadijevich, Djordje M.

    2015-01-01

    Because the relationship between computer use and achievement is still puzzling, there is a need to prepare and analyze good quality datasets on computer use and achievement. Such a dataset can be derived from TIMSS data. This paper describes how this dataset can be prepared. It also gives an example of how the dataset may be analyzed. The…

  20. Mapping Current and Potential Distribution of Non-Native Prosopis juliflora in the Afar Region of Ethiopia

    OpenAIRE

    Wakie, Tewodros T.; Evangelista, Paul H.; Jarnevich, Catherine S.; Laituri, Melinda

    2014-01-01

    We used correlative models with species occurrence points, Moderate Resolution Imaging Spectroradiometer (MODIS) vegetation indices, and topo-climatic predictors to map the current distribution and potential habitat of invasive Prosopis juliflora in Afar, Ethiopia. Time-series of MODIS Enhanced Vegetation Indices (EVI) and Normalized Difference Vegetation Indices (NDVI) with 250 m spatial resolution were selected as remote sensing predictors for mapping distributions, while WorldClim bioclim...

  1. GTI: a novel algorithm for identifying outlier gene expression profiles from integrated microarray datasets.

    Directory of Open Access Journals (Sweden)

    John Patrick Mpindi

    Full Text Available BACKGROUND: Meta-analysis of gene expression microarray datasets presents significant challenges for statistical analysis. We developed and validated a new bioinformatic method for the identification of genes upregulated in subsets of samples of a given tumour type ('outlier genes'), a hallmark of potential oncogenes. METHODOLOGY: A new statistical method (the gene tissue index, GTI) was developed by modifying and adapting algorithms originally developed for statistical problems in economics. We compared the potential of the GTI to detect outlier genes in meta-datasets with four previously defined statistical methods, COPA, the OS statistic, the t-test and ORT, using simulated data. We demonstrated that the GTI performed as well as existing methods in a single-study simulation. Next, we evaluated the performance of the GTI in the analysis of combined Affymetrix gene expression data from several published studies covering 392 normal samples of tissue from the central nervous system, 74 astrocytomas, and 353 glioblastomas. According to the results, the GTI was better able than most of the previous methods to identify known oncogenic outlier genes. In addition, the GTI identified 29 novel outlier genes in glioblastomas, including TYMS and CDKN2A. The over-expression of these genes was validated in vivo by immunohistochemical staining data from clinical glioblastoma samples. Immunohistochemical data were available for 65% (19 of 29) of these genes, and 17 of these 19 genes (90%) showed a typical outlier staining pattern. Furthermore, raltitrexed, a specific inhibitor of TYMS used in the therapy of tumour types other than glioblastoma, also effectively blocked cell proliferation in glioblastoma cell lines, thus highlighting this outlier gene candidate as a potential therapeutic target. CONCLUSIONS/SIGNIFICANCE: Taken together, these results support the GTI as a novel approach to identify potential oncogene outliers and drug targets.
The algorithm is
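    The GTI statistic itself is not specified in the abstract above. As an illustration of what an outlier-gene score looks like, the following sketch implements the closely related COPA-style transform it is compared against (median-centre, scale by the median absolute deviation, then take a high percentile); the expression values are invented:

```python
from statistics import median

def copa_score(values, pct=90):
    """COPA-style outlier score: high percentile of (x - median) / MAD."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    transformed = sorted((v - med) / mad for v in values)
    k = int(round((pct / 100) * (len(transformed) - 1)))
    return transformed[k]

# A gene overexpressed only in a subset of tumours ("outlier" pattern)
# scores far higher than a gene shifted uniformly across all samples.
outlier_gene = [1.0, 1.1, 0.9, 1.0, 1.2, 0.8, 6.0, 7.5, 8.0, 1.1]
uniform_gene = [2.4, 2.6, 2.5, 2.3, 2.7, 2.5, 2.4, 2.6, 2.5, 2.4]

print(copa_score(outlier_gene) > copa_score(uniform_gene))  # True
```

    A t-test, by contrast, rewards uniform shifts, which is why outlier-specific statistics such as COPA, ORT and the GTI were developed.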

  2. Data Recommender: An Alternative Way to Discover Open Scientific Datasets

    Science.gov (United States)

    Klump, J. F.; Devaraju, A.; Williams, G.; Hogan, D.; Davy, R.; Page, J.; Singh, D.; Peterson, N.

    2017-12-01

    Over the past few years, institutions and government agencies have adopted policies to openly release their data, which has resulted in huge amounts of open data becoming available on the web. When trying to discover the data, users face two challenges: an overload of choice and the limitations of the existing data search tools. On the one hand, there are too many datasets to choose from, and therefore, users need to spend considerable effort to find the datasets most relevant to their research. On the other hand, data portals commonly offer keyword and faceted search, which depend fully on the user queries to search and rank relevant datasets. Consequently, keyword and faceted search may return loosely related or irrelevant results even though those results contain the query terms. They may also return highly specific results that depend more on how well the metadata was authored. They do not account well for variance in metadata due to differences in author styles and preferences. The top-ranked results may also come from the same data collection, so users are unlikely to discover new and interesting datasets. These search modes mainly suit users who can express their information needs in terms of the structure and terminology of the data portals, but may pose a challenge otherwise. The above challenges reflect the need for a solution that delivers the most relevant (i.e., similar and serendipitous) datasets to users, beyond the existing search functionalities of the portals. A recommender system is an information filtering system that presents users with relevant and interesting content based on users' context and preferences. Delivering data recommendations to users can make data discovery easier, and as a result may enhance user engagement with the portal. We developed a hybrid data recommendation approach for the CSIRO Data Access Portal. The approach leverages existing recommendation techniques (e.g., content-based filtering and item co-occurrence) to produce
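    The item co-occurrence component of such a hybrid recommender can be sketched as follows, using invented session data and dataset identifiers (datasets accessed together in the same user session are treated as related):

```python
from collections import Counter
from itertools import combinations

# Hypothetical access sessions: each set holds dataset IDs viewed together.
sessions = [
    {"soil-moisture", "rainfall", "ndvi"},
    {"soil-moisture", "rainfall"},
    {"rainfall", "streamflow"},
    {"ndvi", "land-cover"},
    {"soil-moisture", "ndvi"},
]

# Count how often each ordered pair of datasets co-occurs in a session.
cooccur = Counter()
for session in sessions:
    for a, b in combinations(sorted(session), 2):
        cooccur[(a, b)] += 1
        cooccur[(b, a)] += 1

def recommend(dataset, k=2):
    """Top-k datasets most often co-accessed with `dataset`."""
    scores = Counter({b: n for (a, b), n in cooccur.items() if a == dataset})
    return [name for name, _ in scores.most_common(k)]

print(recommend("soil-moisture"))  # rainfall and ndvi, in some order
```

    A production system would blend these counts with content-based similarity over the metadata, which is the "hybrid" part of the approach described above.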

  3. Spatial Structure of Cities and Distribution of Retail Sales - Analysis based on a Potential NEG Model (Japanese)

    OpenAIRE

    NAKAMURA Ryohei; TAKATSUKA Hajime

    2009-01-01

    This paper aims to estimate retail sales turnover in cities based on a potential New Economic Geography (NEG) model, addressing how the distribution of sales turnover is explained by the spatial structure of cities. The model also takes population distribution into account, treating the spatial distribution of population and retail sales turnover within cities using district- and street-level (cho and chome) data. The cities covered are prefectural cities excluding government ordinance cities l...

  4. Data assimilation and model evaluation experiment datasets

    Science.gov (United States)

    Lai, Chung-Cheng A.; Qian, Wen; Glenn, Scott M.

    1994-01-01

    The Institute for Naval Oceanography, in cooperation with Naval Research Laboratories and universities, executed the Data Assimilation and Model Evaluation Experiment (DAMEE) for the Gulf Stream region during fiscal years 1991-1993. Enormous effort has gone into the preparation of several high-quality and consistent datasets for model initialization and verification. This paper describes the preparation process, the temporal and spatial scopes, the contents, the structure, etc., of these datasets. The goal of DAMEE and the need of data for the four phases of experiment are briefly stated. The preparation of DAMEE datasets consisted of a series of processes: (1) collection of observational data; (2) analysis and interpretation; (3) interpolation using the Optimum Thermal Interpolation System package; (4) quality control and re-analysis; and (5) data archiving and software documentation. The data products from these processes included a time series of 3D fields of temperature and salinity, 2D fields of surface dynamic height and mixed-layer depth, analysis of the Gulf Stream and rings system, and bathythermograph profiles. To date, these are the most detailed and high-quality data for mesoscale ocean modeling, data assimilation, and forecasting research. Feedback from ocean modeling groups who tested this data was incorporated into its refinement. Suggestions for DAMEE data usages include (1) ocean modeling and data assimilation studies, (2) diagnosis and theoretical studies, and (3) comparisons with locally detailed observations.

  5. Emerging ecological datasets with application for modeling North American dust emissions

    Science.gov (United States)

    McCord, S.; Stauffer, N. G.; Garman, S.; Webb, N.

    2017-12-01

    In 2011 the US Bureau of Land Management (BLM) established the Assessment, Inventory and Monitoring (AIM) program to monitor the condition of BLM land and to provide data to support evidence-based management of multi-use public lands. The monitoring program shares core data collection methods with the Natural Resources Conservation Service's (NRCS) National Resources Inventory (NRI), implemented on private lands nationally. Combined, the two programs have sampled >30,000 locations since 2003 to provide vegetation composition, vegetation canopy height, the size distribution of inter-canopy gaps, soil texture and crusting information on rangelands and pasture lands across North America. The BLM implements AIM on more than 247.3 million acres of land across the western US, encompassing major dust source regions of the Chihuahuan, Sonoran, Mojave and Great Basin deserts, the Colorado Plateau, and potential high-latitude dust sources in Alaska. The AIM data are publicly available and can be used to support modeling of land surface and boundary-layer processes, including dust emission. While understanding US dust source regions and emission processes has been of national interest since the 1930s Dust Bowl, most attention has been directed to the croplands of the Great Plains and emission hot spots like Owens Lake, California. The magnitude, spatial extent and temporal dynamics of dust emissions from western dust source areas remain highly uncertain. Here, we use ensemble modeling with empirical and physically-based dust emission schemes applied to AIM monitoring data to assess regional-scale patterns of aeolian sediment mass fluxes and dust emissions. The analysis enables connections to be made between dust emission rates at source and other indicators of ecosystem function at the landscape scale. Emerging ecological datasets like AIM provide new opportunities to evaluate aeolian sediment transport responses to land surface conditions, potential interactions with

  6. Artificial intelligence (AI) systems for interpreting complex medical datasets.

    Science.gov (United States)

    Altman, R B

    2017-05-01

    Advances in machine intelligence have created powerful capabilities in algorithms that find hidden patterns in data, classify objects based on their measured characteristics, and associate similar patients/diseases/drugs based on common features. However, artificial intelligence (AI) applications in medicine face several technical challenges: complex and heterogeneous datasets, noisy medical data, and the difficulty of explaining algorithmic output to users. There are also social challenges related to intellectual property, data provenance, regulatory issues, economics, and liability. © 2017 ASCPT.

  7. Full-Scale Approximations of Spatio-Temporal Covariance Models for Large Datasets

    KAUST Repository

    Zhang, Bohai; Sang, Huiyan; Huang, Jianhua Z.

    2014-01-01

    of dataset and application of such models is not feasible for large datasets. This article extends the full-scale approximation (FSA) approach by Sang and Huang (2012) to the spatio-temporal context to reduce computational complexity. A reversible jump Markov

  8. PERFORMANCE COMPARISON FOR INTRUSION DETECTION SYSTEM USING NEURAL NETWORK WITH KDD DATASET

    Directory of Open Access Journals (Sweden)

    S. Devaraju

    2014-04-01

    Full Text Available Intrusion detection, i.e. deciding whether a user is a normal user or an attacker, is a challenging task in organizational information systems and the IT industry. The Intrusion Detection System is an effective method to deal with this kind of problem in networks. Different classifiers are used to detect the different kinds of attacks in networks. In this paper, the performance of intrusion detection is compared across various neural network classifiers. In the proposed research the four types of classifiers used are Feed Forward Neural Network (FFNN), Generalized Regression Neural Network (GRNN), Probabilistic Neural Network (PNN) and Radial Basis Neural Network (RBNN). The performance on the full-featured KDD Cup 1999 dataset is compared with that on a reduced-feature KDD Cup 1999 dataset. The MATLAB software is used to train and test the dataset, and the efficiency and False Alarm Rate are measured. The results show that the reduced dataset performs better than the full-featured dataset.
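    The two evaluation measures named above, efficiency (detection rate) and False Alarm Rate, can be computed from a confusion matrix as sketched below. The labels and predictions are invented (1 = attack record, 0 = normal record):

```python
def detection_and_false_alarm(y_true, y_pred):
    """Detection rate = TP/(TP+FN); False Alarm Rate = FP/(FP+TN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp / (tp + fn), fp / (fp + tn)

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]   # 4 attacks, 6 normal records
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]   # hypothetical classifier output

dr, far = detection_and_false_alarm(y_true, y_pred)
print(dr, far)  # 0.75 and ~0.167
```

    A good intrusion detector pushes the detection rate towards 1 while holding the False Alarm Rate near 0; comparing these two numbers across the full and reduced KDD feature sets is exactly the comparison the paper reports.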

  9. Review of ATLAS Open Data 8 TeV datasets, tools and activities

    CERN Document Server

    The ATLAS collaboration

    2018-01-01

    The ATLAS Collaboration has released two 8 TeV datasets and relevant simulated samples to the public for educational use. A number of groups within ATLAS have used these ATLAS Open Data 8 TeV datasets, developing tools and educational material to promote particle physics. The general aim of these activities is to provide simple and user-friendly interactive interfaces to simulate the procedures used by high-energy physics researchers. International Masterclasses introduce particle physics to high school students and have been studying 8 TeV ATLAS Open Data since 2015. Inspired by this success, a new ATLAS Open Data initiative was launched in 2016 for university students. A comprehensive educational platform was thus developed featuring a second 8 TeV dataset and a new set of educational tools. The 8 TeV datasets and associated tools are presented and discussed here, as well as a selection of activities studying the ATLAS Open Data 8 TeV datasets.

  10. Recent Development on the NOAA's Global Surface Temperature Dataset

    Science.gov (United States)

    Zhang, H. M.; Huang, B.; Boyer, T.; Lawrimore, J. H.; Menne, M. J.; Rennie, J.

    2016-12-01

    Global Surface Temperature (GST) is one of the most widely used indicators for climate trend and extreme analyses. A widely used GST dataset is the NOAA merged land-ocean surface temperature dataset known as NOAAGlobalTemp (formerly MLOST). NOAAGlobalTemp was recently updated from version 3.5.4 to version 4. The update includes a significant improvement in the ocean surface component (Extended Reconstructed Sea Surface Temperature, or ERSST, from version 3b to version 4), which resulted in increased temperature trends in recent decades. Since then, advancements in both the ocean component (ERSST) and the land component (GHCN-Monthly) have been made, including the inclusion of Argo float SSTs and expanded EOT modes in ERSST, and the use of the ISTI databank in GHCN-Monthly. In this presentation, we describe the impact of those improvements on the merged global temperature dataset, in terms of global trends and other aspects.
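    Statements about "increased temperature trends" reduce to a least-squares slope over a time series of anomalies. A minimal sketch with invented anomaly values and a built-in trend of 0.05 degC per year:

```python
def ols_slope(xs, ys):
    """Least-squares slope of ys regressed on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    return cov / var

years = list(range(2000, 2016))
wiggle = [0.02, -0.01, 0.03, 0.0, -0.02, 0.01, 0.02, -0.03,
          0.0, 0.01, -0.01, 0.02, 0.0, -0.02, 0.01, 0.0]
# Invented annual anomalies: 0.05 degC/yr trend plus a small wiggle.
anomalies = [0.05 * (y - 2000) + w for y, w in zip(years, wiggle)]

trend_per_decade = 10 * ols_slope(years, anomalies)
print(round(trend_per_decade, 3))  # close to 0.5 degC per decade
```

    Dataset revisions like ERSST v3b to v4 change the anomaly values themselves, and the recomputed slope is how the "increased trend" is quantified.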

  11. The OXL format for the exchange of integrated datasets

    Directory of Open Access Journals (Sweden)

    Taubert Jan

    2007-12-01

    Full Text Available A prerequisite for systems biology is the integration and analysis of heterogeneous experimental data stored in hundreds of life-science databases and millions of scientific publications. Several standardised formats for the exchange of specific kinds of biological information exist. Such exchange languages facilitate the integration process; however, they are not designed to transport integrated datasets. A format for exchanging integrated datasets needs to (i) cover data from a broad range of application domains, (ii) be flexible and extensible to combine many different complex data structures, (iii) include metadata and semantic definitions, (iv) include inferred information, (v) identify the original data source for integrated entities and (vi) transport large integrated datasets. Unfortunately, none of the exchange formats from the biological domain (e.g. BioPAX, MAGE-ML, PSI-MI, SBML) or the generic approaches (RDF, OWL) fulfil these requirements in a systematic way.

  12. Developing a Data-Set for Stereopsis

    Directory of Open Access Journals (Sweden)

    D.W Hunter

    2014-08-01

    Full Text Available Current research on binocular stereopsis in humans and non-human primates has been limited by a lack of available data-sets. Current data-sets fall into two categories: stereo-image sets with vergence but no ranging information (Hibbard, 2008, Vision Research, 48(12), 1427-1439) or combinations of depth information with binocular images and video taken from cameras in fixed fronto-parallel configurations exhibiting neither vergence nor focus effects (Hirschmuller & Scharstein, 2007, IEEE Conf. Computer Vision and Pattern Recognition). The techniques for generating depth information are also imperfect. Depth information is normally inaccurate or simply missing near edges and on partially occluded surfaces. For many areas of vision research these are the most interesting parts of the image (Goutcher, Hunter, Hibbard, 2013, i-Perception, 4(7), 484; Scarfe & Hibbard, 2013, Vision Research). Using state-of-the-art open-source ray-tracing software (PBRT) as a back-end, our intention is to release a set of tools that will allow researchers in this field to generate artificial binocular stereoscopic data-sets. Although not as realistic as photographs, computer-generated images have significant advantages in terms of control over the final output, and ground-truth information about scene depth is easily calculated at all points in the scene, even in partially occluded areas. While individual researchers have been developing similar stimuli by hand for many decades, we hope that our software will greatly reduce the time and difficulty of creating naturalistic binocular stimuli. Our intention in making this presentation is to elicit feedback from the vision community about what sort of features would be desirable in such software.

  13. BASE MAP DATASET, MAYES COUNTY, OKLAHOMA, USA

    Data.gov (United States)

    Federal Emergency Management Agency, Department of Homeland Security — FEMA Framework Basemap datasets comprise six of the seven FGDC themes of geospatial data that are used by most GIS applications: cadastral, geodetic control,...

  14. Data analysis and mapping of the mountain permafrost distribution

    Science.gov (United States)

    Deluigi, Nicola; Lambiel, Christophe; Kanevski, Mikhail

    2017-04-01

    In Alpine environments mountain permafrost is defined as a thermal state of the ground and corresponds to any lithosphere material that is at or below 0°C for at least two years. Its degradation potentially leads to increasing rock fall activity, rock glacier acceleration and an increase in sediment transfer rates. During the last 15 years, knowledge of this phenomenon has significantly increased thanks to many studies and monitoring projects. They revealed an extremely heterogeneous and complex spatial distribution. As a consequence, modelling the potential extent of mountain permafrost has recently become a very important task. Although existing statistical models generally offer a good overview at a regional scale, they are not always able to reproduce its strong spatial discontinuity at the micro scale. To address this limitation, the objective of this study is to propose an alternative modelling approach using three classification algorithms belonging to statistics and machine learning: logistic regression (LR), Support Vector Machines (SVM) and Random Forests (RF). The first is a linear parametric classifier commonly used as a benchmark before more complex classifiers are employed. Non-linear SVM is a non-parametric learning algorithm and a member of the so-called kernel methods. RF is an ensemble learning method based on bootstrap aggregating and offers an embedded measure of variable importance. Permafrost evidences were selected in a 588 km2 area of the Western Swiss Alps and serve as training examples. They were mapped from field data (thermal and geoelectrical data) and ortho-image interpretation (rock glacier inventorying). The dataset was completed with environmental predictors such as altitude, mean annual air temperature, aspect, slope, potential incoming solar radiation, normalized difference vegetation index and planar, profile and combined terrain curvature indices. Aiming at predicting
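    Of the three classifiers, logistic regression is the simplest to sketch from scratch. The following fits it by batch gradient descent on an invented two-predictor toy dataset (standardised altitude and mean annual air temperature); it illustrates the benchmark classifier only, not the study's actual model or data:

```python
import math

# Invented training data: (altitude_z, maat_z) -> 1 = permafrost evidence.
X = [(1.5, -1.2), (1.1, -0.8), (0.9, -1.0), (1.3, -1.5),
     (-1.0, 0.9), (-1.2, 1.1), (-0.8, 0.7), (-1.4, 1.3)]
y = [1, 1, 1, 1, 0, 0, 0, 0]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

w, b, lr = [0.0, 0.0], 0.0, 0.5
for _ in range(500):                       # batch gradient descent
    gw, gb = [0.0, 0.0], 0.0
    for (x1, x2), t in zip(X, y):
        p = sigmoid(w[0] * x1 + w[1] * x2 + b)
        gw[0] += (p - t) * x1
        gw[1] += (p - t) * x2
        gb += p - t
    w[0] -= lr * gw[0] / len(X)
    w[1] -= lr * gw[1] / len(X)
    b -= lr * gb / len(X)

preds = [int(sigmoid(w[0] * x1 + w[1] * x2 + b) > 0.5) for x1, x2 in X]
accuracy = sum(int(p == t) for p, t in zip(preds, y)) / len(y)
print(accuracy, w)  # 1.0 on this separable toy set; w[0] > 0, w[1] < 0
```

    The fitted signs match the physical intuition encoded in the toy data: higher altitude raises and warmer mean annual air temperature lowers the permafrost probability. SVM and RF are then compared against this baseline.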

  15. APPLYING THE BAGGING TECHNIQUE TO CLASSIFICATION ALGORITHMS TO ADDRESS CLASS IMBALANCE IN MEDICAL DATASETS

    Directory of Open Access Journals (Sweden)

    Rizki Tri Prasetio

    2016-03-01

    Full Text Available ABSTRACT – Class imbalance problems have been reported to severely hinder the classification performance of many standard learning algorithms, and have attracted a great deal of attention from researchers in different fields. Therefore, a number of methods, such as sampling methods, cost-sensitive learning methods, and bagging- and boosting-based ensemble methods, have been proposed to solve these problems. Some medical datasets have two (binomial) classes and exhibit an imbalance that reduces classification accuracy. This research proposes combining the bagging technique with classification algorithms to improve accuracy on medical datasets; bagging is used to address the imbalanced-class problem. The proposed method is applied to three classifier algorithms: naïve Bayes, decision tree and k-nearest neighbor. This research uses five medical datasets obtained from the UCI Machine Learning repository: breast-cancer, liver-disorder, heart-disease, pima-diabetes and vertebral-column. The results indicate that the proposed method yields a significant improvement for two classification algorithms, decision tree (t-test p value 0.0184) and k-nearest neighbor (t-test p value 0.0292), but not for naïve Bayes (t-test p value 0.9236). After the bagging technique is applied to the five medical datasets, naïve Bayes has the highest accuracy for the breast-cancer dataset at 96.14% with an AUC of 0.984, heart-disease at 84.44% with an AUC of 0.911, and pima-diabetes at 74.73% with an AUC of 0.806, while k-nearest neighbor has the best accuracy for the liver-disorder dataset at 62.03% with an AUC of 0.632 and vertebral-column at 82.26% with an AUC of 0.867. Keywords: ensemble technique, bagging, imbalanced class, medical dataset.
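    The bagging step itself is a bootstrap-and-vote loop. In the sketch below the base learner is a simple 1-nearest-neighbour rule and the imbalanced "medical" data are invented; in practice a library implementation such as scikit-learn's BaggingClassifier would be used:

```python
import random
from collections import Counter

random.seed(0)

# Imbalanced toy data: 12 "healthy" (0) cases near the origin and 4
# "disease" (1) cases around (3, 3); columns are (x1, x2, label).
train = [
    (0.0, 0.0, 0), (0.2, 0.1, 0), (-0.1, 0.3, 0), (0.1, -0.2, 0),
    (0.3, 0.2, 0), (-0.2, -0.1, 0), (0.0, 0.4, 0), (0.4, 0.0, 0),
    (-0.3, 0.1, 0), (0.2, -0.3, 0), (0.1, 0.1, 0), (-0.1, -0.3, 0),
    (3.0, 3.1, 1), (3.2, 2.9, 1), (2.8, 3.0, 1), (3.1, 3.2, 1),
]

def nn1_predict(data, x):
    """Base learner: 1-nearest-neighbour by squared Euclidean distance."""
    return min(data, key=lambda r: (r[0] - x[0]) ** 2 + (r[1] - x[1]) ** 2)[2]

def bagging_predict(data, x, n_estimators=25):
    """Majority vote over base learners fitted on bootstrap resamples."""
    votes = Counter()
    for _ in range(n_estimators):
        boot = [random.choice(data) for _ in data]  # bootstrap resample
        votes[nn1_predict(boot, x)] += 1
    return votes.most_common(1)[0][0]

print(bagging_predict(train, (3.0, 3.0)))  # minority "disease" class
print(bagging_predict(train, (0.0, 0.0)))  # majority class
```

    Resampling with replacement gives each base learner a slightly different view of the data, which stabilises predictions for the minority class; the paper then tests whether this stabilisation significantly improves accuracy per base algorithm.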

  16. CERC Dataset (Full Hadza Data)

    DEFF Research Database (Denmark)

    2016-01-01

    The dataset includes demographic, behavioral, and religiosity data from eight different populations from around the world. The samples were drawn from: (1) Coastal and (2) Inland Tanna, Vanuatu; (3) Hadzaland, Tanzania; (4) Lovu, Fiji; (5) Pointe aux Piment, Mauritius; (6) Pesqueiro, Brazil; (7) Kyzyl, Tyva Republic; and (8) Yasawa, Fiji. Related publication: Purzycki, et al. (2016). Moralistic Gods, Supernatural Punishment and the Expansion of Human Sociality. Nature, 530(7590): 327-330.

  17. Modelling spatial distribution of snails transmitting parasitic worms with importance to human and animal health and analysis of distributional changes in relation to climate

    Directory of Open Access Journals (Sweden)

    Ulrik B. Pedersen

    2014-05-01

    Full Text Available The environment, the ongoing global climate change and the ecology of animal species determine the localisation of habitats and the geographical distribution of the various species in nature. The aim of this study was to explore the effects of such changes on snail species of interest not only to naturalists but also of importance to human and animal health. The spatial distribution of freshwater snail intermediate hosts involved in the transmission of schistosomiasis, fascioliasis and paramphistomiasis (i.e. Bulinus globosus, Biomphalaria pfeifferi and Lymnaea natalensis) was modelled by the use of a maximum entropy algorithm (Maxent). Two snail observation datasets from Zimbabwe, from 1988 and 2012, were compared in terms of geospatial distribution and potential distributional change over the 24-year period investigated. Climate data from the two years were identified and used in a species distribution modelling framework to produce maps of predicted suitable snail habitats. Having both climate and snail observation data spaced 24 years apart represents a unique opportunity to evaluate the biological response of snails to changes in climate variables. The study shows that snail habitat suitability is highly variable in Zimbabwe, with foci mainly in the central Highveld but also in areas to the south and west. It is further demonstrated that the spatial distribution of suitable habitats changes with variation in the climatic conditions, and that this parallels the predicted climate change.

  18. Access and scientific exploitation of planetary plasma datasets with the CDPP/AMDA web-based tool

    Science.gov (United States)

    Andre, Nicolas

    2012-07-01

    The field of planetary sciences has greatly expanded in recent years with space missions orbiting around most of the planets of our Solar System. The growing amount and wealth of data available make it difficult for scientists to exploit data coming from many sources that can initially be heterogeneous in their organization, description and format. It is an important objective of the Europlanet-RI (supported by EU within FP7) to add value to space missions by significantly contributing to the effective scientific exploitation of collected data, and to enable space researchers to take full advantage of the potential value of data sets. To this end, and to enhance the science return from space missions, innovative tools have to be developed and offered to the community. AMDA (Automated Multi-Dataset Analysis, http://cdpp-amda.cesr.fr/) is a web-based facility developed at CDPP Toulouse in France (http://cdpp.cesr.fr) for online analysis of space physics data (heliosphere, magnetospheres, planetary environments) coming from either its local database or distant ones. AMDA has recently been integrated as a service to the scientific community for the Plasma Physics thematic node of the Europlanet-RI IDIS (Integrated and Distributed Information Service, http://www.europlanet-idis.fi/) activities, in close cooperation with IWF Graz (http://europlanet-plasmanode.oeaw.ac.at/index.php?id=9). We will report the status of our current technical and scientific efforts to integrate into the local database of AMDA various planetary plasma datasets (at Mercury, Venus, Mars, Earth and Moon, Jupiter, Saturn) from heterogeneous sources, including NASA/Planetary Data System (http://ppi.pds.nasa.gov/). We will also present our prototype Virtual Observatory activities to connect the AMDA tool to the IVOA Aladin astrophysical tool to enable pluridisciplinary studies of giant planet auroral emissions. This presentation will be done on behalf of the CDPP Team and the Europlanet-RI IDIS plasma node.

  19. Structural instability of sheath potential distribution and its possible implications for the L/H transition in tokamak plasmas

    International Nuclear Information System (INIS)

    Yoshida, Zensho; Yamada, Hiroshi.

    1988-07-01

    The Bohm equation for electrostatic potential distributions in one-dimensional plasmas has been studied for various Mach numbers and plasma potentials. Solvability and structural stability have been discussed using the Sagdeev potential. Implications of the structural stability for the L/H transition in tokamak plasmas have also been discussed. (author)

  20. Electricity distribution as an unsustainable natural monopoly. A potential outcome of New Zealand's regulatory regime

    International Nuclear Information System (INIS)

    Gunn, C.; Sharp, B.

    1999-01-01

    The ongoing reform of New Zealand's electricity supply industry has attempted to separate its potentially competitive elements from those with naturally monopolistic characteristics. Yet, some competition for distribution services is occurring, raising the question as to whether electricity distributors are natural monopolies as is typically assumed. This paper presents a simple model of a representative New Zealand distribution business, and shows that, in a true economic sense, distributors are most probably sustainable natural monopolies as expected. However, the model demonstrates that a mechanism for competition may arise because the financial principles enshrined in the Ministry of Commerce's regulatory regime can produce unsustainable cost structures and unintentionally introduce elements of contestability into the market for distribution services. 15 refs

  1. Error characterisation of global active and passive microwave soil moisture datasets

    Directory of Open Access Journals (Sweden)

    W. A. Dorigo

    2010-12-01

    Full Text Available Understanding the error structures of remotely sensed soil moisture observations is essential for correctly interpreting observed variations and trends in the data or assimilating them in hydrological or numerical weather prediction models. Nevertheless, a spatially coherent assessment of the quality of the various globally available datasets is often hampered by the limited availability over space and time of reliable in-situ measurements. As an alternative, this study explores the triple collocation error estimation technique for assessing the relative quality of several globally available soil moisture products from active (ASCAT) and passive (AMSR-E and SSM/I) microwave sensors. Triple collocation is a powerful statistical tool to estimate the root mean square error while simultaneously solving for systematic differences in the climatologies of a set of three linearly related data sources with independent error structures. A prerequisite for this technique is the availability of a sufficiently large number of timely corresponding observations. In addition to the active and passive satellite-based datasets, we used the ERA-Interim and GLDAS-NOAH reanalysis soil moisture datasets as a third, independent reference. The prime objective is to reveal trends in uncertainty related to different observation principles (passive versus active), the use of different frequencies (C-, X-, and Ku-band) for passive microwave observations, and the choice of the independent reference dataset (ERA-Interim versus GLDAS-NOAH). The results suggest that the triple collocation method provides realistic error estimates. Observed spatial trends agree well with the existing theory and studies on the performance of different observation principles and frequencies with respect to land cover and vegetation density. In addition, if all theoretical prerequisites are fulfilled (e.g. a sufficiently large number of common observations is available and errors of the different
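The core of the triple collocation estimator described above can be sketched as follows: given three collocated, linearly related datasets with independent zero-mean errors, cross-multiplied differences isolate each dataset's error variance. This is a minimal synthetic illustration, not the study's implementation.

```python
# Sketch of triple collocation error estimation for three datasets with
# independent errors (variable names and noise levels are illustrative).
import numpy as np

rng = np.random.default_rng(0)
truth = rng.standard_normal(100_000)             # unknown "true" signal
x = truth + 0.1 * rng.standard_normal(100_000)   # e.g. active sensor
y = truth + 0.2 * rng.standard_normal(100_000)   # e.g. passive sensor
z = truth + 0.3 * rng.standard_normal(100_000)   # e.g. reanalysis reference

def tc_rmse(a, b, c):
    """RMSE of dataset a, assuming independent zero-mean errors in a, b, c.

    In E[(a - b)(a - c)] the truth cancels and the cross terms vanish,
    leaving the error variance of a.
    """
    a, b, c = a - a.mean(), b - b.mean(), c - c.mean()
    return np.sqrt(np.mean((a - b) * (a - c)))

print(tc_rmse(x, y, z), tc_rmse(y, x, z), tc_rmse(z, x, y))  # approx 0.1, 0.2, 0.3
```

In practice the three series must first be rescaled to a common climatology, which the study handles as part of the collocation procedure.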

  2. Modelling road accident blackspots data with the discrete generalized Pareto distribution.

    Science.gov (United States)

    Prieto, Faustino; Gómez-Déniz, Emilio; Sarabia, José María

    2014-10-01

    This study shows how road traffic networks events, in particular road accidents on blackspots, can be modelled with simple probabilistic distributions. We considered the number of crashes and the number of fatalities on Spanish blackspots in the period 2003-2007, from Spanish General Directorate of Traffic (DGT). We modelled those datasets, respectively, with the discrete generalized Pareto distribution (a discrete parametric model with three parameters) and with the discrete Lomax distribution (a discrete parametric model with two parameters, and particular case of the previous model). For that, we analyzed the basic properties of both parametric models: cumulative distribution, survival, probability mass, quantile and hazard functions, genesis and rth-order moments; applied two estimation methods of their parameters: the μ and (μ+1) frequency method and the maximum likelihood method; used two goodness-of-fit tests: Chi-square test and discrete Kolmogorov-Smirnov test based on bootstrap resampling; and compared them with the classical negative binomial distribution in terms of absolute probabilities and in models including covariates. We found that those probabilistic models can be useful to describe the road accident blackspots datasets analyzed. Copyright © 2014 Elsevier Ltd. All rights reserved.
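A minimal sketch of maximum likelihood estimation for a discrete Lomax-type distribution of the kind described, defined here through its survival function S(k) = (1 + k/sigma)^(-alpha) so that P(X = k) = S(k) - S(k+1); the parameterization and the synthetic data are illustrative, not the paper's.

```python
# Sketch: MLE fit of a discrete Lomax distribution (illustrative parameterization).
import numpy as np
from scipy.optimize import minimize

def neg_loglik(params, data):
    alpha, sigma = params
    if alpha <= 0 or sigma <= 0:
        return np.inf
    s = lambda k: (1.0 + k / sigma) ** (-alpha)   # survival function
    p = s(data) - s(data + 1)                     # pmf at each observed count
    return -np.sum(np.log(p))

rng = np.random.default_rng(1)
alpha_true, sigma_true = 2.5, 3.0
# Sample by discretizing a continuous Lomax draw: k = floor(sigma*(u^(-1/alpha)-1))
u = rng.uniform(size=5000)
data = np.floor(sigma_true * (u ** (-1 / alpha_true) - 1)).astype(int)

fit = minimize(neg_loglik, x0=[1.0, 1.0], args=(data,), method="Nelder-Mead")
print(fit.x)  # should land near (2.5, 3.0)
```

Flooring a continuous Lomax draw yields exactly this discrete model, since P(floor(X) = k) = S(k) - S(k+1).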

  3. Synthetic ALSPAC longitudinal datasets for the Big Data VR project.

    Science.gov (United States)

    Avraam, Demetris; Wilson, Rebecca C; Burton, Paul

    2017-01-01

    Three synthetic datasets - of observation size 15,000, 155,000 and 1,555,000 participants, respectively - were created by simulating eleven cardiac and anthropometric variables from nine collection ages of the ALSPAC birth cohort study. The synthetic datasets retain similar data properties to the ALSPAC study data they are simulated from (covariance matrices, as well as the mean and variance values of the variables) without including the original data itself or disclosing participant information. In this instance, the three synthetic datasets have been utilised in an academia-industry collaboration to build a prototype virtual reality data analysis software, but they could have a broader use in method and software development projects where sensitive data cannot be freely shared.
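The simulation approach described, retaining only means and covariances, can be sketched as follows; the stand-in data are random and no ALSPAC variables are used.

```python
# Sketch: simulating a synthetic dataset that preserves only the mean vector
# and covariance matrix of an original dataset (here randomly generated).
import numpy as np

rng = np.random.default_rng(0)
original = rng.normal(loc=[100.0, 60.0, 1.2], scale=[15.0, 10.0, 0.2], size=(2000, 3))

mu = original.mean(axis=0)
cov = np.cov(original, rowvar=False)

# Draw synthetic participants from N(mu, cov); no original rows are reused.
synthetic = rng.multivariate_normal(mu, cov, size=15000)
print(synthetic.shape)  # (15000, 3)
```

Because only summary statistics enter the simulation, individual records cannot be recovered from the synthetic sample, which is the disclosure-control property the abstract emphasizes.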

  4. Correction of elevation offsets in multiple co-located lidar datasets

    Science.gov (United States)

    Thompson, David M.; Dalyander, P. Soupy; Long, Joseph W.; Plant, Nathaniel G.

    2017-04-07

    Introduction: Topographic elevation data collected with airborne light detection and ranging (lidar) can be used to analyze short- and long-term changes to beach and dune systems. Analysis of multiple lidar datasets at Dauphin Island, Alabama, revealed systematic, island-wide elevation differences on the order of tens of centimeters that were not attributable to real-world change and, therefore, were likely to represent systematic sampling offsets. These offsets vary between the datasets, but appear spatially consistent within a given survey. This report describes a method that was developed to identify and correct offsets between lidar datasets collected over the same site at different times so that true elevation changes over time, associated with sediment accumulation or erosion, can be analyzed.
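One simple way to implement this kind of offset correction, assuming a mask of presumed-stable ground is available, is to subtract the median elevation difference between surveys. This is a simplification of the report's method, shown with synthetic data.

```python
# Sketch: estimating and removing a systematic vertical offset between two
# lidar surveys via the median difference over stable ground (synthetic data).
import numpy as np

rng = np.random.default_rng(0)
terrain = rng.uniform(0.0, 5.0, size=(200, 200))                     # "true" elevations (m)
survey_a = terrain + 0.03 * rng.standard_normal((200, 200))          # survey noise
survey_b = terrain + 0.25 + 0.03 * rng.standard_normal((200, 200))   # +25 cm systematic bias

stable = np.ones_like(terrain, dtype=bool)  # in practice: roads, parking lots, etc.
offset = np.median((survey_b - survey_a)[stable])
survey_b_corrected = survey_b - offset

print(round(offset, 2))  # recovers the ~0.25 m offset
```

The median is preferred over the mean here because real change (erosion, accretion) in unmasked cells would otherwise bias the estimate.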

  5. BASE MAP DATASET, HONOLULU COUNTY, HAWAII, USA

    Data.gov (United States)

    Federal Emergency Management Agency, Department of Homeland Security — FEMA Framework Basemap datasets comprise six of the seven FGDC themes of geospatial data that are used by most GIS applications (Note: the seventh framework theme,...

  6. BASE MAP DATASET, LOS ANGELES COUNTY, CALIFORNIA

    Data.gov (United States)

    Federal Emergency Management Agency, Department of Homeland Security — FEMA Framework Basemap datasets comprise six of the seven FGDC themes of geospatial data that are used by most GIS applications (Note: the seventh framework theme,...

  7. BASE MAP DATASET, CHEROKEE COUNTY, SOUTH CAROLINA

    Data.gov (United States)

    Federal Emergency Management Agency, Department of Homeland Security — FEMA Framework Basemap datasets comprise six of the seven FGDC themes of geospatial data that are used by most GIS applications (Note: the seventh framework theme,...

  8. BASE MAP DATASET, EDGEFIELD COUNTY, SOUTH CAROLINA

    Data.gov (United States)

    Federal Emergency Management Agency, Department of Homeland Security — FEMA Framework Basemap datasets comprise six of the seven FGDC themes of geospatial data that are used by most GIS applications (Note: the seventh framework theme,...

  9. BASE MAP DATASET, SANTA CRUZ COUNTY, CALIFORNIA

    Data.gov (United States)

    Federal Emergency Management Agency, Department of Homeland Security — FEMA Framework Basemap datasets comprise six of the seven FGDC themes of geospatial data that are used by most GIS applications (Note: the seventh framework theme,...

  10. Dataset from the global phosphoproteomic mapping of early mitotic exit in human cells

    Directory of Open Access Journals (Sweden)

    Samuel Rogers

    2015-12-01

    Full Text Available The presence or absence of a phosphorylation on a substrate at any particular point in time is a functional readout of the balance in activity between the regulatory kinase and the counteracting phosphatase. Understanding how stable or short-lived a phosphorylation site is, is required for fully appreciating the biological consequences of the phosphorylation. Our current understanding of kinases and their substrates is well established; however, the role phosphatases play is less understood. Therefore, we utilized a phosphatase-dependent model of mitotic exit to identify potential substrates that are preferentially dephosphorylated. Using this method, we identified >16,000 phosphosites on >3300 unique proteins, and quantified the temporal phosphorylation changes that occur during early mitotic exit (McCloy et al., 2015 [1]). Furthermore, we annotated the majority of these phosphorylation sites with a high-confidence upstream kinase using published, motif- and prediction-based methods. The results from this study have been deposited into the ProteomeXchange repository with identifier PXD001559. Here we provide additional analysis of this dataset; for each of the major mitotic kinases we identified motifs that correlated strongly with phosphorylation status. These motifs could be used to predict the stability of phosphorylated residues in proteins of interest, and help infer potential functional roles for uncharacterized phosphorylations. In addition, we provide validation at the single-cell level that serine residues phosphorylated by Cdk are stable during phosphatase-dependent mitotic exit. In summary, this unique dataset contains information on the temporal mitotic stability of thousands of phosphorylation sites regulated by dozens of kinases, and information on the potential preference that phosphatases have at both the protein and individual phosphosite level. The compilation of this data provides an invaluable resource for the wider research community.

  11. Distribution of convection potential around the polar cap boundary as a function of the interplanetary magnetic field

    International Nuclear Information System (INIS)

    Lu, G.; Reiff, P.H.; Karty, J.L.; Hairston, M.R.; Heelis, R.A.

    1989-01-01

    Plasma flow data from the AE-C, AE-D and DE 2 satellites have been used to systematically study the distribution of the convection potential around the polar cap boundary under a variety of different interplanetary magnetic field (IMF) conditions. For either a garden hose (BxBy < 0) or an ortho-garden hose (BxBy > 0) orientation of the IMF, the potential distribution is mainly affected by the sign of By. In the northern hemisphere, the zero potential line (which separates the dusk convection cell from the dawn cell) on the dayside shifts duskward as By changes from positive to negative. But in the southern hemisphere, a dawnward shift has been found, although the uncertainties are large. The typical range of displacement is about ±1.5 hours MLT. Note that this shift is in the opposite direction from most simple schematic models of ionospheric flow; this reflects the fact that the polar cap boundary is typically more poleward than the flow reversal associated with the region 1 current system, which shifts in the opposite direction. Thus the enhanced flow region typically crosses noon. In most cases a sine wave is an adequate representation of the distribution of potential around the boundary. However, in a few cases the data favor (at the 80% confidence level) a steeper gradient near noon, more indicative of a throat. The potential drop at the duskside boundary is almost always greater than at the dawnside boundary. A slight duskward shift of the patterns is observed as the IMF changes from garden hose to ortho-garden hose conditions. Analytic equipotential contours, given the potential function as a boundary condition, are constructed for several IMF conditions.

  12. BigWig and BigBed: enabling browsing of large distributed datasets.

    Science.gov (United States)

    Kent, W J; Zweig, A S; Barber, G; Hinrichs, A S; Karolchik, D

    2010-09-01

    BigWig and BigBed files are compressed binary indexed files containing data at several resolutions that allow the high-performance display of next-generation sequencing experiment results in the UCSC Genome Browser. The visualization is implemented using a multi-layered software approach that takes advantage of specific capabilities of web-based protocols and Linux and UNIX operating system files, R trees and various indexing and compression tricks. As a result, only the data needed to support the current browser view is transmitted rather than the entire file, enabling fast remote access to large distributed data sets. Binaries for the BigWig and BigBed creation and parsing utilities may be downloaded at http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/. Source code for the creation and visualization software is freely available for non-commercial use at http://hgdownload.cse.ucsc.edu/admin/jksrc.zip, implemented in C and supported on Linux. The UCSC Genome Browser is available at http://genome.ucsc.edu.
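The multi-resolution principle behind BigWig zoom levels, serving a browser view from precomputed coarse summaries rather than the full-resolution data, can be sketched as follows. This is a toy model of the idea only, not the actual BigWig file format or its R-tree index.

```python
# Toy sketch of zoom levels: precompute per-bin means at coarser resolutions
# so that a "browser view" only touches as many values as it can display.
import numpy as np

signal = np.arange(2 ** 20, dtype=float)  # base-resolution coverage values

# Precompute zoom levels: means over bins of 2, 4, ..., 1024 base values
zooms = {}
level = signal
for k in range(1, 11):
    level = level.reshape(-1, 2).mean(axis=1)
    zooms[2 ** k] = level

def view(start, end, max_points=1000):
    """Summarize signal[start:end] from the coarsest adequate zoom level."""
    span = end - start
    bin_size = max(1, 2 ** int(np.log2(max(1, span // max_points))))
    if bin_size == 1:
        return signal[start:end]                       # raw data is small enough
    z = zooms[bin_size]
    return z[start // bin_size : end // bin_size]      # precomputed summaries

print(len(view(0, 2 ** 20)))  # far fewer points than the million-value region
```

A real BigWig file additionally stores min, max, count and sum per bin and locates the bins with a spatial index, so a remote client can fetch only the byte ranges covering the requested interval.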

  13. Scale and shape mixtures of multivariate skew-normal distributions

    KAUST Repository

    Arellano-Valle, Reinaldo B.

    2018-02-26

    We introduce a broad and flexible class of multivariate distributions obtained by both scale and shape mixtures of multivariate skew-normal distributions. We present the probabilistic properties of this family of distributions in detail and lay down the theoretical foundations for subsequent inference with this model. In particular, we study linear transformations, marginal distributions, selection representations, stochastic representations and hierarchical representations. We also describe an EM-type algorithm for maximum likelihood estimation of the parameters of the model and demonstrate its implementation on a wind dataset. Our family of multivariate distributions unifies and extends many existing models of the literature that can be seen as submodels of our proposal.

  14. A Review of Multivariate Distributions for Count Data Derived from the Poisson Distribution.

    Science.gov (United States)

    Inouye, David; Yang, Eunho; Allen, Genevera; Ravikumar, Pradeep

    2017-01-01

    The Poisson distribution has been widely studied and used for modeling univariate count-valued data. Multivariate generalizations of the Poisson distribution that permit dependencies, however, have been far less popular. Yet, real-world high-dimensional count-valued data found in word counts, genomics, and crime statistics, for example, exhibit rich dependencies, and motivate the need for multivariate distributions that can appropriately model this data. We review multivariate distributions derived from the univariate Poisson, categorizing these models into three main classes: 1) where the marginal distributions are Poisson, 2) where the joint distribution is a mixture of independent multivariate Poisson distributions, and 3) where the node-conditional distributions are derived from the Poisson. We discuss the development of multiple instances of these classes and compare the models in terms of interpretability and theory. Then, we empirically compare multiple models from each class on three real-world datasets that have varying data characteristics from different domains, namely traffic accident data, biological next generation sequencing data, and text data. These empirical experiments develop intuition about the comparative advantages and disadvantages of each class of multivariate distribution that was derived from the Poisson. Finally, we suggest new research directions as explored in the subsequent discussion section.
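One classical construction in the first class above (Poisson marginals with dependence) is the common-shock, or trivariate reduction, model: since sums of independent Poisson variables are Poisson, sharing a component preserves the marginals while inducing positive correlation. The sketch below is a generic illustration, not any specific model from the review.

```python
# Sketch: bivariate counts with Poisson marginals via a shared component.
# X1 = Y1 + Z and X2 = Y2 + Z with Y1, Y2, Z independent Poisson, so
# X1 ~ Poisson(3 + 2) and X2 ~ Poisson(1 + 2), with Cov(X1, X2) = Var(Z) = 2.
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
z = rng.poisson(2.0, n)            # shared "common shock"
x1 = rng.poisson(3.0, n) + z       # marginal: Poisson(5)
x2 = rng.poisson(1.0, n) + z       # marginal: Poisson(3)

print(x1.mean(), x1.var())         # both approx 5 (Poisson mean = variance)
print(np.corrcoef(x1, x2)[0, 1])   # approx 2 / sqrt(5 * 3) = 0.516
```

A known limitation, discussed for this model class, is that the construction can only produce non-negative correlation, which motivates the other classes in the review.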

  15. Satellite-Based Precipitation Datasets

    Science.gov (United States)

    Munchak, S. J.; Huffman, G. J.

    2017-12-01

    Of the possible sources of precipitation data, those based on satellites provide the greatest spatial coverage. There is a wide selection of datasets, algorithms, and versions from which to choose, which can be confusing to non-specialists wishing to use the data. The International Precipitation Working Group (IPWG) maintains tables of the major publicly available, long-term, quasi-global precipitation data sets (http://www.isac.cnr.it/ipwg/data/datasets.html), and this talk briefly reviews the various categories. As examples, NASA provides two sets of quasi-global precipitation data sets: the older Tropical Rainfall Measuring Mission (TRMM) Multi-satellite Precipitation Analysis (TMPA) and current Integrated Multi-satellitE Retrievals for Global Precipitation Measurement (GPM) mission (IMERG). Both provide near-real-time and post-real-time products that are uniformly gridded in space and time. The TMPA products are 3-hourly 0.25°x0.25° on the latitude band 50°N-S for about 16 years, while the IMERG products are half-hourly 0.1°x0.1° on 60°N-S for over 3 years (with plans to go to 16+ years in Spring 2018). In addition to the precipitation estimates, each data set provides fields of other variables, such as the satellite sensor providing estimates and estimated random error. The discussion concludes with advice about determining suitability for use, the necessity of being clear about product names and versions, and the need for continued support for satellite- and surface-based observation.

  16. Se-SAD serial femtosecond crystallography datasets from selenobiotinyl-streptavidin

    Science.gov (United States)

    Yoon, Chun Hong; Demirci, Hasan; Sierra, Raymond G.; Dao, E. Han; Ahmadi, Radman; Aksit, Fulya; Aquila, Andrew L.; Batyuk, Alexander; Ciftci, Halilibrahim; Guillet, Serge; Hayes, Matt J.; Hayes, Brandon; Lane, Thomas J.; Liang, Meng; Lundström, Ulf; Koglin, Jason E.; Mgbam, Paul; Rao, Yashas; Rendahl, Theodore; Rodriguez, Evan; Zhang, Lindsey; Wakatsuki, Soichi; Boutet, Sébastien; Holton, James M.; Hunter, Mark S.

    2017-04-01

    We provide a detailed description of selenobiotinyl-streptavidin (Se-B SA) co-crystal datasets recorded using the Coherent X-ray Imaging (CXI) instrument at the Linac Coherent Light Source (LCLS) for selenium single-wavelength anomalous diffraction (Se-SAD) structure determination. Se-B SA was chosen as the model system for its high affinity between biotin and streptavidin where the sulfur atom in the biotin molecule (C10H16N2O3S) is substituted with selenium. The dataset was collected at three different transmissions (100, 50, and 10%) using a serial sample chamber setup which allows for two sample chambers, a front chamber and a back chamber, to operate simultaneously. Diffraction patterns from Se-B SA were recorded to a resolution of 1.9 Å. The dataset is publicly available through the Coherent X-ray Imaging Data Bank (CXIDB) and also on LCLS compute nodes as a resource for research and algorithm development.

  17. U.S. Climate Divisional Dataset (Version Superseded)

    Data.gov (United States)

    National Oceanic and Atmospheric Administration, Department of Commerce — This data has been superseded by a newer version of the dataset. Please refer to NOAA's Climate Divisional Database for more information. The U.S. Climate Divisional...

  18. Spectral methods in machine learning and new strategies for very large datasets

    Science.gov (United States)

    Belabbas, Mohamed-Ali; Wolfe, Patrick J.

    2009-01-01

    Spectral methods are of fundamental importance in statistics and machine learning, because they underlie algorithms from classical principal components analysis to more recent approaches that exploit manifold structure. In most cases, the core technical problem can be reduced to computing a low-rank approximation to a positive-definite kernel. For the growing number of applications dealing with very large or high-dimensional datasets, however, the optimal approximation afforded by an exact spectral decomposition is too costly, because its complexity scales as the cube of either the number of training examples or their dimensionality. Motivated by such applications, we present here 2 new algorithms for the approximation of positive-semidefinite kernels, together with error bounds that improve on results in the literature. We approach this problem by seeking to determine, in an efficient manner, the most informative subset of our data relative to the kernel approximation task at hand. This leads to two new strategies based on the Nyström method that are directly applicable to massive datasets. The first of these—based on sampling—leads to a randomized algorithm whereupon the kernel induces a probability distribution on its set of partitions, whereas the latter approach—based on sorting—provides for the selection of a partition in a deterministic way. We detail their numerical implementation and provide simulation results for a variety of representative problems in statistical data analysis, each of which demonstrates the improved performance of our approach relative to existing methods. PMID:19129490

  19. UK surveillance: provision of quality assured information from combined datasets.

    Science.gov (United States)

    Paiba, G A; Roberts, S R; Houston, C W; Williams, E C; Smith, L H; Gibbens, J C; Holdship, S; Lysons, R

    2007-09-14

    Surveillance information is most useful when provided within a risk framework, which is achieved by presenting results against an appropriate denominator. Often the datasets are captured separately and for different purposes, and will have inherent errors and biases that can be further confounded by the act of merging. The United Kingdom Rapid Analysis and Detection of Animal-related Risks (RADAR) system contains data from several sources and provides both data extracts for research purposes and reports for wider stakeholders. Considerable efforts are made to optimise the data in RADAR during the Extraction, Transformation and Loading (ETL) process. Despite efforts to ensure data quality, the final dataset inevitably contains some data errors and biases, most of which cannot be rectified during subsequent analysis. So, in order for users to establish the 'fitness for purpose' of data merged from more than one data source, Quality Statements are produced as defined within the overarching surveillance Quality Framework. These documents detail identified data errors and biases following ETL and report construction as well as relevant aspects of the datasets from which the data originated. This paper illustrates these issues using RADAR datasets, and describes how they can be minimised.

  20. Mapping current and potential distribution of non-native Prosopis juliflora in the Afar region of Ethiopia.

    Directory of Open Access Journals (Sweden)

    Tewodros T Wakie

    Full Text Available We used correlative models with species occurrence points, Moderate Resolution Imaging Spectroradiometer (MODIS) vegetation indices, and topo-climatic predictors to map the current distribution and potential habitat of invasive Prosopis juliflora in Afar, Ethiopia. Time-series of MODIS Enhanced Vegetation Indices (EVI) and Normalized Difference Vegetation Indices (NDVI) with 250 m spatial resolution were selected as remote sensing predictors for mapping distributions, while WorldClim bioclimatic products and topographic variables generated from the Shuttle Radar Topography Mission (SRTM) product were used to predict potential infestations. We ran Maxent models using non-correlated variables and the 143 species-occurrence points. Maxent-generated probability surfaces were converted into binary maps using the 10-percentile logistic threshold values. Performance of the models was evaluated using the area under the receiver-operating characteristic (ROC) curve (AUC). Our results indicate that the extent of P. juliflora invasion is approximately 3,605 km2 in the Afar region (AUC = 0.94), while the potential habitat for future infestations is 5,024 km2 (AUC = 0.95). Our analyses demonstrate that time-series of MODIS vegetation indices and species occurrence points can be used with the Maxent modeling software to map the current distribution of P. juliflora, while topo-climatic variables are good predictors of potential habitat in Ethiopia. Our results can quantify current and future infestations, and inform management and policy decisions for containing P. juliflora. Our methods can also be replicated for managing invasive species in other East African countries.
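The 10-percentile training-presence threshold rule mentioned above can be sketched as follows: the continuous suitability surface is binarized at the value below which the lowest 10% of occurrence points fall. All values here are synthetic; this is not the authors' code.

```python
# Sketch: converting a continuous habitat-suitability surface into a binary
# presence/absence map with a 10-percentile training-presence threshold.
import numpy as np

rng = np.random.default_rng(0)
suitability = rng.uniform(0.0, 1.0, size=(100, 100))   # Maxent-style output grid

# Suitability values at 143 species-occurrence cells (synthetic locations)
occ_rows = rng.integers(0, 100, 143)
occ_cols = rng.integers(0, 100, 143)
occ_values = suitability[occ_rows, occ_cols]

threshold = np.percentile(occ_values, 10)
binary_map = suitability >= threshold                  # True = predicted presence

print(round(threshold, 3), binary_map.mean())
```

By construction, roughly 90% of the training occurrences fall in cells predicted as presence, which makes the rule robust to a small fraction of mislocated or atypical occurrence records.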

  1. C3PO - A dynamic data placement agent for ATLAS distributed data management

    CERN Document Server

    AUTHOR|(INSPIRE)INSPIRE-00346910; The ATLAS collaboration; Lassnig, Mario; Barisits, Martin-Stefan; Serfon, Cedric; Garonne, Vincent

    2017-01-01

    This paper introduces a new dynamic data placement agent for the ATLAS distributed data management system. This agent is designed to pre-place potentially popular data to make it more widely available. It therefore incorporates information from a variety of sources. These include input dataset and site workload information from the ATLAS workload management system, network metrics from different sources such as FTS and PerfSonar, historical popularity data collected through a tracer mechanism, and more. With this data it decides if, when and where to place new replicas, which can then be used by the WMS to distribute the workload more evenly over available computing resources and ultimately reduce job waiting times. This paper gives an overview of the architecture and the final implementation of this new agent. The paper also includes an evaluation of the placement algorithm by comparing the transfer times and the new replica usage.

  2. Discovering New Global Climate Patterns: Curating a 21-Year High Temporal (Hourly) and Spatial (40km) Resolution Reanalysis Dataset

    Science.gov (United States)

    Hou, C. Y.; Dattore, R.; Peng, G. S.

    2014-12-01

    The National Center for Atmospheric Research's Global Climate Four-Dimensional Data Assimilation (CFDDA) Hourly 40km Reanalysis dataset is a dynamically downscaled dataset with high temporal and spatial resolution. The dataset contains three-dimensional hourly analyses in netCDF format for the global atmospheric state from 1985 to 2005 on a 40km horizontal grid (0.4° grid increment) with 28 vertical levels, providing good representation of local forcing and diurnal variation of processes in the planetary boundary layer. This project aimed to make the dataset publicly available, accessible, and usable in order to provide a unique resource to allow and promote studies of new climate characteristics. When the curation project started, it had been five years since the data files were generated. Also, although the Principal Investigator (PI) had generated a user document at the end of the project in 2009, the document had not been maintained. Furthermore, the PI had moved to a new institution, and the remaining team members were reassigned to other projects. These factors made data curation especially challenging in the areas of verifying data quality, harvesting metadata descriptions, and documenting provenance information. As a result, the project's curation process found that: Data curators' skill and knowledge helped make decisions, such as file format and structure and workflow documentation, that had a significant, positive impact on the ease of the dataset's management and long-term preservation. Use of data curation tools, such as the Data Curation Profiles Toolkit's guidelines, revealed important information for promoting the data's usability and enhancing preservation planning. Involving data curators during each stage of the data curation life cycle, instead of at the end, could improve the curation process' efficiency. Overall, the project showed that proper resources invested in the curation process would give datasets the best chance to fulfill their potential to

  3. Potential distribution of pine wilt disease under future climate change scenarios.

    Directory of Open Access Journals (Sweden)

    Akiko Hirata

    Full Text Available Pine wilt disease (PWD) constitutes a serious threat to pine forests. Since PWD development depends on temperature and drought, there is a concern that future climate change could lead to the spread of PWD infections. We evaluated the risk of PWD in 21 susceptible Pinus species on a global scale. The MB index, defined as the sum of the differences between the mean monthly temperature and 15 for all months in which the mean monthly temperature exceeds 15°C, was used to determine current and future regions vulnerable to PWD (MB ≥ 22). For future climate conditions, we compared the difference in PWD risk among four representative concentration pathways (RCPs 2.6, 4.5, 6.0, and 8.5) and two time periods (the 2050s and 2070s). We also evaluated the impact of climate change on habitat suitability for each Pinus species using species distribution models. The findings were then integrated, and the potential risk of PWD spread under climate change was discussed. Within the natural Pinus distribution area, southern parts of North America, Europe, and Asia were categorized as vulnerable regions (MB ≥ 22; 16% of the total Pinus distribution area). Representative provinces in which PWD has been reported at least once overlapped with the vulnerable regions. All RCP scenarios showed expansion of vulnerable regions into northern parts of Europe, Asia, and North America under future climate conditions. By the 2070s, under RCP 8.5, the area of vulnerable regions was estimated to increase to approximately 50% of the total Pinus distribution area. In addition, the habitat conditions of a large portion of the Pinus distribution areas in Europe and Asia were deemed unsuitable by the 2070s under RCP 8.5. Approximately 40% of these regions overlapped with regions deemed vulnerable to PWD, suggesting that Pinus forests in these areas are at risk of serious damage due to habitat shifts and the spread of PWD.
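
    The MB index described in the abstract is straightforward to compute from a 12-month temperature climatology. A minimal sketch in Python (function names and the example threshold default are mine, not from the paper):

```python
def mb_index(monthly_mean_temps):
    """MB index: sum of (T - 15) over the months whose mean temperature exceeds 15 degC."""
    return sum(t - 15 for t in monthly_mean_temps if t > 15)

def vulnerable_to_pwd(monthly_mean_temps, threshold=22):
    """Per the abstract, a region is classed as vulnerable to PWD when MB >= 22."""
    return mb_index(monthly_mean_temps) >= threshold
```

    Applied per grid cell of a climate dataset, this yields the vulnerability maps the study compares across RCP scenarios.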

  4. The potential effects of climate change on the distribution and productivity of Cunninghamia lanceolata in China.

    Science.gov (United States)

    Liu, Yupeng; Yu, Deyong; Xun, Bin; Sun, Yun; Hao, Ruifang

    2014-01-01

    Climate changes may have immediate implications for forest productivity and may produce dramatic shifts in tree species distributions in the future. Quantifying these implications is significant for both scientists and managers. Cunninghamia lanceolata is an important coniferous timber species due to its fast growth and wide distribution in China. This paper proposes a methodology aimed at evaluating the distribution and productivity of C. lanceolata against a background of climate change. First, we simulated the potential distributions and establishment probabilities of C. lanceolata based on a species distribution model. Second, a process-based model, the PnET-II model, was calibrated and its parameterization of water balance improved. Finally, the improved PnET-II model was used to simulate the net primary productivity (NPP) of C. lanceolata. The simulated NPP and potential distribution were combined to produce an integrated indicator, the estimated total NPP, which serves to comprehensively characterize the productivity of the forest under climate change. The results of the analysis showed that (1) the distribution of C. lanceolata will increase in central China, but the mean probability of establishment will decrease in the 2050s; (2) the PnET-II model was improved, calibrated, and successfully validated for the simulation of the NPP of C. lanceolata in China; and (3) all scenarios predicted a reduction in total NPP in the 2050s, with a markedly lower reduction under the A2 scenario than under the B2 scenario. The changes in NPP suggested that forest productivity will show a large decrease in southern China and a mild increase in central China. All of these findings could improve our understanding of the impact of climate change on forest ecosystem structure and function and could provide a basis for policy-makers to apply adaptive measures and overcome the unfavorable influences of climate change.
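
    The "estimated total NPP" indicator combines the two model outputs per grid cell. The abstract does not give the exact formula, so the weighting below (establishment probability times simulated NPP times cell area, summed over cells) is my assumption of the natural construction:

```python
def estimated_total_npp(cells):
    """Integrated indicator sketch: sum over grid cells of
    establishment probability * simulated NPP * cell area.
    `cells` is an iterable of (probability, npp, area) tuples."""
    return sum(p * npp * area for p, npp, area in cells)
```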

  5. Learning analytics: Dataset for empirical evaluation of entry requirements into engineering undergraduate programs in a Nigerian university.

    Science.gov (United States)

    Odukoya, Jonathan A; Popoola, Segun I; Atayero, Aderemi A; Omole, David O; Badejo, Joke A; John, Temitope M; Olowo, Olalekan O

    2018-04-01

    In Nigerian universities, enrolment into any engineering undergraduate program requires that the minimum entry criteria established by the National Universities Commission (NUC) must be satisfied. Candidates seeking admission to study an engineering discipline must have reached a predetermined entry age and met the cut-off marks set for the Senior School Certificate Examination (SSCE), the Unified Tertiary Matriculation Examination (UTME), and the post-UTME screening. However, limited effort has been made to show that these entry requirements eventually guarantee successful academic performance in engineering programs because the data required for such validation are not readily available. In this data article, a comprehensive dataset for empirical evaluation of entry requirements into engineering undergraduate programs in a Nigerian university is presented and carefully analyzed. A total sample of 1445 undergraduates that were admitted between 2005 and 2009 to study Chemical Engineering (CHE), Civil Engineering (CVE), Computer Engineering (CEN), Electrical and Electronics Engineering (EEE), Information and Communication Engineering (ICE), Mechanical Engineering (MEE), and Petroleum Engineering (PET) at Covenant University, Nigeria were randomly selected. Entry age, SSCE aggregate, UTME score, Covenant University Scholastic Aptitude Screening (CUSAS) score, and the Cumulative Grade Point Average (CGPA) of the undergraduates were obtained from the Student Records and Academic Affairs unit. In order to facilitate evidence-based evaluation, the robust dataset is made publicly available in a Microsoft Excel spreadsheet file. On a yearly basis, first-order descriptive statistics of the dataset are presented in tables. Box plot representations, frequency distribution plots, and scatter plots of the dataset are provided to enrich its value. Furthermore, correlation and linear regression analyses are performed to understand the relationship between the entry requirements and the
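
    The linear regression analyses mentioned in the abstract reduce, in the single-predictor case, to an ordinary least-squares line. A self-contained sketch (the pairing of UTME score with CGPA in the comment is illustrative, not taken from the article's results):

```python
def ols_fit(x, y):
    """Ordinary least-squares fit of the line y ~ a + b*x,
    e.g. regressing CGPA on UTME score. Returns (intercept, slope)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
        sum((xi - mx) ** 2 for xi in x)
    a = my - b * mx
    return a, b
```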

  6. Climate Prediction Center IR 4km Dataset

    Data.gov (United States)

    National Oceanic and Atmospheric Administration, Department of Commerce — CPC IR 4km dataset was created from all available individual geostationary satellite data which have been merged to form nearly seamless global (60N-60S) IR...

  7. Multivariate Analysis of Multiple Datasets: a Practical Guide for Chemical Ecology.

    Science.gov (United States)

    Hervé, Maxime R; Nicolè, Florence; Lê Cao, Kim-Anh

    2018-03-01

    Chemical ecology has strong links with metabolomics, the large-scale study of all metabolites detectable in a biological sample. Consequently, chemical ecologists are often challenged by the statistical analysis of such large datasets. This holds especially true when the purpose is to integrate multiple datasets to obtain a holistic view and a better understanding of a biological system under study. The present article provides a comprehensive resource for analyzing such complex datasets using multivariate methods. It ranges from the necessary pre-treatment of data, including data transformations and distance calculations, to the application of both gold-standard and novel multivariate methods for the integration of different omics data. We illustrate the process of analysis, along with detailed interpretations of the results, for six issues representative of the different types of biological questions encountered by chemical ecologists. We provide the necessary knowledge and tools, with reproducible R code and chemical-ecological datasets, to practice and teach multivariate methods.
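
    One common pre-treatment in this literature is the Hellinger transformation, which converts raw abundances to square-rooted relative abundances before Euclidean-based ordination. A minimal sketch in Python (the article itself works in R; this is an illustrative translation, not its code):

```python
import math

def hellinger(row):
    """Hellinger transformation of one sample: sqrt of each value's
    share of the row total. Downweights dominant metabolites before
    Euclidean distance calculations."""
    total = sum(row)
    return [math.sqrt(v / total) for v in row]
```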

  8. Analysis of k-means clustering approach on the breast cancer Wisconsin dataset.

    Science.gov (United States)

    Dubey, Ashutosh Kumar; Gupta, Umesh; Jain, Sonal

    2016-11-01

    Breast cancer is one of the most common cancers found worldwide and most frequently found in women. An early detection of breast cancer provides the possibility of its cure; therefore, a large number of studies are currently going on to identify methods that can detect breast cancer in its early stages. This study aimed to find the effects of the k-means clustering algorithm with different computation measures, like centroid, distance, split method, epoch, attribute, and iteration, and to carefully identify the combination of measures that has the potential to yield highly accurate clustering. The k-means algorithm was used to evaluate the impact of clustering using centroid initialization, distance measures, and split methods. The experiments were performed using the breast cancer Wisconsin (BCW) diagnostic dataset. Foggy and random centroids were used for the centroid initialization. In foggy centroid, the first centroid was calculated based on random values. For random centroid, the initial centroid was considered as (0, 0). The results were obtained by employing the k-means algorithm and are discussed with different cases considering variable parameters. The calculations were based on the centroid (foggy/random), distance (Euclidean/Manhattan/Pearson), split (simple/variance), threshold (constant epoch/same centroid), attribute (2-9), and iteration (4-10). Approximately 92% average positive prediction accuracy was obtained with this approach. Better results were found for the same centroid and the highest variance. The results achieved using Euclidean and Manhattan were better than the Pearson correlation. The findings of this work provided extensive understanding of the computational parameters that can be used with k-means. The results indicated that k-means has the potential to classify the BCW dataset.
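
    The core of the study's comparison is k-means with a pluggable distance measure. A self-contained sketch of that idea (the "first k points" initialization below is a simplification of my own; the study's foggy/random initializations would slot in where the centroids are seeded):

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def kmeans(points, k, distance=euclidean, epochs=10):
    """Plain k-means with a swappable distance function, echoing the
    study's Euclidean/Manhattan comparison. Seeds centroids with the
    first k points for determinism (a simplification)."""
    centroids = [points[i] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(epochs):
        # assignment step: nearest centroid under the chosen metric
        labels = [min(range(k), key=lambda c: distance(p, centroids[c]))
                  for p in points]
        # update step: mean of each cluster (keep old centroid if empty)
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centroids[c] = tuple(sum(d) / len(members)
                                     for d in zip(*members))
    return labels, centroids
```

    Swapping `distance=manhattan` reruns the same experiment under the alternative metric, which is essentially the comparison the study tabulates.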

  9. Harvard Aging Brain Study : Dataset and accessibility

    NARCIS (Netherlands)

    Dagley, Alexander; LaPoint, Molly; Huijbers, Willem; Hedden, Trey; McLaren, Donald G.; Chatwal, Jasmeer P.; Papp, Kathryn V.; Amariglio, Rebecca E.; Blacker, Deborah; Rentz, Dorene M.; Johnson, Keith A.; Sperling, Reisa A.; Schultz, Aaron P.

    2017-01-01

    The Harvard Aging Brain Study is sharing its data with the global research community. The longitudinal dataset consists of a 284-subject cohort with the following modalities acquired: demographics, clinical assessment, comprehensive neuropsychological testing, clinical biomarkers, and neuroimaging.

  10. Establishing macroecological trait datasets: digitalization, extrapolation, and validation of diet preferences in terrestrial mammals worldwide.

    Science.gov (United States)

    Kissling, Wilm Daniel; Dalby, Lars; Fløjgaard, Camilla; Lenoir, Jonathan; Sandel, Brody; Sandom, Christopher; Trøjelsgaard, Kristian; Svenning, Jens-Christian

    2014-07-01

    Ecological trait data are essential for understanding the broad-scale distribution of biodiversity and its response to global change. For animals, diet represents a fundamental aspect of species' evolutionary adaptations, ecological and functional roles, and trophic interactions. However, the importance of diet for macroevolutionary and macroecological dynamics remains little explored, partly because of the lack of comprehensive trait datasets. We compiled and evaluated a comprehensive global dataset of diet preferences of mammals ("MammalDIET"). Diet information was digitized from two global and cladewide data sources and errors of data entry by multiple data recorders were assessed. We then developed a hierarchical extrapolation procedure to fill in diet information for species with missing information. Missing data were extrapolated with information from other taxonomic levels (genus, other species within the same genus, or family) and this extrapolation was subsequently validated both internally (with a jack-knife approach applied to the compiled species-level diet data) and externally (using independent species-level diet information from a comprehensive continentwide data source). Finally, we grouped mammal species into trophic levels and dietary guilds, and their species richness as well as their proportion of total richness were mapped at a global scale for those diet categories with good validation results. The success rate of correctly digitizing data was 94%, indicating that the consistency in data entry among multiple recorders was high. Data sources provided species-level diet information for a total of 2033 species (38% of all 5364 terrestrial mammal species, based on the IUCN taxonomy). For the remaining 3331 species, diet information was mostly extrapolated from genus-level diet information (48% of all terrestrial mammal species), and only rarely from other species within the same genus (6%) or from family level (8%). Internal and external
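
    The hierarchical extrapolation step is a cascade of fallbacks through taxonomic levels. A minimal sketch, assuming pre-aggregated genus- and family-level diet summaries (the data structures and function name are mine, not MammalDIET's):

```python
def extrapolate_diet(species, species_diet, genus_diet, family_diet,
                     genus_of, family_of):
    """Prefer species-level diet data; otherwise fall back to the genus-level
    summary, then the family-level summary. Returns (diet, source_level)."""
    if species in species_diet:
        return species_diet[species], "species"
    genus = genus_of.get(species)
    if genus in genus_diet:
        return genus_diet[genus], "genus"
    family = family_of.get(species)
    if family in family_diet:
        return family_diet[family], "family"
    return None, "unresolved"
```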

  11. Meta-Analysis of High-Throughput Datasets Reveals Cellular Responses Following Hemorrhagic Fever Virus Infection

    Directory of Open Access Journals (Sweden)

    Gavin C. Bowick

    2011-05-01

    Full Text Available The continuing use of high-throughput assays to investigate cellular responses to infection is providing a large repository of information. Due to the large number of differentially expressed transcripts, often running into the thousands, the majority of these data have not been thoroughly investigated. Advances in techniques for the downstream analysis of high-throughput datasets are providing new methods for the generation of additional hypotheses for further investigation. The large number of experimental observations, combined with databases that correlate particular genes and proteins with canonical pathways, functions and diseases, allows for the bioinformatic exploration of functional networks that may be implicated in replication or pathogenesis. Herein, we provide an example of how analysis of published high-throughput datasets of cellular responses to hemorrhagic fever virus infection can generate additional functional data. We describe enrichment of genes involved in metabolism, post-translational modification and cardiac damage; potential roles for specific transcription factors and a conserved involvement of a pathway based around cyclooxygenase-2. We believe that these types of analyses can provide virologists with additional hypotheses for continued investigation.

  12. Acquiring a four-dimensional computed tomography dataset using an external respiratory signal

    International Nuclear Information System (INIS)

    Vedam, S S; Keall, P J; Kini, V R; Mostafavi, H; Shukla, H P; Mohan, R

    2003-01-01

    Four-dimensional (4D) methods strive to achieve highly conformal radiotherapy, particularly for lung and breast tumours, in the presence of respiratory-induced motion of tumours and normal tissues. Four-dimensional radiotherapy accounts for respiratory motion during imaging, planning and radiation delivery, and requires a 4D CT image in which the internal anatomy motion as a function of the respiratory cycle can be quantified. The aims of our research were (a) to develop a method to acquire 4D CT images from a spiral CT scan using an external respiratory signal and (b) to examine the potential utility of 4D CT imaging. A commercially available respiratory motion monitoring system provided an 'external' tracking signal of the patient's breathing. Simultaneous recording of a TTL 'X-Ray ON' signal from the CT scanner indicated the start time of CT image acquisition, thus facilitating time stamping of all subsequent images. An over-sampled spiral CT scan was acquired using a pitch of 0.5 and a scanner rotation time of 1.5 s. Each image from such a scan was sorted into an image bin that corresponded with the phase of the respiratory cycle in which the image was acquired. The complete set of such image bins accumulated over a respiratory cycle constitutes a 4D CT dataset. Four-dimensional CT datasets of a mechanical oscillator phantom and a patient undergoing lung radiotherapy were acquired. Motion artefacts were significantly reduced in the images in the 4D CT dataset compared to the three-dimensional (3D) images, in which respiratory motion was not accounted for. Accounting for respiratory motion using 4D CT imaging is feasible and yields images with less distortion than 3D images. 4D images also contain respiratory motion information not available in a 3D CT image.
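
    The sorting step described above is phase binning: each time-stamped image is assigned to the bin matching its point in the breathing cycle. A minimal sketch, assuming for simplicity a perfectly periodic respiratory signal (the real method derives phase from the measured external trace, not from a fixed period):

```python
def phase_of(t, period):
    """Respiratory phase in [0, 1): fraction of the breathing cycle elapsed at time t."""
    return (t % period) / period

def bin_images(image_times, period, n_bins=8):
    """Sort time-stamped CT images into phase bins. Each bin collects the
    indices of images acquired at the same point of the respiratory cycle;
    the full set of bins constitutes a 4D CT dataset."""
    bins = {b: [] for b in range(n_bins)}
    for i, t in enumerate(image_times):
        b = int(phase_of(t, period) * n_bins)
        bins[b].append(i)
    return bins
```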

  13. Establishment of Groundwater Arsenic Potential Distribution and Discrimination in Taiwan

    Science.gov (United States)

    Tsai, Kuo Sheng; Chen, Yu Ying; Chung Liu, Chih; Lin, Chien Wen

    2016-04-01

    According to the last 10 years of groundwater monitoring data in Taiwan, arsenic concentrations have increased rapidly in some areas. As in Bengal and India, the main source of arsenic-polluted groundwater is geological sediments, with arsenic released through reducing reactions. Many studies indicate that high concentrations of arsenic in groundwater pose a risk to water safety; for example, irrigating farmland with arsenic-containing water increases arsenic concentrations in soil and crops. Given insufficient water supplies, management of water usage, rather than remediation, is the practical approach. The Taiwan EPA has therefore developed procedures for delineating arsenic contamination potential areas and discriminating arsenic sources. The Taiwan EPA uses these procedures to manage groundwater use and to propose appropriate uses of arsenic-bearing groundwater for different purposes. Agencies can combine them with water quality standards or water needs, studying appropriate water purification methods as well as groundwater depth and water consumption, thereby achieving the goals of water safety and environmental protection and providing a reference for policies to control total arsenic concentration in groundwater. Keywords: Arsenic; Distribution; Discrimination; Pollution potential area of Arsenic; Origin evaluation of groundwater Arsenic

  14. Theoretical Analysis of Potential and Current Distributions in Planar Electrodes of Lithium-ion Batteries

    International Nuclear Information System (INIS)

    Taheri, Peyman; Mansouri, Abraham; Yazdanpour, Maryam; Bahrami, Majid

    2014-01-01

    An analytical model is proposed to describe the two-dimensional distribution of potential and current in planar electrodes of pouch-type lithium-ion batteries. A concentration-independent polarization expression, obtained experimentally, is used to mimic the electrochemical performance of the battery. By numerically solving the charge balance equation on each electrode in conjunction with the polarization expression, the battery behavior during constant-current discharge processes is simulated. Our numerical simulations show that reaction current between the electrodes remains approximately uniform during most of the discharge process, in particular, when depth-of-discharge varies from 5% to 85%. This observation suggests simplifying the electrochemical behavior of the battery such that the charge balance equation on each electrode can be solved analytically to obtain closed-form solutions for potential and current density distributions. The analytical model shows fair agreement with numerical data at modest computational cost. The model is applicable for both charge and discharge processes, and its application is demonstrated for a prismatic 20 Ah nickel-manganese-cobalt lithium-ion battery during discharge processes.
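
    The charge balance the abstract refers to can be sketched as follows. The symbols and signs below are my own generic notation for this class of pouch-cell model, not necessarily the paper's: on each planar electrode, in-plane Ohmic conduction balances the local reaction current density $J$ exchanged between the electrodes,

$$\sigma_p h_p \, \nabla^2 \phi_p = J, \qquad \sigma_n h_n \, \nabla^2 \phi_n = -J,$$

    where $\sigma$ is the electrode conductivity, $h$ its thickness, and $\phi$ the in-plane potential. The system is closed by the experimentally fitted polarization relation $J = f(\phi_p - \phi_n)$; the paper's observation that $J$ stays nearly uniform over most of the discharge is what makes these equations tractable in closed form.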

  15. Large Scale Flood Risk Analysis using a New Hyper-resolution Population Dataset

    Science.gov (United States)

    Smith, A.; Neal, J. C.; Bates, P. D.; Quinn, N.; Wing, O.

    2017-12-01

    Here we present the first national-scale flood risk analyses using high-resolution Facebook Connectivity Lab population data and data from a hyper-resolution flood hazard model. In recent years the field of large-scale hydraulic modelling has been transformed by new remotely sensed datasets, improved process representation, highly efficient flow algorithms and increases in computational power. These developments have allowed flood risk analysis to be undertaken in previously unmodeled territories and from continental to global scales. Flood risk analyses are typically conducted via the integration of modelled water depths with an exposure dataset. Over large scales and in data-poor areas, these exposure data typically take the form of a gridded population dataset, estimating population density using remotely sensed data and/or locally available census data. The local nature of flooding dictates that, for robust flood risk analysis to be undertaken, both hazard and exposure data should sufficiently resolve local-scale features. Global flood frameworks are enabling flood hazard data to be produced at 90m resolution, resulting in a mismatch with available population datasets, which are typically more coarsely resolved. Moreover, these exposure data are typically focused on urban areas and struggle to represent rural populations. In this study we integrate a new population dataset with a global flood hazard model. The population dataset was produced by the Connectivity Lab at Facebook, providing gridded population data at 5m resolution, representing a resolution increase over previous countrywide datasets of multiple orders of magnitude. Flood risk analyses undertaken over a number of developing countries are presented, along with a comparison of flood risk analyses undertaken using pre-existing population datasets.
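
    The hazard-exposure integration the abstract describes amounts to summing population over cells where modelled water depth exceeds a threshold. A minimal sketch over co-registered, flattened grids (the function name and threshold default are mine, for illustration):

```python
def population_at_risk(depth_grid, pop_grid, threshold=0.0):
    """Integrate hazard with exposure: total population in cells where the
    modelled water depth exceeds `threshold` (grids are co-registered and
    flattened to 1-D for simplicity)."""
    return sum(p for d, p in zip(depth_grid, pop_grid) if d > threshold)
```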

  16. Comparing the accuracy of food outlet datasets in an urban environment

    Directory of Open Access Journals (Sweden)

    Michelle S. Wong

    2017-05-01

    Full Text Available Studies that investigate the relationship between the retail food environment and health outcomes often use geospatial datasets. Prior studies have identified challenges of using the most common data sources. Retail food environment datasets created through academic-government partnership present an alternative, but their validity (retail existence, type, location) has not been assessed yet. In our study, we used ground-truth data to compare the validity of two datasets, a 2015 commercial dataset (InfoUSA) and data collected from 2012 to 2014 through the Maryland Food Systems Mapping Project (MFSMP), an academic-government partnership, on the retail food environment in two low-income, inner city neighbourhoods in Baltimore City. We compared sensitivity and positive predictive value (PPV) of the commercial and academic-government partnership data to ground-truth data for two broad categories of unhealthy food retailers: small food retailers and quick-service restaurants. Ground-truth data was collected in 2015 and analysed in 2016. Compared to the ground-truth data, MFSMP and InfoUSA generally had similar sensitivity that was greater than 85%. MFSMP had higher PPV compared to InfoUSA for both small food retailers (MFSMP: 56.3% vs InfoUSA: 40.7%) and quick-service restaurants (MFSMP: 58.6% vs InfoUSA: 36.4%). We conclude that data from academic-government partnerships like MFSMP might be an attractive alternative option and improvement to relying only on commercial data. Other research institutes or cities might consider efforts to create and maintain such an environmental dataset. Even if these datasets cannot be updated on an annual basis, they are likely more accurate than commercial data.

  17. Comparing the accuracy of food outlet datasets in an urban environment.

    Science.gov (United States)

    Wong, Michelle S; Peyton, Jennifer M; Shields, Timothy M; Curriero, Frank C; Gudzune, Kimberly A

    2017-05-11

    Studies that investigate the relationship between the retail food environment and health outcomes often use geospatial datasets. Prior studies have identified challenges of using the most common data sources. Retail food environment datasets created through academic-government partnership present an alternative, but their validity (retail existence, type, location) has not been assessed yet. In our study, we used ground-truth data to compare the validity of two datasets, a 2015 commercial dataset (InfoUSA) and data collected from 2012 to 2014 through the Maryland Food Systems Mapping Project (MFSMP), an academic-government partnership, on the retail food environment in two low-income, inner city neighbourhoods in Baltimore City. We compared sensitivity and positive predictive value (PPV) of the commercial and academic-government partnership data to ground-truth data for two broad categories of unhealthy food retailers: small food retailers and quick-service restaurants. Ground-truth data was collected in 2015 and analysed in 2016. Compared to the ground-truth data, MFSMP and InfoUSA generally had similar sensitivity that was greater than 85%. MFSMP had higher PPV compared to InfoUSA for both small food retailers (MFSMP: 56.3% vs InfoUSA: 40.7%) and quick-service restaurants (MFSMP: 58.6% vs InfoUSA: 36.4%). We conclude that data from academic-government partnerships like MFSMP might be an attractive alternative option and improvement to relying only on commercial data. Other research institutes or cities might consider efforts to create and maintain such an environmental dataset. Even if these datasets cannot be updated on an annual basis, they are likely more accurate than commercial data.
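
    The two validation metrics used above have simple definitions: sensitivity is the share of ground-truth outlets a dataset found, and PPV is the share of a dataset's listed outlets that actually exist. A minimal sketch in Python:

```python
def sensitivity(true_pos, false_neg):
    """Share of ground-truth outlets the dataset correctly lists."""
    return true_pos / (true_pos + false_neg)

def ppv(true_pos, false_pos):
    """Positive predictive value: share of the dataset's listed outlets
    that were confirmed by ground-truthing."""
    return true_pos / (true_pos + false_pos)
```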

  18. Mapping the spatial distribution of Aedes aegypti and Aedes albopictus.

    Science.gov (United States)

    Ding, Fangyu; Fu, Jingying; Jiang, Dong; Hao, Mengmeng; Lin, Gang

    2018-02-01

    Mosquito-borne infectious diseases, such as Rift Valley fever, Dengue, Chikungunya and Zika, have caused mass human death, with their transnational expansion fueled by economic globalization. Simulating the distribution of the disease vectors is of great importance in formulating public health planning and disease control strategies. In the present study, we simulated the global distribution of Aedes aegypti and Aedes albopictus at a 5 × 5 km spatial resolution with high-dimensional multidisciplinary datasets and machine learning methods. Three relatively popular and robust machine learning models, including support vector machine (SVM), gradient boosting machine (GBM) and random forest (RF), were used. During the fine-tuning process based on training datasets of A. aegypti and A. albopictus, RF models achieved the highest performance with an area under the curve (AUC) of 0.973 and 0.974, respectively, followed by GBM (AUC of 0.971 and 0.972, respectively) and SVM (AUC of 0.963 and 0.964, respectively) models. The simulation difference between RF and GBM models was not statistically significant (p>0.05) based on the validation datasets, whereas statistically significant differences (p<0.05) were observed for RF and GBM simulations compared with SVM simulations. From the simulated maps derived from RF models, we observed that the distribution of A. albopictus was wider than that of A. aegypti along a latitudinal gradient. The discriminatory power of each factor in simulating the global distribution of the two species was also analyzed. Our results provided fundamental information for further study on disease transmission simulation and risk assessment. Copyright © 2017 Elsevier B.V. All rights reserved.
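
    The AUC scores used to rank the models above can be computed without any curve construction via the rank-sum (Mann-Whitney) identity: AUC is the probability that a randomly chosen presence cell is scored higher than a randomly chosen absence cell. A minimal sketch (quadratic in the number of points, fine for illustration):

```python
def auc(scores_pos, scores_neg):
    """Area under the ROC curve via the Mann-Whitney identity:
    P(positive outscores negative), counting ties as half a win."""
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in scores_pos for n in scores_neg)
    return wins / (len(scores_pos) * len(scores_neg))
```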

  19. Detecting Distributed Scans Using High-Performance Query-DrivenVisualization

    Energy Technology Data Exchange (ETDEWEB)

    Stockinger, Kurt; Bethel, E. Wes; Campbell, Scott; Dart, Eli; Wu,Kesheng

    2006-09-01

    Modern forensic analytics applications, like network traffic analysis, perform high-performance hypothesis testing, knowledge discovery and data mining on very large datasets. One essential strategy to reduce the time required for these operations is to select only the most relevant data records for a given computation. In this paper, we present a set of parallel algorithms that demonstrate how an efficient selection mechanism -- bitmap indexing -- significantly speeds up a common analyst task, namely, computing conditional histograms on very large datasets. We present a thorough study of the performance characteristics of the parallel conditional histogram algorithms. As a case study, we compute conditional histograms for detecting distributed scans hidden in a dataset consisting of approximately 2.5 billion network connection records. We show that these conditional histograms can be computed on an interactive timescale (i.e., in seconds). We also show how to progressively modify the selection criteria to narrow the analysis and find the sources of the distributed scans.
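
    A conditional histogram is simply a histogram computed over only the records matching a selection predicate; bitmap indexing makes that selection fast on billions of rows. A minimal sketch of the semantics, with a plain linear scan standing in for the bitmap index (which is the part the paper parallelizes and accelerates):

```python
from collections import Counter

def conditional_histogram(records, condition, key):
    """Histogram of key(record) over only the records where condition holds.
    A bitmap index would replace this linear scan with fast bitwise
    operations over pre-built per-value bitmaps."""
    return Counter(key(r) for r in records if condition(r))
```

    For example, counting connections per source address conditioned on destination port 22 surfaces hosts probing SSH, which is the kind of query used to expose distributed scans.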

  20. Global-scale evaluation of 22 precipitation datasets using gauge observations and hydrological modeling

    Directory of Open Access Journals (Sweden)

    H. E. Beck

    2017-12-01

    Full Text Available We undertook a comprehensive evaluation of 22 gridded (quasi-)global (sub-)daily precipitation (P) datasets for the period 2000–2016. Thirteen non-gauge-corrected P datasets were evaluated using daily P gauge observations from 76 086 gauges worldwide. Another nine gauge-corrected datasets were evaluated using hydrological modeling, by calibrating the HBV conceptual model against streamflow records for each of 9053 small to medium-sized (< 50 000 km2) catchments worldwide, and comparing the resulting performance. Marked differences in spatio-temporal patterns and accuracy were found among the datasets. Among the uncorrected P datasets, the satellite- and reanalysis-based MSWEP-ng V1.2 and V2.0 datasets generally showed the best temporal correlations with the gauge observations, followed by the reanalyses (ERA-Interim, JRA-55, and NCEP-CFSR), the satellite- and reanalysis-based CHIRP V2.0 dataset, the estimates based primarily on passive microwave remote sensing of rainfall (CMORPH V1.0, GSMaP V5/6, and TMPA 3B42RT V7) or near-surface soil moisture (SM2RAIN-ASCAT), and finally, estimates based primarily on thermal infrared imagery (GridSat V1.0, PERSIANN, and PERSIANN-CCS). Two of the three reanalyses (ERA-Interim and JRA-55) unexpectedly obtained lower trend errors than the satellite datasets. Among the corrected P datasets, the ones directly incorporating daily gauge data (CPC Unified, and MSWEP V1.2 and V2.0) generally provided the best calibration scores, although the good performance of the fully gauge-based CPC Unified is unlikely to translate to sparsely or ungauged regions. Next best results were obtained with P estimates directly incorporating temporally coarser gauge data (CHIRPS V2.0, GPCP-1DD V1.2, TMPA 3B42 V7, and WFDEI-CRU), which in turn outperformed the one indirectly incorporating gauge data through another multi-source dataset (PERSIANN-CDR V1R1). Our results highlight large differences in estimation accuracy
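
    The "temporal correlations with the gauge observations" used to rank the uncorrected datasets are, in the simplest reading, per-gauge Pearson correlations between the gridded and observed daily precipitation series. A minimal sketch (the paper may use a different or additional skill score; this is the generic metric):

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length time series,
    e.g. gridded vs gauge-observed daily precipitation at one site."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)
```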