statistical data: Topics by WorldWideScience.org

Sample records for statistical data

Statistical data analysis using SAS intermediate statistical methods

CERN Document Server

Marasinghe, Mervyn G

2018-01-01

The aim of this textbook (previously titled SAS for Data Analytics) is to teach the use of SAS for statistical analysis of data for advanced undergraduate and graduate students in statistics, data science, and disciplines involving analyzing data. The book begins with an introduction beyond the basics of SAS, illustrated with non-trivial, real-world, worked examples. It proceeds to SAS programming and applications, SAS graphics, statistical analysis of regression models, analysis of variance models, analysis of variance with random and mixed effects models, and then takes the discussion beyond regression and analysis of variance to conclude. Pedagogically, the authors introduce theory and methodological basis topic by topic, present a problem as an application, followed by a SAS analysis of the data provided and a discussion of results. The text focuses on applied statistical problems and methods. Key features include: end of chapter exercises, downloadable SAS code and data sets, and advanced material suitab...
[Big data in official statistics].

Science.gov (United States)

Zwick, Markus

2015-08-01

The concept of "big data" stands to change the face of official statistics over the coming years, having an impact on almost all aspects of data production. The tasks of future statisticians will not necessarily be to produce new data, but rather to identify and make use of existing data to adequately describe social and economic phenomena. Until big data can be used correctly in official statistics, a lot of questions need to be answered and problems solved: the quality of data, data protection, privacy, and the sustainable availability are some of the more pressing issues to be addressed. The essential skills of official statisticians will undoubtedly change, and this implies a number of challenges to be faced by statistical education systems, in universities, and inside the statistical offices. The national statistical offices of the European Union have concluded a concrete strategy for exploring the possibilities of big data for official statistics, by means of the Big Data Roadmap and Action Plan 1.0. This is an important first step and will have a significant influence on implementing the concept of big data inside the statistical offices of Germany.
Baseline Statistics of Linked Statistical Data

NARCIS (Netherlands)

Scharnhorst, Andrea; Meroño-Peñuela, Albert; Guéret, Christophe

2014-01-01

We are surrounded by an ever increasing ocean of information; everybody will agree to that. We build sophisticated strategies to govern this information: design data models, develop infrastructures for data sharing, build tools for data analysis. Statistical datasets curated by National
School Violence: Data & Statistics

Science.gov (United States)

... Social Media Publications Injury Center School Violence: Data & Statistics The first ... Vehicle Safety Traumatic Brain Injury Injury Response Data & Statistics (WISQARS) Funded Programs Press Room Social Media Publications ...
Cancer Data and Statistics Tools

Science.gov (United States)

... Educational Campaigns Initiatives Stay Informed Cancer Data and Statistics Tools Cancer Statistics Tools United States Cancer Statistics: Data Visualizations The ...
Tuberculosis Data and Statistics

Science.gov (United States)

... Advisory Groups Federal TB Task Force Data and Statistics ... Set) Mortality and Morbidity Weekly Reports Data and Statistics Decrease in Reported Tuberculosis Cases MMWR 2010; 59 ( ...
Official statistics and Big Data

Directory of Open Access Journals (Sweden)

Peter Struijs

2014-07-01

The rise of Big Data changes the context in which organisations producing official statistics operate. Big Data provides opportunities, but in order to make optimal use of Big Data, a number of challenges have to be addressed. This stimulates increased collaboration between National Statistical Institutes, Big Data holders, businesses and universities. In time, this may lead to a shift in the role of statistical institutes in the provision of high-quality and impartial statistical information to society. In this paper, the changes in context, the opportunities, the challenges and the way to collaborate are addressed. The collaboration between the various stakeholders will involve each partner building on and contributing different strengths. For national statistical offices, traditional strengths include, on the one hand, the ability to collect data and combine data sources with statistical products and, on the other hand, their focus on quality, transparency and sound methodology. In the Big Data era of competing and multiplying data sources, they continue to have a unique knowledge of official statistical production methods. And their impartiality and respect for privacy as enshrined in law uniquely position them as a trusted third party. Based on this, they may advise on the quality and validity of information of various sources. By thus positioning themselves, they will be able to play their role as key information providers in a changing society.
Statistical methods for ranking data

CERN Document Server

Alvo, Mayer

2014-01-01

This book introduces advanced undergraduate, graduate students and practitioners to statistical methods for ranking data. An important aspect of nonparametric statistics is oriented towards the use of ranking data. Rank correlation is defined through the notion of distance functions and the notion of compatibility is introduced to deal with incomplete data. Ranking data are also modeled using a variety of modern tools such as CART, MCMC, EM algorithm and factor analysis. This book deals with statistical methods used for analyzing such data and provides a novel and unifying approach for hypotheses testing. The techniques described in the book are illustrated with examples and the statistical software is provided on the authors’ website.
Muscular Dystrophy: Data and Statistics

Science.gov (United States)

... For… Media Policy Makers MD STARnet Data and Statistics The following data and statistics come from MD STARnet. Data from the MD ...
Statistical modeling for degradation data

CERN Document Server

Lio, Yuhlong; Ng, Hon; Tsai, Tzong-Ru

2017-01-01

This book focuses on the statistical aspects of the analysis of degradation data. In recent years, degradation data analysis has come to play an increasingly important role in different disciplines such as reliability, public health sciences, and finance. For example, information on products’ reliability can be obtained by analyzing degradation data. In addition, statistical modeling and inference techniques have been developed on the basis of different degradation measures. The book brings together experts engaged in statistical modeling and inference, presenting and discussing important recent advances in degradation data analysis and related applications. The topics covered are timely and have considerable potential to impact both statistics and reliability engineering.
UN Data- Environmental Statistics: Waste

Data.gov (United States)

World Wide Human Geography Data Working Group — The Environment Statistics Database contains selected water and waste statistics by country. Statistics on water and waste are based on official statistics supplied...
UN Data: Environment Statistics: Waste

Data.gov (United States)

World Wide Human Geography Data Working Group — The Environment Statistics Database contains selected water and waste statistics by country. Statistics on water and waste are based on official statistics supplied...
Beginning statistics with data analysis

CERN Document Server

Mosteller, Frederick; Rourke, Robert EK

2013-01-01

This introduction to the world of statistics covers exploratory data analysis, methods for collecting data, formal statistical inference, and techniques of regression and analysis of variance. 1983 edition.
Advanced statistical methods in data science

CERN Document Server

Chen, Jiahua; Lu, Xuewen; Yi, Grace; Yu, Hao

2016-01-01

This book gathers invited presentations from the 2nd Symposium of the ICSA-CANADA Chapter held at the University of Calgary from August 4-6, 2015. The aim of this Symposium was to promote advanced statistical methods in big-data sciences and to allow researchers to exchange ideas on statistics and data science and to embrace the challenges and opportunities of statistics and data science in the modern world. It addresses diverse themes in advanced statistical analysis in big-data sciences, including methods for administrative data analysis, survival data analysis, missing data analysis, high-dimensional and genetic data analysis, longitudinal and functional data analysis, the design and analysis of studies with response-dependent and multi-phase designs, time series and robust statistics, statistical inference based on likelihood, empirical likelihood and estimating functions. The editorial group selected 14 high-quality presentations from this successful symposium and invited the presenters to prepare a fu...
Statistical methods for astronomical data analysis

CERN Document Server

Chattopadhyay, Asis Kumar

2014-01-01

This book introduces “Astrostatistics” as a subject in its own right with rewarding examples, including work by the authors with galaxy and Gamma Ray Burst data to engage the reader. This includes a comprehensive blending of Astrophysics and Statistics. The first chapter’s coverage of preliminary concepts and terminologies for astronomical phenomenon will appeal to both Statistics and Astrophysics readers as helpful context. Statistics concepts covered in the book provide a methodological framework. A unique feature is the inclusion of different possible sources of astronomical data, as well as software packages for converting the raw data into appropriate forms for data analysis. Readers can then use the appropriate statistical packages for their particular data analysis needs. The ideas of statistical inference discussed in the book help readers determine how to apply statistical tests. The authors cover different applications of statistical techniques already developed or specifically introduced for ...
Data and Statistics: Heart Failure

Science.gov (United States)

... Summary Coverdell Program 2012-2015 State Summaries Data & Statistics Fact Sheets Heart Disease and Stroke Fact Sheets ... Roadmap for State Planning Other Data Resources Other Statistic Resources Grantee Information Cross-Program Information Online Tools ...
Data Literacy is Statistical Literacy

Science.gov (United States)

Gould, Robert

2017-01-01

Past definitions of statistical literacy should be updated in order to account for the greatly amplified role that data now play in our lives. Experience working with high-school students in an innovative data science curriculum has shown that teaching statistical literacy, augmented by data literacy, can begin early.
Statistical data analysis handbook

National Research Council Canada - National Science Library

Wall, Francis J

1986-01-01

It must be emphasized that this is not a text book on statistics. Instead it is a working tool that presents data analysis in clear, concise terms which can be readily understood even by those without formal training in statistics...
Statistical analysis and data management

International Nuclear Information System (INIS)

Anon.

1981-01-01

This report provides an overview of the history of the WIPP Biology Program. The recommendations of the American Institute of Biological Sciences (AIBS) for the WIPP biology program are summarized. The data sets available for statistical analyses and problems associated with these data sets are also summarized. Biological studies base maps are presented. A statistical model is presented to evaluate any correlation between climatological data and small mammal captures. No statistically significant relationship between variance in small mammal captures on Dr. Gennaro's 90m x 90m grid and precipitation records from the Duval Potash Mine was found.
Does environmental data collection need statistics?

NARCIS (Netherlands)

Pulles, M.P.J.

1998-01-01

The term 'statistics' with reference to environmental science and policymaking might mean different things: the development of statistical methodology, the methodology developed by statisticians to interpret and analyse such data, or the statistical data that are needed to understand environmental

Statistical Methods for Fuzzy Data

CERN Document Server

Viertl, Reinhard

2011-01-01

Statistical data are not always precise numbers, or vectors, or categories. Real data are frequently what is called fuzzy. Examples where this fuzziness is obvious are quality of life data, environmental, biological, medical, sociological and economics data. Also the results of measurements can be best described by using fuzzy numbers and fuzzy vectors respectively. Statistical analysis methods have to be adapted for the analysis of fuzzy data. In this book, the foundations of the description of fuzzy data are explained, including methods on how to obtain the characterizing function of fuzzy m
Data and Statistics

Science.gov (United States)

... About Us Information For… Media Policy Makers Data & Statistics Sickle cell ... 1999 through 2002. This drop coincided with the introduction in 2000 of a vaccine that protects against ...
Equivalent statistics and data interpretation.

Science.gov (United States)

Francis, Gregory

2017-08-01

Recent reform efforts in psychological science have led to a plethora of choices for scientists to analyze their data. A scientist making an inference about their data must now decide whether to report a p value, summarize the data with a standardized effect size and its confidence interval, report a Bayes Factor, or use other model comparison methods. To make good choices among these options, it is necessary for researchers to understand the characteristics of the various statistics used by the different analysis frameworks. Toward that end, this paper makes two contributions. First, it shows that for the case of a two-sample t test with known sample sizes, many different summary statistics are mathematically equivalent in the sense that they are based on the very same information in the data set. When the sample sizes are known, the p value provides as much information about a data set as the confidence interval of Cohen's d or a JZS Bayes factor. Second, this equivalence means that different analysis methods differ only in their interpretation of the empirical data. At first glance, it might seem that mathematical equivalence of the statistics suggests that it does not matter much which statistic is reported, but the opposite is true because the appropriateness of a reported statistic is relative to the inference it promotes. Accordingly, scientists should choose an analysis method appropriate for their scientific investigation. A direct comparison of the different inferential frameworks provides some guidance for scientists to make good choices and improve scientific practice.
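The equivalence described in this abstract can be illustrated for one pair of statistics. The following is a minimal sketch, not the paper's own code: it assumes the pooled-variance (equal-variance) two-sample t test, and the function names `pooled_t_and_d` and `d_from_t` are hypothetical. It shows that once the sample sizes n1 and n2 are known, Cohen's d follows from t alone via d = t * sqrt(1/n1 + 1/n2).

```python
import math
from statistics import mean, variance


def pooled_t_and_d(x, y):
    """Return (t, Cohen's d) for two samples, using the pooled standard deviation."""
    n1, n2 = len(x), len(y)
    # Pooled variance: sample variances weighted by their degrees of freedom.
    sp2 = ((n1 - 1) * variance(x) + (n2 - 1) * variance(y)) / (n1 + n2 - 2)
    sp = math.sqrt(sp2)
    diff = mean(x) - mean(y)
    t = diff / (sp * math.sqrt(1 / n1 + 1 / n2))
    d = diff / sp
    return t, d


def d_from_t(t, n1, n2):
    """Recover Cohen's d from t when only the sample sizes are known."""
    return t * math.sqrt(1 / n1 + 1 / n2)


if __name__ == "__main__":
    x = [5.1, 4.9, 6.2, 5.8, 5.5]
    y = [4.2, 4.8, 4.1, 5.0]
    t, d = pooled_t_and_d(x, y)
    # The two routes to d agree up to floating-point rounding.
    print(t, d, d_from_t(t, len(x), len(y)))
```

The p value and the JZS Bayes factor mentioned in the abstract are likewise functions of t and the sample sizes, but computing them needs distribution functions that are left out of this sketch.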
Powerful Statistical Inference for Nested Data Using Sufficient Summary Statistics

Science.gov (United States)

Dowding, Irene; Haufe, Stefan

2018-01-01

Hierarchically-organized data arise naturally in many psychology and neuroscience studies. As the standard assumption of independent and identically distributed samples does not hold for such data, two important problems are to accurately estimate group-level effect sizes, and to obtain powerful statistical tests against group-level null hypotheses. A common approach is to summarize subject-level data by a single quantity per subject, which is often the mean or the difference between class means, and treat these as samples in a group-level t-test. This “naive” approach is, however, suboptimal in terms of statistical power, as it ignores information about the intra-subject variance. To address this issue, we review several approaches to deal with nested data, with a focus on methods that are easy to implement. With what we call the sufficient-summary-statistic approach, we highlight a computationally efficient technique that can improve statistical power by taking into account within-subject variances, and we provide step-by-step instructions on how to apply this approach to a number of frequently-used measures of effect size. The properties of the reviewed approaches and the potential benefits over a group-level t-test are quantitatively assessed on simulated data and demonstrated on EEG data from a simulated-driving experiment. PMID:29615885
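The idea of using within-subject variances rather than subject means alone can be sketched with fixed-effect inverse-variance weighting. This is an illustrative assumption about one simple way to do it, not necessarily the exact procedure of the paper; the function name `inverse_variance_z` and its inputs are hypothetical.

```python
import math


def inverse_variance_z(means, variances, counts):
    """Combine per-subject means into a group-level estimate and z statistic.

    means:     per-subject sample means
    variances: per-subject sample variances (within-subject)
    counts:    number of observations per subject
    Assumes a fixed-effect model: every subject shares the same true effect.
    """
    # The variance of a subject mean is s_i^2 / n_i, so its weight is n_i / s_i^2.
    weights = [n / v for v, n in zip(variances, counts)]
    total = sum(weights)
    pooled = sum(w * m for w, m in zip(weights, means)) / total
    se = math.sqrt(1.0 / total)
    return pooled, pooled / se


if __name__ == "__main__":
    # Two subjects with equal within-subject variance but different sample counts:
    # the subject with more observations gets the larger weight.
    print(inverse_variance_z([1.0, 3.0], [4.0, 4.0], [4, 12]))
```

A plain group-level t test would weight both subjects equally; here the subject with more observations, and therefore a more precise mean, pulls the pooled estimate toward its own value.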
Birth Defects Data and Statistics

Science.gov (United States)

... Information For… Media Policy Makers Data & Statistics On This ... and critical. Read below for the latest national statistics on the occurrence of birth defects in the ...
Spina Bifida Data and Statistics

Science.gov (United States)

... Us Information For… Media Policy Makers Data and Statistics Spina bifida ... the spine. Read below for the latest national statistics on spina bifida in the United States. In ...
Statistical Models and Methods for Lifetime Data

CERN Document Server

Lawless, Jerald F

2011-01-01

Praise for the First Edition: "An indispensable addition to any serious collection on lifetime data analysis and . . . a valuable contribution to the statistical literature. Highly recommended . . ." -Choice. "This is an important book, which will appeal to statisticians working on survival analysis problems." -Biometrics. "A thorough, unified treatment of statistical models and methods used in the analysis of lifetime data . . . this is a highly competent and agreeable statistical textbook." -Statistics in Medicine. The statistical analysis of lifetime or response time data is a key tool in engineering,
Topology for statistical modeling of petascale data.

Energy Technology Data Exchange (ETDEWEB)

Pascucci, Valerio (University of Utah, Salt Lake City, UT); Mascarenhas, Ajith Arthur; Rusek, Korben (Texas A&M University, College Station, TX); Bennett, Janine Camille; Levine, Joshua (University of Utah, Salt Lake City, UT); Pebay, Philippe Pierre; Gyulassy, Attila (University of Utah, Salt Lake City, UT); Thompson, David C.; Rojas, Joseph Maurice (Texas A&M University, College Station, TX)

2011-07-01

This document presents current technical progress and dissemination of results for the Mathematics for Analysis of Petascale Data (MAPD) project titled 'Topology for Statistical Modeling of Petascale Data', funded by the Office of Science Advanced Scientific Computing Research (ASCR) Applied Math program. Many commonly used algorithms for mathematical analysis do not scale well enough to accommodate the size or complexity of petascale data produced by computational simulations. The primary goal of this project is thus to develop new mathematical tools that address both the petascale size and uncertain nature of current data. At a high level, our approach is based on the complementary techniques of combinatorial topology and statistical modeling. In particular, we use combinatorial topology to filter out spurious data that would otherwise skew statistical modeling techniques, and we employ advanced algorithms from algebraic statistics to efficiently find globally optimal fits to statistical models. This document summarizes the technical advances we have made to date that were made possible in whole or in part by MAPD funding. These technical contributions can be divided loosely into three categories: (1) advances in the field of combinatorial topology, (2) advances in statistical modeling, and (3) new integrated topological and statistical methods.
Enterprise Human Resources Integration-Statistical Data Mart (EHRI-SDM) Status Data

Data.gov (United States)

Office of Personnel Management — The Enterprise Human Resources Integration-Statistical Data Mart (EHRI-SDM) is a statistically cleansed sub-set of the data contained in the EHRI data warehouse. It...
Enterprise Human Resources Integration-Statistical Data Mart (EHRI-SDM) Dynamics Data

Data.gov (United States)

Office of Personnel Management — The Enterprise Human Resources Integration-Statistical Data Mart (EHRI-SDM) is a statistically cleansed sub-set of the data contained in the EHRI data warehouse. It...
Statistical Inference for Data Adaptive Target Parameters.

Science.gov (United States)

Hubbard, Alan E; Kherad-Pajouh, Sara; van der Laan, Mark J

2016-05-01

Consider one observes n i.i.d. copies of a random variable with a probability distribution that is known to be an element of a particular statistical model. In order to define our statistical target we partition the sample in V equal size sub-samples, and use this partitioning to define V splits in an estimation sample (one of the V subsamples) and corresponding complementary parameter-generating sample. For each of the V parameter-generating samples, we apply an algorithm that maps the sample to a statistical target parameter. We define our sample-split data adaptive statistical target parameter as the average of these V-sample specific target parameters. We present an estimator (and corresponding central limit theorem) of this type of data adaptive target parameter. This general methodology for generating data adaptive target parameters is demonstrated with a number of practical examples that highlight new opportunities for statistical learning from data. This new framework provides a rigorous statistical methodology for both exploratory and confirmatory analysis within the same data. Given that more research is becoming "data-driven", the theory developed within this paper provides a new impetus for a greater involvement of statistical inference into problems that are being increasingly addressed by clever, yet ad hoc pattern finding methods. To suggest such potential, and to verify the predictions of the theory, extensive simulation studies, along with a data analysis based on adaptively determined intervention rules are shown and give insight into how to structure such an approach. The results show that the data adaptive target parameter approach provides a general framework and resulting methodology for data-driven science.
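The sample-splitting construction in this abstract can be sketched on a toy problem. Everything here is an illustrative assumption rather than the authors' implementation: the helper names are hypothetical, the "algorithm" simply picks the column with the largest mean on the parameter-generating sample, and the target is then estimated as that column's mean on the held-out estimation sample.

```python
from statistics import mean


def sample_split_estimate(rows, V, choose, estimate):
    """Average, over V splits, of a data-adaptively chosen target parameter.

    rows:     list of observations (here, tuples of numbers)
    choose:   maps the parameter-generating sample to a target (hypothetical hook)
    estimate: estimates that target on the estimation sample (hypothetical hook)
    """
    # Interleaved folds; they have equal size when len(rows) is divisible by V.
    folds = [rows[v::V] for v in range(V)]
    estimates = []
    for v in range(V):
        estimation = folds[v]
        # Complement of the estimation fold defines the target for this split.
        generating = [r for u, fold in enumerate(folds) if u != v for r in fold]
        target = choose(generating)
        estimates.append(estimate(target, estimation))
    return mean(estimates)


def choose_largest_mean_column(rows):
    """Toy algorithm: index of the column with the largest sample mean."""
    k = len(rows[0])
    return max(range(k), key=lambda j: mean(r[j] for r in rows))


def column_mean(j, rows):
    """Toy estimator: sample mean of column j."""
    return mean(r[j] for r in rows)


if __name__ == "__main__":
    data = [(0.0, 1.0), (0.0, 3.0), (0.0, 1.0), (0.0, 3.0)]
    print(sample_split_estimate(data, 2, choose_largest_mean_column, column_mean))
```

Because each target is chosen on one part of the data and estimated on the other, the choice does not bias the estimate within a split. The central limit theorem and inference discussed in the abstract are not covered by this sketch.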
Statistical Literacy: Data Tell a Story

Science.gov (United States)

Sole, Marla A.

2016-01-01

Every day, students collect, organize, and analyze data to make decisions. In this data-driven world, people need to assess how much trust they can place in summary statistics. The results of every survey and the safety of every drug that undergoes a clinical trial depend on the correct application of appropriate statistics. Recognizing the…
DATA ON YOUTH, 1967, A STATISTICAL DOCUMENT.

Science.gov (United States)

SCHEIDER, GEORGE

THE DATA IN THIS REPORT ARE STATISTICS ON YOUTH THROUGHOUT THE UNITED STATES AND IN NEW YORK STATE. INCLUDED ARE DATA ON POPULATION, SCHOOL STATISTICS, EMPLOYMENT, FAMILY INCOME, JUVENILE DELINQUENCY AND YOUTH CRIME (INCLUDING NEW YORK CITY FIGURES), AND TRAFFIC ACCIDENTS. THE STATISTICS ARE PRESENTED IN THE TEXT AND IN TABLES AND CHARTS. (NH)
Statistics and analysis of scientific data

CERN Document Server

Bonamente, Massimiliano

2013-01-01

Statistics and Analysis of Scientific Data covers the foundations of probability theory and statistics, and a number of numerical and analytical methods that are essential for the present-day analyst of scientific data. Topics covered include probability theory, distribution functions of statistics, fits to two-dimensional datasheets and parameter estimation, Monte Carlo methods and Markov chains. Equal attention is paid to the theory and its practical application, and results from classic experiments in various fields are used to illustrate the importance of statistics in the analysis of scientific data. The main pedagogical method is a theory-then-application approach, where emphasis is placed first on a sound understanding of the underlying theory of a topic, which becomes the basis for an efficient and proactive use of the material for practical applications. The level is appropriate for undergraduates and beginning graduate students, and as a reference for the experienced researcher. Basic calculus is us...
A Statistical Toolkit for Data Analysis

International Nuclear Information System (INIS)

Donadio, S.; Guatelli, S.; Mascialino, B.; Pfeiffer, A.; Pia, M.G.; Ribon, A.; Viarengo, P.

2006-01-01

The present project aims to develop an open-source and object-oriented software Toolkit for statistical data analysis. Its statistical testing component contains a variety of Goodness-of-Fit tests, from Chi-squared to Kolmogorov-Smirnov, to lesser-known but generally much more powerful tests such as Anderson-Darling, Goodman, Fisz-Cramer-von Mises, Kuiper, Tiku. Thanks to the component-based design and the usage of the standard abstract interfaces for data analysis, this tool can be used by other data analysis systems or integrated in experimental software frameworks. This Toolkit has been released and is downloadable from the web. In this paper we describe the statistical details of the algorithms, the computational features of the Toolkit, and the code validation.
47 CFR 1.363 - Introduction of statistical data.

Science.gov (United States)

2010-10-01

... 47 Telecommunication 1 2010-10-01 2010-10-01 false Introduction of statistical data. 1.363 Section... Proceedings Evidence § 1.363 Introduction of statistical data. (a) All statistical studies, offered in... analyses, and experiments, and those parts of other studies involving statistical methodology shall be...
Data and Statistics: Women and Heart Disease

Science.gov (United States)

... Summary Coverdell Program 2012-2015 State Summaries Data & Statistics Fact Sheets Heart Disease and Stroke Fact Sheets ... Roadmap for State Planning Other Data Resources Other Statistic Resources Grantee Information Cross-Program Information Online Tools ...
Hemophilia Data and Statistics

Science.gov (United States)

... View public health webinars on blood disorders Data & Statistics ... genetic testing is done to diagnose hemophilia before birth. For the one-third ... rates and hospitalization rates for bleeding complications from hemophilia ...
A nonparametric spatial scan statistic for continuous data.

Science.gov (United States)

Jung, Inkyung; Cho, Ho Jin

2015-10-20

Spatial scan statistics are widely used for spatial cluster detection, and several parametric models exist. For continuous data, a normal-based scan statistic can be used. However, the performance of the model has not been fully evaluated for non-normal data. We propose a nonparametric spatial scan statistic based on the Wilcoxon rank-sum test statistic and compared the performance of the method with parametric models via a simulation study under various scenarios. The nonparametric method outperforms the normal-based scan statistic in terms of power and accuracy in almost all cases under consideration in the simulation study. The proposed nonparametric spatial scan statistic is therefore an excellent alternative to the normal model for continuous data and is especially useful for data following skewed or heavy-tailed distributions.
Statistical Methods for Unusual Count Data

DEFF Research Database (Denmark)

Guthrie, Katherine A.; Gammill, Hilary S.; Kamper-Jørgensen, Mads

2016-01-01

microchimerism data present challenges for statistical analysis, including a skewed distribution, excess zero values, and occasional large values. Methods for comparing microchimerism levels across groups while controlling for covariates are not well established. We compared statistical models for quantitative...... microchimerism values, applied to simulated data sets and 2 observed data sets, to make recommendations for analytic practice. Modeling the level of quantitative microchimerism as a rate via Poisson or negative binomial model with the rate of detection defined as a count of microchimerism genome equivalents per...

Statistical data filtration in neutron coincidence counting

International Nuclear Information System (INIS)

Beddingfield, D.H.; Menlove, H.O.

1992-11-01

We assessed the effectiveness of statistical data filtration to minimize the contribution of matrix materials in 200-L drums to the nondestructive assay of plutonium. The matrices examined were polyethylene, concrete, aluminum, iron, cadmium, and lead. Statistical filtration of neutron coincidence data improved the low-end sensitivity of coincidence counters. Spurious data arising from electrical noise, matrix spallation, and geometric effects were smoothed in a predictable fashion by the statistical filter. The filter effectively lowers the minimum detectable mass limit that can be achieved for plutonium assay using passive neutron coincidence counting.
Critical analysis of adsorption data statistically

Science.gov (United States)

Kaushal, Achla; Singh, S. K.

2017-10-01

Experimental data can be presented, computed, and critically analysed in a different way using statistics. A variety of statistical tests are used to make decisions about the significance and validity of the experimental data. In the present study, adsorption was carried out to remove zinc ions from contaminated aqueous solution using mango leaf powder. The experimental data was analysed statistically by hypothesis testing applying t test, paired t test and Chi-square test to (a) test the optimum value of the process pH, (b) verify the success of experiment and (c) study the effect of adsorbent dose in zinc ion removal from aqueous solutions. Comparison of calculated and tabulated values of t and χ² showed the results in favour of the data collected from the experiment and this has been shown on probability charts. K value for Langmuir isotherm was 0.8582 and m value for Freundlich adsorption isotherm obtained was 0.725, both are mango leaf powder.
Spatial Statistical Data Fusion (SSDF)

Science.gov (United States)

Braverman, Amy J.; Nguyen, Hai M.; Cressie, Noel

2013-01-01

As remote sensing for scientific purposes has transitioned from an experimental technology to an operational one, the selection of instruments has become more coordinated, so that the scientific community can exploit complementary measurements. However, technological and scientific heterogeneity across devices means that the statistical characteristics of the data they collect are different. The challenge addressed here is how to combine heterogeneous remote sensing data sets in a way that yields optimal statistical estimates of the underlying geophysical field, and provides rigorous uncertainty measures for those estimates. Different remote sensing data sets may have different spatial resolutions, different measurement error biases and variances, and other disparate characteristics. A state-of-the-art spatial statistical model was used to relate the true, but not directly observed, geophysical field to noisy, spatial aggregates observed by remote sensing instruments. The spatial covariances of the true field and the covariances of the true field with the observations were modeled. The observations are spatial averages of the true field values, over pixels, with different measurement noise superimposed. A kriging framework is used to infer optimal (minimum mean squared error and unbiased) estimates of the true field at point locations from pixel-level, noisy observations. A key feature of the spatial statistical model is the spatial mixed effects model that underlies it. The approach models the spatial covariance function of the underlying field using linear combinations of basis functions of fixed size. Approaches based on kriging require the inversion of very large spatial covariance matrices, and this is usually done by making simplifying assumptions about spatial covariance structure that simply do not hold for geophysical variables. In contrast, this method does not require these assumptions, and is also computationally much faster. This method is
Classification, (big) data analysis and statistical learning

CERN Document Server

Conversano, Claudio; Vichi, Maurizio

2018-01-01

This edited book focuses on the latest developments in classification, statistical learning, data analysis and related areas of data science, including statistical analysis of large datasets, big data analytics, time series clustering, integration of data from different sources, as well as social networks. It covers both methodological aspects as well as applications to a wide range of areas such as economics, marketing, education, social sciences, medicine, environmental sciences and the pharmaceutical industry. In addition, it describes the basic features of the software behind the data analysis results, and provides links to the corresponding codes and data sets where necessary. This book is intended for researchers and practitioners who are interested in the latest developments and applications in the field. The peer-reviewed contributions were presented at the 10th Scientific Meeting of the Classification and Data Analysis Group (CLADAG) of the Italian Statistical Society, held in Santa Margherita di Pul...
Statistics and analysis of scientific data

CERN Document Server

Bonamente, Massimiliano

2017-01-01

The revised second edition of this textbook provides the reader with a solid foundation in probability theory and statistics as applied to the physical sciences, engineering and related fields. It covers a broad range of numerical and analytical methods that are essential for the correct analysis of scientific data, including probability theory, distribution functions of statistics, fits to two-dimensional data and parameter estimation, Monte Carlo methods and Markov chains. Features new to this edition include: • a discussion of statistical techniques employed in business science, such as multiple regression analysis of multivariate datasets. • a new chapter on the various measures of the mean including logarithmic averages. • new chapters on systematic errors and intrinsic scatter, and on the fitting of data with bivariate errors. • a new case study and additional worked examples. • mathematical derivations and theoretical background material have been appropriately marked, to improve the readabili...
Statistically significant relational data mining

Energy Technology Data Exchange (ETDEWEB)

Berry, Jonathan W.; Leung, Vitus Joseph; Phillips, Cynthia Ann; Pinar, Ali; Robinson, David Gerald; Berger-Wolf, Tanya; Bhowmick, Sanjukta; Casleton, Emily; Kaiser, Mark; Nordman, Daniel J.; Wilson, Alyson G.

2014-02-01

This report summarizes the work performed under the project "Statistically significant relational data mining." The goal of the project was to add more statistical rigor to the fairly ad hoc area of data mining on graphs. Our goal was to develop better algorithms and better ways to evaluate algorithm quality. We concentrated on algorithms for community detection, approximate pattern matching, and graph similarity measures. Approximate pattern matching involves finding an instance of a relatively small pattern, expressed with tolerance, in a large graph of data observed with uncertainty. This report gathers the abstracts and references for the eight refereed publications that have appeared as part of this work. We then archive three pieces of research that have not yet been published. The first is theoretical and experimental evidence that a popular statistical measure for comparison of community assignments favors over-resolved communities over approximations to a ground truth. The second is a set of statistically motivated methods for measuring the quality of an approximate match of a small pattern in a large graph. The third is a new probabilistic random graph model. Statisticians favor these models for graph analysis. The new local structure graph model overcomes some of the issues with popular models such as exponential random graph models and latent variable models.
Statistical data fusion for cross-tabulation

NARCIS (Netherlands)

Kamakura, W.A.; Wedel, M.

The authors address the situation in which a researcher wants to cross-tabulate two sets of discrete variables collected in independent samples, but a subset of the variables is common to both samples. The authors propose a statistical data-fusion model that allows for statistical tests of
Statistical Data Editing in Scientific Articles.

Science.gov (United States)

Habibzadeh, Farrokh

2017-07-01

Scientific journals are important scholarly forums for sharing research findings. Editors have important roles in safeguarding standards of scientific publication and should be familiar with correct presentation of results, among other core competencies. Editors do not have access to the raw data and should thus rely on clues in the submitted manuscripts. To identify probable errors, they should look for inconsistencies in presented results. Common statistical problems that can be picked up by a knowledgeable manuscript editor are discussed in this article. Manuscripts should contain a detailed section on statistical analyses of the data. Numbers should be reported with appropriate precisions. Standard error of the mean (SEM) should not be reported as an index of data dispersion. Mean (standard deviation [SD]) and median (interquartile range [IQR]) should be used for description of normally and non-normally distributed data, respectively. If possible, it is better to report 95% confidence interval (CI) for statistics, at least for main outcome variables. And, P values should be presented, and interpreted with caution, if there is a hypothesis. To advance knowledge and skills of their members, associations of journal editors are better to develop training courses on basic statistics and research methodology for non-experts. This would in turn improve research reporting and safeguard the body of scientific evidence. © 2017 The Korean Academy of Medical Sciences.
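The reporting advice above can be made concrete with a short sketch. It computes mean (SD), median (IQR), the SEM, and a 95% CI for the mean using only the Python standard library. The function name `describe` and the normal-approximation multiplier 1.96 are assumptions made for illustration; for small samples a t-distribution quantile would be the more usual choice.

```python
import math
import statistics


def describe(values, z=1.96):
    """Summarise a numeric sample in the forms recommended for reporting.

    z=1.96 is an assumed normal-approximation multiplier for a 95% CI.
    """
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)   # sample SD: describes dispersion of the data
    sem = sd / math.sqrt(n)         # SEM: precision of the mean, not dispersion
    # Quartiles using the default "exclusive" method of statistics.quantiles
    q1, median, q3 = statistics.quantiles(values, n=4)
    ci = (mean - z * sem, mean + z * sem)
    return {
        "mean": mean,
        "sd": sd,
        "sem": sem,
        "median": median,
        "iqr": (q1, q3),
        "ci95": ci,
    }


summary = describe([2, 4, 4, 4, 5, 5, 7, 9])
print(f"mean (SD): {summary['mean']:.2f} ({summary['sd']:.2f})")
print(f"median (IQR): {summary['median']} ({summary['iqr'][0]} to {summary['iqr'][1]})")
print(f"95% CI for the mean: {summary['ci95'][0]:.2f} to {summary['ci95'][1]:.2f}")
```

The SEM is always smaller than the SD by a factor of the square root of n, which is why reporting it as a measure of spread understates the variability in the data.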
Statistical Analysis of Research Data | Center for Cancer Research

Science.gov (United States)

Recent advances in cancer biology have resulted in the need for increased statistical analysis of research data. The Statistical Analysis of Research Data (SARD) course will be held on April 5-6, 2018 from 9 a.m.-5 p.m. at the National Institutes of Health's Natcher Conference Center, Balcony C on the Bethesda Campus. SARD is designed to provide an overview on the general principles of statistical analysis of research data. The first day will feature univariate data analysis, including descriptive statistics, probability distributions, one- and two-sample inferential statistics.
Statistical analysis of next generation sequencing data

CERN Document Server

Nettleton, Dan

2014-01-01

Next Generation Sequencing (NGS) is the latest high throughput technology to revolutionize genomic research. NGS generates massive genomic datasets that play a key role in the big data phenomenon that surrounds us today. To extract signals from high-dimensional NGS data and make valid statistical inferences and predictions, novel data analytic and statistical techniques are needed. This book contains 20 chapters written by prominent statisticians working with NGS data. The topics range from basic preprocessing and analysis with NGS data to more complex genomic applications such as copy number variation and isoform expression detection. Research statisticians who want to learn about this growing and exciting area will find this book useful. In addition, many chapters from this book could be included in graduate-level classes in statistical bioinformatics for training future biostatisticians who will be expected to deal with genomic data in basic biomedical research, genomic clinical trials and personalized med...
Collecting operational event data for statistical analysis

International Nuclear Information System (INIS)

Atwood, C.L.

1994-09-01

This report gives guidance for collecting operational data to be used for statistical analysis, especially analysis of event counts. It discusses how to define the purpose of the study, the unit (system, component, etc.) to be studied, events to be counted, and demand or exposure time. Examples are given of classification systems for events in the data sources. A checklist summarizes the essential steps in data collection for statistical analysis
Statistical Analysis of Big Data on Pharmacogenomics

Science.gov (United States)

Fan, Jianqing; Liu, Han

2013-01-01

This paper discusses statistical methods for estimating complex correlation structure from large pharmacogenomic datasets. We selectively review several prominent statistical methods for estimating large covariance matrix for understanding correlation structure, inverse covariance matrix for network modeling, large-scale simultaneous tests for selecting significantly differently expressed genes and proteins and genetic markers for complex diseases, and high dimensional variable selection for identifying important molecules for understanding molecule mechanisms in pharmacogenomics. Their applications to gene network estimation and biomarker selection are used to illustrate the methodological power. Several new challenges of Big data analysis, including complex data distribution, missing data, measurement error, spurious correlation, endogeneity, and the need for robust statistical methods, are also discussed. PMID:23602905
Advances in statistical models for data analysis

CERN Document Server

Minerva, Tommaso; Vichi, Maurizio

2015-01-01

This edited volume focuses on recent research results in classification, multivariate statistics and machine learning and highlights advances in statistical models for data analysis. The volume provides both methodological developments and contributions to a wide range of application areas such as economics, marketing, education, social sciences and environment. The papers in this volume were first presented at the 9th biannual meeting of the Classification and Data Analysis Group (CLADAG) of the Italian Statistical Society, held in September 2013 at the University of Modena and Reggio Emilia, Italy.
Statistical analysis of environmental data

International Nuclear Information System (INIS)

Beauchamp, J.J.; Bowman, K.O.; Miller, F.L. Jr.

1975-10-01

This report summarizes the analyses of data obtained by the Radiological Hygiene Branch of the Tennessee Valley Authority from samples taken around the Browns Ferry Nuclear Plant located in Northern Alabama. The data collection was begun in 1968 and a wide variety of types of samples have been gathered on a regular basis. The statistical analysis of environmental data involving very low-levels of radioactivity is discussed. Applications of computer calculations for data processing are described
Tourette Syndrome (TS): Data and Statistics

Science.gov (United States)

... The data ... Behavioral or conduct problems, 26%; Anxiety problems, 49%; Depression, 25%; Autism spectrum disorder, 35%; Learning disability, 47%; ...
Method for statistical data analysis of multivariate observations

CERN Document Server

Gnanadesikan, R

1997-01-01

A practical guide for multivariate statistical techniques, now updated and revised. In recent years, innovations in computer technology and statistical methodologies have dramatically altered the landscape of multivariate data analysis. This new edition of Methods for Statistical Data Analysis of Multivariate Observations explores current multivariate concepts and techniques while retaining the same practical focus of its predecessor. It integrates methods and data-based interpretations relevant to multivariate analysis in a way that addresses real-world problems arising in many areas of inte
Topology for Statistical Modeling of Petascale Data

Energy Technology Data Exchange (ETDEWEB)

Pascucci, Valerio [Univ. of Utah, Salt Lake City, UT (United States); Levine, Joshua [Univ. of Utah, Salt Lake City, UT (United States); Gyulassy, Attila [Univ. of Utah, Salt Lake City, UT (United States); Bremer, P. -T. [Univ. of Utah, Salt Lake City, UT (United States)

2013-10-31

Many commonly used algorithms for mathematical analysis do not scale well enough to accommodate the size or complexity of petascale data produced by computational simulations. The primary goal of this project is to develop new mathematical tools that address both the petascale size and uncertain nature of current data. At a high level, the approach of the entire team involving all three institutions is based on the complementary techniques of combinatorial topology and statistical modelling. In particular, we use combinatorial topology to filter out spurious data that would otherwise skew statistical modelling techniques, and we employ advanced algorithms from algebraic statistics to efficiently find globally optimal fits to statistical models. The overall technical contributions can be divided loosely into three categories: (1) advances in the field of combinatorial topology, (2) advances in statistical modelling, and (3) new integrated topological and statistical methods. Roughly speaking, the division of labor between our 3 groups (Sandia Labs in Livermore, Texas A&M in College Station, and U Utah in Salt Lake City) is as follows: the Sandia group focuses on statistical methods and their formulation in algebraic terms, and finds the application problems (and data sets) most relevant to this project, the Texas A&M Group develops new algebraic geometry algorithms, in particular with fewnomial theory, and the Utah group develops new algorithms in computational topology via Discrete Morse Theory. However, we hasten to point out that our three groups stay in tight contact via videoconference every 2 weeks, so there is much synergy of ideas between the groups. The remainder of this document focuses on the contributions that had greater direct involvement from the team at the University of Utah in Salt Lake City.
Testing the statistical compatibility of independent data sets

International Nuclear Information System (INIS)

Maltoni, M.; Schwetz, T.

2003-01-01

We discuss a goodness-of-fit method which tests the compatibility between statistically independent data sets. The method gives sensible results even in cases where the χ² minima of the individual data sets are very low or when several parameters are fitted to a large number of data points. In particular, it avoids the problem that a possible disagreement between data sets becomes diluted by data points which are insensitive to the crucial parameters. A formal derivation of the probability distribution function for the proposed test statistics is given, based on standard theorems of statistics. The application of the method is illustrated on data from neutrino oscillation experiments, and its complementarity to the standard goodness-of-fit is discussed.
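The abstract does not spell out the test statistic, so the sketch below only illustrates one way a compatibility test of this kind can be constructed. It is assumed here (not taken from the abstract) that the statistic is the global χ² minimum minus the sum of the individual χ² minima, compared against a χ² distribution whose degrees of freedom are the summed parameter counts of the individual fits minus the number of parameters in the global fit. The function and argument names are hypothetical.

```python
def compatibility_statistic(chi2_min_global, chi2_min_individual,
                            n_params_individual, n_params_global):
    """Return (statistic, degrees_of_freedom) for an assumed compatibility test.

    chi2_min_global      -- chi-square minimum of the combined fit
    chi2_min_individual  -- chi-square minima of each data set fitted alone
    n_params_individual  -- number of parameters each data set is sensitive to
    n_params_global      -- number of parameters in the combined fit
    """
    if len(chi2_min_individual) != len(n_params_individual):
        raise ValueError("one parameter count is needed per data set")
    # Only the *increase* in chi-square caused by forcing common parameters
    # enters, so data points insensitive to those parameters do not dilute it.
    statistic = chi2_min_global - sum(chi2_min_individual)
    dof = sum(n_params_individual) - n_params_global
    return statistic, dof


# Two hypothetical data sets, each sensitive to the same two parameters.
stat, dof = compatibility_statistic(36.5, [10.0, 20.0], [2, 2], 2)
print(stat, dof)
```

Because the individual minima are subtracted, a large number of well-described but uninformative data points leaves the statistic unchanged, which is the dilution problem the abstract refers to.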
Measuring the data universe data integration using statistical data and metadata exchange

CERN Document Server

Stahl, Reinhold

2018-01-01

This richly illustrated book provides an easy-to-read introduction to the challenges of organizing and integrating modern data worlds, explaining the contribution of public statistics and the ISO standard SDMX (Statistical Data and Metadata Exchange). As such, it is a must for data experts as well as those aspiring to become one. Today, exponentially growing data worlds are increasingly determining our professional and private lives. The rapid increase in the amount of globally available data, fueled by search engines and social networks but also by new technical possibilities such as Big Data, offers great opportunities. But whatever the undertaking – driving the block chain revolution or making smart phones even smarter – success will be determined by how well it is possible to integrate, i.e. to collect, link and evaluate, the required data. One crucial factor in this is the introduction of a cross-domain order system in combination with a standardization of the data structure. Using everyday examples, th...
Complex Data Modeling and Computationally Intensive Statistical Methods

CERN Document Server

Mantovan, Pietro

2010-01-01

Recent years have seen the advent and development of many devices able to record and store an ever-increasing amount of complex and high-dimensional data: 3D images generated by medical scanners or satellite remote sensing, DNA microarrays, real-time financial data, system control datasets. The analysis of this data poses new challenging problems and requires the development of novel statistical models and computational methods, fueling many fascinating and fast-growing research areas of modern statistics. The book offers a wide variety of statistical methods and is addressed to statistici

Basic statistical tools in research and data analysis

Directory of Open Access Journals (Sweden)

Zulfiqar Ali

2016-01-01

Statistical methods involved in carrying out a study include planning, designing, collecting data, analysing, drawing meaningful interpretation and reporting of the research findings. The statistical analysis gives meaning to the meaningless numbers, thereby breathing life into lifeless data. The results and inferences are precise only if proper statistical tests are used. This article will try to acquaint the reader with the basic research tools that are utilised while conducting various studies. The article covers a brief outline of the variables, an understanding of quantitative and qualitative variables and the measures of central tendency. An idea of the sample size estimation, power analysis and the statistical errors is given. Finally, there is a summary of parametric and non-parametric tests used for data analysis.
Big Data as a Source for Official Statistics

Directory of Open Access Journals (Sweden)

Daas Piet J.H.

2015-06-01

More and more data are being produced by an increasing number of electronic devices physically surrounding us and on the internet. The large amount of data and the high frequency at which they are produced have resulted in the introduction of the term ‘Big Data’. Because these data reflect many different aspects of our daily lives and because of their abundance and availability, Big Data sources are very interesting from an official statistics point of view. This article discusses the exploration of both opportunities and challenges for official statistics associated with the application of Big Data. Experiences gained with analyses of large amounts of Dutch traffic loop detection records and Dutch social media messages are described to illustrate the topics characteristic of the statistical analysis and use of Big Data.
Kappa statistic for clustered matched-pair data.

Science.gov (United States)

Yang, Zhao; Zhou, Ming

2014-07-10

The kappa statistic is widely used to assess the agreement between two procedures in independent matched-pair data. For matched-pair data collected in clusters, on the basis of the delta method and sampling techniques, we propose a nonparametric variance estimator for the kappa statistic without within-cluster correlation structure or distributional assumptions. The results of an extensive Monte Carlo simulation study demonstrate that the proposed kappa statistic provides consistent estimation and the proposed variance estimator behaves reasonably well for at least a moderately large number of clusters (e.g., K ≥ 50). Compared with the variance estimator ignoring dependence within a cluster, the proposed variance estimator performs better in maintaining the nominal coverage probability when the intra-cluster correlation is fair (ρ ≥ 0.3), with more pronounced improvement when ρ is further increased. To illustrate the practical application of the proposed estimator, we analyze two real data examples of clustered matched-pair data. Copyright © 2014 John Wiley & Sons, Ltd.
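For readers unfamiliar with the statistic itself, the following minimal sketch computes the ordinary (Cohen's) kappa for paired ratings: observed agreement corrected for the agreement expected by chance from the marginal frequencies. It does not implement the clustered variance estimator proposed in the abstract; the function name `cohen_kappa` and the example ratings are hypothetical.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two equally long sequences of categorical ratings."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(ratings_a)
    # Proportion of pairs on which the two procedures agree
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected by chance, from the marginal category frequencies
    count_a = Counter(ratings_a)
    count_b = Counter(ratings_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / n ** 2
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (observed - expected) / (1.0 - expected)


# Ten matched pairs: 4 both positive, 4 both negative, 2 discordant.
procedure_a = ["+", "+", "+", "+", "-", "-", "-", "-", "+", "-"]
procedure_b = ["+", "+", "+", "+", "-", "-", "-", "-", "-", "+"]
print(round(cohen_kappa(procedure_a, procedure_b), 3))
```

In the example the observed agreement is 0.8 and the chance agreement is 0.5, so kappa is 0.6. Treating clustered pairs as if they were independent leaves this point estimate unchanged but can misstate its variance, which is the problem the abstract addresses.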
Proceedings of the Pacific Rim Statistical Conference for Production Engineering : Big Data, Production Engineering and Statistics

CERN Document Server

Jang, Daeheung; Lai, Tze; Lee, Youngjo; Lu, Ying; Ni, Jun; Qian, Peter; Qiu, Peihua; Tiao, George

2018-01-01

This book presents the proceedings of the 2nd Pacific Rim Statistical Conference for Production Engineering: Production Engineering, Big Data and Statistics, which took place at Seoul National University in Seoul, Korea in December, 2016. The papers included discuss a wide range of statistical challenges, methods and applications for big data in production engineering, and introduce recent advances in relevant statistical methods.
Uncertainty analysis with statistically correlated failure data

International Nuclear Information System (INIS)

Modarres, M.; Dezfuli, H.; Roush, M.L.

1987-01-01

Likelihood of occurrence of the top event of a fault tree or sequences of an event tree is estimated from the failure probability of components that constitute the events of the fault/event tree. Component failure probabilities are subject to statistical uncertainties. In addition, there are cases where the failure data are statistically correlated. At present most fault tree calculations are based on uncorrelated component failure data. This chapter describes a methodology for assessing the probability intervals for the top event failure probability of fault trees or frequency of occurrence of event tree sequences when event failure data are statistically correlated. To estimate mean and variance of the top event, a second-order system moment method is presented through Taylor series expansion, which provides an alternative to the normally used Monte Carlo method. For cases where component failure probabilities are statistically correlated, the Taylor expansion terms are treated properly. A moment-matching technique is used to obtain the probability distribution function of the top event through fitting the Johnson S_B distribution. The computer program, CORRELATE, was developed to perform the calculations necessary for the implementation of the method developed. (author)
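The role of the covariance terms in a Taylor-series moment method can be sketched as follows. This is not the CORRELATE program: it is a generic propagation under stated assumptions, with the mean taken to second order and the variance kept to first order for brevity. The function names and the two-component AND-gate example (top event probability p1·p2) are hypothetical.

```python
def taylor_moments(f_at_mean, grad, hess, cov):
    """Approximate the mean and variance of f(p) from a Taylor expansion.

    f_at_mean -- f evaluated at the mean of the inputs
    grad      -- first derivatives of f at the mean
    hess      -- second derivatives of f at the mean (square nested list)
    cov       -- covariance matrix of the inputs (square nested list)
    """
    n = len(grad)
    idx = range(n)
    # Second-order mean: f(mu) + 1/2 * sum_ij H_ij * Cov_ij
    mean = f_at_mean + 0.5 * sum(hess[i][j] * cov[i][j] for i in idx for j in idx)
    # First-order variance: g^T Cov g (includes the off-diagonal covariances)
    var = sum(grad[i] * cov[i][j] * grad[j] for i in idx for j in idx)
    return mean, var


def and_gate_moments(mu1, mu2, sd1, sd2, rho):
    """Top event = component 1 AND component 2, i.e. f(p1, p2) = p1 * p2."""
    c12 = rho * sd1 * sd2
    cov = [[sd1 ** 2, c12], [c12, sd2 ** 2]]
    grad = [mu2, mu1]                  # df/dp1 = p2, df/dp2 = p1
    hess = [[0.0, 1.0], [1.0, 0.0]]    # only the mixed derivative is non-zero
    return taylor_moments(mu1 * mu2, grad, hess, cov)


print(and_gate_moments(0.01, 0.02, 0.002, 0.004, rho=0.0))
print(and_gate_moments(0.01, 0.02, 0.002, 0.004, rho=0.5))
```

For this product the second-order mean is exact, since E[p1·p2] = μ1·μ2 + Cov(p1, p2). A positive correlation raises both the mean and the variance of the top event, which is why treating correlated failure data as uncorrelated can understate the result.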
Challenges in computational statistics and data mining

CERN Document Server

Mielniczuk, Jan

2016-01-01

This volume contains nineteen research papers belonging to the areas of computational statistics, data mining, and their applications. Those papers, all written specifically for this volume, are their authors’ contributions to honour and celebrate Professor Jacek Koronacki on the occasion of his 70th birthday. The book’s related and often interconnected topics represent Jacek Koronacki’s research interests and their evolution. They also clearly indicate how close the areas of computational statistics and data mining are.
Enerdata statistical yearbook. ''the key-data of energy worldwide''. 1999 data

International Nuclear Information System (INIS)

2000-01-01

The new edition of the Enerdata statistical yearbook provides the most recent statistical data on energy (oil, gas, coal and power production) and CO2 emissions worldwide for the 1994-1999 period of time. These data cover 52 countries and 12 geographic areas and are presented in the form of tables and graphs (production, foreign exchanges, consumptions, market shares, sectoral consumption, 1999 energy status, long-term tendencies). More data for a longer period (1970-1999) and for all countries worldwide are available on the CD-Rom version of the yearbook. (J.S.)
Application of Ontology Technology in Health Statistic Data Analysis.

Science.gov (United States)

Guo, Minjiang; Hu, Hongpu; Lei, Xingyun

2017-01-01

Research purpose: to establish a health management ontology for the analysis of health statistical data. Proposed methods: this paper established a health management ontology based on an analysis of the concepts in the China Health Statistics Yearbook, and used Protégé to define the syntactic and semantic structure of health statistical data. Six classes of top-level ontology concepts and their subclasses were extracted, and the object properties and data properties were defined to establish the construction of these classes. By ontology instantiation, we can integrate multi-source heterogeneous data and enable administrators to have an overall understanding and analysis of the health statistical data. Ontology technology provides a comprehensive and unified information integration structure for the health management domain and lays a foundation for the efficient analysis of multi-source and heterogeneous health system management data and the enhancement of management efficiency.
Analysis of Preference Data Using Intermediate Test Statistic Abstract

African Journals Online (AJOL)

PROF. O. E. OSUAGWU

2013-06-01

Jun 1, 2013 ... West African Journal of Industrial and Academic Research Vol.7 No. 1 June ... Keywords:-Preference data, Friedman statistic, multinomial test statistic, intermediate test statistic. ... new method and consequently a new statistic ...
Statistical treatment of fatigue test data

International Nuclear Information System (INIS)

Raske, D.T.

1980-01-01

This report discusses several aspects of fatigue data analysis in order to provide a basis for the development of statistically sound design curves. Included is a discussion on the choice of the dependent variable, the assumptions associated with least squares regression models, the variability of fatigue data, the treatment of data from suspended tests and outlying observations, and various strain-life relations.
HistFitter software framework for statistical data analysis

Energy Technology Data Exchange (ETDEWEB)

Baak, M. [CERN, Geneva (Switzerland); Besjes, G.J. [Radboud University Nijmegen, Nijmegen (Netherlands); Nikhef, Amsterdam (Netherlands); Cote, D. [University of Texas, Arlington (United States); Koutsman, A. [TRIUMF, Vancouver (Canada); Lorenz, J. [Ludwig-Maximilians-Universitaet Muenchen, Munich (Germany); Excellence Cluster Universe, Garching (Germany); Short, D. [University of Oxford, Oxford (United Kingdom)

2015-04-15

We present a software framework for statistical data analysis, called HistFitter, that has been used extensively by the ATLAS Collaboration to analyze big datasets originating from proton-proton collisions at the Large Hadron Collider at CERN. Since 2012 HistFitter has been the standard statistical tool in searches for supersymmetric particles performed by ATLAS. HistFitter is a programmable and flexible framework to build, book-keep, fit, interpret and present results of data models of nearly arbitrary complexity. Starting from an object-oriented configuration, defined by users, the framework builds probability density functions that are automatically fit to data and interpreted with statistical tests. Internally HistFitter uses the statistics packages RooStats and HistFactory. A key innovation of HistFitter is its design, which is rooted in analysis strategies of particle physics. The concepts of control, signal and validation regions are woven into its fabric. These are progressively treated with statistically rigorous built-in methods. Being capable of working with multiple models at once that describe the data, HistFitter introduces an additional level of abstraction that allows for easy bookkeeping, manipulation and testing of large collections of signal hypotheses. Finally, HistFitter provides a collection of tools to present results with publication quality style through a simple command-line interface. (orig.)
HistFitter software framework for statistical data analysis

International Nuclear Information System (INIS)

Baak, M.; Besjes, G.J.; Cote, D.; Koutsman, A.; Lorenz, J.; Short, D.

2015-01-01

We present a software framework for statistical data analysis, called HistFitter, that has been used extensively by the ATLAS Collaboration to analyze big datasets originating from proton-proton collisions at the Large Hadron Collider at CERN. Since 2012 HistFitter has been the standard statistical tool in searches for supersymmetric particles performed by ATLAS. HistFitter is a programmable and flexible framework to build, book-keep, fit, interpret and present results of data models of nearly arbitrary complexity. Starting from an object-oriented configuration, defined by users, the framework builds probability density functions that are automatically fit to data and interpreted with statistical tests. Internally HistFitter uses the statistics packages RooStats and HistFactory. A key innovation of HistFitter is its design, which is rooted in analysis strategies of particle physics. The concepts of control, signal and validation regions are woven into its fabric. These are progressively treated with statistically rigorous built-in methods. Being capable of working with multiple models at once that describe the data, HistFitter introduces an additional level of abstraction that allows for easy bookkeeping, manipulation and testing of large collections of signal hypotheses. Finally, HistFitter provides a collection of tools to present results with publication quality style through a simple command-line interface. (orig.)
Estimation of global network statistics from incomplete data.

Directory of Open Access Journals (Sweden)

Catherine A Bliss

Complex networks underlie an enormous variety of social, biological, physical, and virtual systems. A profound complication for the science of complex networks is that in most cases, observing all nodes and all network interactions is impossible. Previous work addressing the impacts of partial network data is surprisingly limited, focuses primarily on missing nodes, and suggests that network statistics derived from subsampled data are not suitable estimators for the same network statistics describing the overall network topology. We generate scaling methods to predict true network statistics, including the degree distribution, from only partial knowledge of nodes, links, or weights. Our methods are transparent and do not assume a known generating process for the network, thus enabling prediction of network statistics for a wide variety of applications. We validate analytical results on four simulated network classes and empirical data sets of various sizes. We perform subsampling experiments by varying proportions of sampled data and demonstrate that our scaling methods can provide very good estimates of true network statistics while acknowledging limits. Lastly, we apply our techniques to a set of rich and evolving large-scale social networks, Twitter reply networks. Based on 100 million tweets, we use our scaling techniques to propose a statistical characterization of the Twitter Interactome from September 2008 to November 2008. Our treatment allows us to find support for Dunbar's hypothesis in detecting an upper threshold for the number of active social contacts that individuals maintain over the course of one week.
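The idea of rescaling subsampled statistics can be illustrated with the simplest case. Assume (this is an illustration, not the authors' estimators) that each node is observed independently with known probability q and that only links between two observed nodes are seen. Then the expected number of observed nodes is qN and the expected number of observed links is q²M, which suggests the estimators below. The function name and the example numbers are hypothetical.

```python
def rescale_node_sample(n_observed_nodes, n_observed_links, q):
    """Estimate full-network statistics from a uniform node sample.

    Assumes each node is kept with probability q and a link is seen only
    when both of its endpoints are kept (induced subgraph).
    """
    if not 0.0 < q <= 1.0:
        raise ValueError("q must lie in (0, 1]")
    nodes_hat = n_observed_nodes / q        # E[observed nodes] = q * N
    links_hat = n_observed_links / q ** 2   # E[observed links] = q^2 * M
    mean_degree_hat = 2.0 * links_hat / nodes_hat
    return {"nodes": nodes_hat, "links": links_hat, "mean_degree": mean_degree_hat}


# A 25% node sample with 250 observed nodes and 500 observed links.
estimate = rescale_node_sample(250, 500, q=0.25)
print(estimate)
```

Under these assumptions the mean degree of the sampled subgraph (4 in the example) underestimates the mean degree of the full network by a factor of q, so using the subsampled value directly would be biased, which is consistent with the abstract's remark that statistics from subsampled data are not suitable estimators as they stand.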
Gas, electricity, coal: 1998 statistical data

International Nuclear Information System (INIS)

1999-01-01

This document brings together the main statistical data from the French direction of gas, electricity and coal and presents a selection of the most significant numbered data: origin of production, share of the consumption, price levels, resources-employment status. These data are presented in a synthetic and accessible way in order to make useful references for the actors of the energy sector. (J.S.)
Statistical Literacy in the Data Science Workplace

Science.gov (United States)

Grant, Robert

2017-01-01

Statistical literacy, the ability to understand and make use of statistical information including methods, has particular relevance in the age of data science, when complex analyses are undertaken by teams from diverse backgrounds. Not only is it essential to communicate to the consumers of information but also within the team. Writing from the…
HistFitter software framework for statistical data analysis

CERN Document Server

Baak, M.; Côte, D.; Koutsman, A.; Lorenz, J.; Short, D.

2015-01-01

We present a software framework for statistical data analysis, called HistFitter, that has been used extensively by the ATLAS Collaboration to analyze big datasets originating from proton-proton collisions at the Large Hadron Collider at CERN. Since 2012 HistFitter has been the standard statistical tool in searches for supersymmetric particles performed by ATLAS. HistFitter is a programmable and flexible framework to build, book-keep, fit, interpret and present results of data models of nearly arbitrary complexity. Starting from an object-oriented configuration, defined by users, the framework builds probability density functions that are automatically fitted to data and interpreted with statistical tests. A key innovation of HistFitter is its design, which is rooted in core analysis strategies of particle physics. The concepts of control, signal and validation regions are woven into its very fabric. These are progressively treated with statistically rigorous built-in methods. Being capable of working with mu...
Statistical and Machine-Learning Data Mining Techniques for Better Predictive Modeling and Analysis of Big Data

CERN Document Server

Ratner, Bruce

2011-01-01

The second edition of a bestseller, Statistical and Machine-Learning Data Mining: Techniques for Better Predictive Modeling and Analysis of Big Data is still the only book, to date, to distinguish between statistical data mining and machine-learning data mining. The first edition, titled Statistical Modeling and Analysis for Database Marketing: Effective Techniques for Mining Big Data, contained 17 chapters of innovative and practical statistical data mining techniques. In this second edition, renamed to reflect the increased coverage of machine-learning data mining techniques, the author has
Fetal Alcohol Spectrum Disorders (FASDs): Data and Statistics

Science.gov (United States)

... alcohol screening and counseling for all women ... Prevalence of ... conducted annually by the National Center for Health Statistics (NCHS), CDC, to produce national estimates for a ...
Statistical data processing with automatic system for environmental radiation monitoring

International Nuclear Information System (INIS)

Zarkh, V.G.; Ostroglyadov, S.V.

1986-01-01

The practice of statistical data processing for radiation monitoring is illustrated, and some of the results obtained are presented. Experience in applying the methods of mathematical statistics to radiation monitoring data made it possible to develop a concrete statistical processing algorithm, implemented on the M-6000 minicomputer. The algorithm is divided into three parts: parametric data processing and hypothesis testing, pair correlation analysis, and multiple correlation analysis. The statistical processing programs run in interactive (dialogue) mode. The algorithm was used to process data observed over a radioactive waste disposal control region. Results of processing surface water monitoring data are presented.
Statistics and Data Interpretation for Social Work

CERN Document Server

Rosenthal, James

2011-01-01

"Without question, this text will be the most authoritative source of information on statistics in the human services. From my point of view, it is a definitive work that combines a rigorous pedagogy with a down to earth (commonsense) exploration of the complex and difficult issues in data analysis (statistics) and interpretation. I welcome its publication.". -Praise for the First Edition. Written by a social worker for social work students, this is a nuts and bolts guide to statistics that presents complex calculations and concepts in clear, easy-to-understand language. It includes

Workshop statistics discovery with data and Minitab

CERN Document Server

Rossman, Allan J

1998-01-01

Shorn of all subtlety and led naked out of the protective fold of educational research literature, there comes a sheepish little fact: lectures don't work nearly as well as many of us would like to think. -George Cobb (1992) This book contains activities that guide students to discover statistical concepts, explore statistical principles, and apply statistical techniques. Students work toward these goals through the analysis of genuine data and through interaction with one another, with their instructor, and with technology. Providing a one-semester introduction to fundamental ideas of statistics for college and advanced high school students, Workshop Statistics is designed for courses that employ an interactive learning environment by replacing lectures with hands-on activities. The text contains enough expository material to stand alone, but it can also be used to supplement a more traditional textbook. Some distinguishing features of Workshop Statistics are its emphases on active learning, conceptu...
Development of statistical analysis code for meteorological data (W-View)

International Nuclear Information System (INIS)

Tachibana, Haruo; Sekita, Tsutomu; Yamaguchi, Takenori

2003-03-01

A computer code (W-View: Weather View) was developed to analyze the meteorological data statistically based on 'the guideline of meteorological statistics for the safety analysis of nuclear power reactor' (Nuclear Safety Commission on January 28, 1982; revised on March 29, 2001). The code gives statistical meteorological data to assess the public dose in case of normal operation and severe accident to get the license of nuclear reactor operation. This code was revised from the original code used in a large office computer code to enable a personal computer user to analyze the meteorological data simply and conveniently and to make the statistical data tables and figures of meteorology. (author)
Application of descriptive statistics in analysis of experimental data

OpenAIRE

Mirilović Milorad; Pejin Ivana

2008-01-01

Statistics today represent a group of scientific methods for the quantitative and qualitative investigation of variations in mass appearances. In fact, statistics present a group of methods that are used for the accumulation, analysis, presentation and interpretation of data necessary for reaching certain conclusions. Statistical analysis is divided into descriptive statistical analysis and inferential statistics. The values which represent the results of an experiment, and which are the subj...
Development of statistical analysis code for meteorological data (W-View)

Energy Technology Data Exchange (ETDEWEB)

Tachibana, Haruo; Sekita, Tsutomu; Yamaguchi, Takenori [Japan Atomic Energy Research Inst., Tokai, Ibaraki (Japan). Tokai Research Establishment

2003-03-01

A computer code (W-View: Weather View) was developed to analyze the meteorological data statistically based on 'the guideline of meteorological statistics for the safety analysis of nuclear power reactor' (Nuclear Safety Commission on January 28, 1982; revised on March 29, 2001). The code gives statistical meteorological data to assess the public dose in case of normal operation and severe accident to get the license of nuclear reactor operation. This code was revised from the original code used in a large office computer code to enable a personal computer user to analyze the meteorological data simply and conveniently and to make the statistical data tables and figures of meteorology. (author)
The value of statistical tools to detect data fabrication

NARCIS (Netherlands)

Hartgerink, C.H.J.; Wicherts, J.M.; van Assen, M.A.L.M.

2016-01-01

We aim to investigate how statistical tools can help detect potential data fabrication in the social and medical sciences. In this proposal we outline three projects to assess the value of such statistical tools to detect potential data fabrication and make the first steps in order to apply them
Topology for Statistical Modeling of Petascale Data

Energy Technology Data Exchange (ETDEWEB)

Bennett, Janine Camille [Sandia National Lab. (SNL-CA), Livermore, CA (United States); Pebay, Philippe Pierre [Sandia National Lab. (SNL-CA), Livermore, CA (United States); Pascucci, Valerio [Univ. of Utah, Salt Lake City, UT (United States); Levine, Joshua [Univ. of Utah, Salt Lake City, UT (United States); Gyulassy, Attila [Univ. of Utah, Salt Lake City, UT (United States); Rojas, Maurice [Texas A & M Univ., College Station, TX (United States)

2014-07-01

This document presents current technical progress and dissemination of results for the Mathematics for Analysis of Petascale Data (MAPD) project titled "Topology for Statistical Modeling of Petascale Data", funded by the Office of Science Advanced Scientific Computing Research (ASCR) Applied Math program.
Statistical processing of technological and radiochemical data

International Nuclear Information System (INIS)

Lahodova, Zdena; Vonkova, Kateřina

2011-01-01

The project described in this article had two goals. The main goal was to compare technological and radiochemical data from two units of a nuclear power plant. The other goal was to check the collection, organization and interpretation of routinely measured data. Monitoring of analytical and radiochemical data is a very valuable source of knowledge for some processes in the primary circuit. Exploratory analysis of one-dimensional data was performed to estimate location and variability and to find extreme values, data trends, distribution, autocorrelation etc. This process allowed for the cleaning and completion of raw data. Then multiple analyses such as multiple comparisons, multiple correlation, variance analysis, and so on were performed. Measured data was organized into a data matrix. The results and graphs such as Box plots, Mahalanobis distance, Biplot, Correlation, and Trend graphs are presented in this article as statistical analysis tools. Tables of data were replaced with graphs because graphs condense large amounts of information into easy-to-understand formats. The significant conclusion of this work is that the collection and comprehension of data is a very substantial part of statistical processing. With well-prepared and well-understood data, its accurate evaluation is possible. Cooperation between the technicians who collect data and the statistician who processes it is also very important. (author)
STATCAT, Statistical Analysis of Parametric and Non-Parametric Data

International Nuclear Information System (INIS)

David, Hugh

1990-01-01

1 - Description of program or function: A suite of 26 programs designed to facilitate the appropriate statistical analysis and data handling of parametric and non-parametric data, using classical and modern univariate and multivariate methods. 2 - Method of solution: Data is read entry by entry, using a choice of input formats, and the resultant data bank is checked for out-of-range, rare, extreme or missing data. The completed STATCAT data bank can be treated by a variety of descriptive and inferential statistical methods, and modified, using other standard programs as required.
Using Data from Climate Science to Teach Introductory Statistics

Science.gov (United States)

Witt, Gary

2013-01-01

This paper shows how the application of simple statistical methods can reveal to students important insights from climate data. While the popular press is filled with contradictory opinions about climate science, teachers can encourage students to use introductory-level statistics to analyze data for themselves on this important issue in public…
Symbolic Data Analysis Conceptual Statistics and Data Mining

CERN Document Server

Billard, Lynne

2012-01-01

With the advent of computers, very large datasets have become routine. Standard statistical methods don't have the power or flexibility to analyse these efficiently, and extract the required knowledge. An alternative approach is to summarize a large dataset in such a way that the resulting summary dataset is of a manageable size and yet retains as much of the knowledge in the original dataset as possible. One consequence of this is that the data may no longer be formatted as single values, but be represented by lists, intervals, distributions, etc. The summarized data have their own internal s
Feature-Based Statistical Analysis of Combustion Simulation Data

Energy Technology Data Exchange (ETDEWEB)

Bennett, J; Krishnamoorthy, V; Liu, S; Grout, R; Hawkes, E; Chen, J; Pascucci, V; Bremer, P T

2011-11-18

We present a new framework for feature-based statistical analysis of large-scale scientific data and demonstrate its effectiveness by analyzing features from Direct Numerical Simulations (DNS) of turbulent combustion. Turbulent flows are ubiquitous and account for transport and mixing processes in combustion, astrophysics, fusion, and climate modeling among other disciplines. They are also characterized by coherent structure or organized motion, i.e. nonlocal entities whose geometrical features can directly impact molecular mixing and reactive processes. While traditional multi-point statistics provide correlative information, they lack nonlocal structural information, and hence, fail to provide mechanistic causality information between organized fluid motion and mixing and reactive processes. Hence, it is of great interest to capture and track flow features and their statistics together with their correlation with relevant scalar quantities, e.g. temperature or species concentrations. In our approach we encode the set of all possible flow features by pre-computing merge trees augmented with attributes, such as statistical moments of various scalar fields, e.g. temperature, as well as length-scales computed via spectral analysis. The computation is performed in an efficient streaming manner in a pre-processing step and results in a collection of meta-data that is orders of magnitude smaller than the original simulation data. This meta-data is sufficient to support a fully flexible and interactive analysis of the features, allowing for arbitrary thresholds, providing per-feature statistics, and creating various global diagnostics such as Cumulative Density Functions (CDFs), histograms, or time-series. We combine the analysis with a rendering of the features in a linked-view browser that enables scientists to interactively explore, visualize, and analyze the equivalent of one terabyte of simulation data. We highlight the utility of this new framework for combustion
Vapor Pressure Data Analysis and Statistics

Science.gov (United States)

2016-12-01

near 8, 2000, and 200, respectively. The A (or a) value is directly related to vapor pressure and will be greater for high vapor pressure materials...1, (10) where n is the number of data points, Yi is the natural logarithm of the i th experimental vapor pressure value, and Xi is the...VAPOR PRESSURE DATA ANALYSIS AND STATISTICS ECBC-TR-1422 Ann Brozena RESEARCH AND TECHNOLOGY DIRECTORATE
Data management and statistical analysis for environmental assessment

International Nuclear Information System (INIS)

Wendelberger, J.R.; McVittie, T.I.

1995-01-01

Data management and statistical analysis for environmental assessment are important issues on the interface of computer science and statistics. Data collection for environmental decision making can generate large quantities of various types of data. A database/GIS system developed is described which provides efficient data storage as well as visualization tools which may be integrated into the data analysis process. FIMAD is a living database and GIS system. The system has changed and developed over time to meet the needs of the Los Alamos National Laboratory Restoration Program. The system provides a repository for data which may be accessed by different individuals for different purposes. The database structure is driven by the large amount and varied types of data required for environmental assessment. The integration of the database with the GIS system provides the foundation for powerful visualization and analysis capabilities
Numeric computation and statistical data analysis on the Java platform

CERN Document Server

Chekanov, Sergei V

2016-01-01

Numerical computation, knowledge discovery and statistical data analysis integrated with powerful 2D and 3D graphics for visualization are the key topics of this book. The Python code examples powered by the Java platform can easily be transformed to other programming languages, such as Java, Groovy, Ruby and BeanShell. This book equips the reader with a computational platform which, unlike other statistical programs, is not limited by a single programming language. The author focuses on practical programming aspects and covers a broad range of topics, from basic introduction to the Python language on the Java platform (Jython), to descriptive statistics, symbolic calculations, neural networks, non-linear regression analysis and many other data-mining topics. He discusses how to find regularities in real-world data, how to classify data, and how to process data for knowledge discoveries. The code snippets are so short that they easily fit into single pages. Numeric Computation and Statistical Data Analysis ...
Experimental uncertainty estimation and statistics for data having interval uncertainty.

Energy Technology Data Exchange (ETDEWEB)

Kreinovich, Vladik (Applied Biomathematics, Setauket, New York); Oberkampf, William Louis (Applied Biomathematics, Setauket, New York); Ginzburg, Lev (Applied Biomathematics, Setauket, New York); Ferson, Scott (Applied Biomathematics, Setauket, New York); Hajagos, Janos (Applied Biomathematics, Setauket, New York)

2007-05-01

This report addresses the characterization of measurements that include epistemic uncertainties in the form of intervals. It reviews the application of basic descriptive statistics to data sets which contain intervals rather than exclusively point estimates. It describes algorithms to compute various means, the median and other percentiles, variance, interquartile range, moments, confidence limits, and other important statistics and summarizes the computability of these statistics as a function of sample size and characteristics of the intervals in the data (degree of overlap, size and regularity of widths, etc.). It also reviews the prospects for analyzing such data sets with the methods of inferential statistics such as outlier detection and regressions. The report explores the tradeoff between measurement precision and sample size in statistical results that are sensitive to both. It also argues that an approach based on interval statistics could be a reasonable alternative to current standard methods for evaluating, expressing and propagating measurement uncertainties.
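As a small illustration of descriptive statistics on interval data, the sketch below bounds the mean and the median of a set of intervals. It is not taken from the report: it relies only on the fact that both statistics are monotone in every data value, so their extremes are reached at the interval endpoints. The function names and the data are hypothetical.

```python
from statistics import median


def interval_mean(intervals):
    """Return (lowest, highest) possible mean of interval-valued data.

    The mean increases with every data value, so its bounds are the
    mean of the lower endpoints and the mean of the upper endpoints.
    """
    lows = [lo for lo, _ in intervals]
    highs = [hi for _, hi in intervals]
    return sum(lows) / len(lows), sum(highs) / len(highs)


def interval_median(intervals):
    """Return (lowest, highest) possible median, by the same monotonicity argument."""
    return (
        median(lo for lo, _ in intervals),
        median(hi for _, hi in intervals),
    )


# Hypothetical measurements; (4.0, 4.0) is an ordinary point value.
DATA = [(1.0, 2.0), (1.5, 3.5), (4.0, 4.0), (2.0, 6.0)]
```

The same endpoint shortcut does not work for the variance, because the variance is not monotone in the individual values. That is one reason the computability of each statistic has to be examined separately.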
Statistical and Visualization Data Mining Tools for Foundry Production

Directory of Open Access Journals (Sweden)

M. Perzyk

2007-07-01

In recent years, a new interdisciplinary field called data mining has developed rapidly. Its main task is extracting useful information from large amounts of previously collected data. The main possibilities and potential applications of data mining in the manufacturing industry are characterized. The main types of data mining techniques are briefly discussed, including statistical, artificial intelligence, database and visualization tools. The statistical and visualization methods are presented in more detail, showing their general possibilities and advantages as well as characteristic examples of applications in foundry production. Results of the author's research are presented, aimed at validating selected statistical tools that can be used easily and effectively in the manufacturing industry. A performance analysis has been made of methods based on ANOVA and contingency tables, intended to determine the most significant process parameters and to detect possible interactions among them. Several numerical tests have been performed using simulated data sets with assumed hidden relationships, as well as some real data related to the strength of ductile cast iron, collected in a foundry. It is concluded that the statistical methods offer relatively easy and fairly reliable tools for extracting this type of knowledge about foundry manufacturing processes. However, further research is needed to explain some imperfections of the investigated tools and to assess their validity for more complex tasks.
Register-based statistics statistical methods for administrative data

CERN Document Server

Wallgren, Anders

2014-01-01

This book provides a comprehensive and up to date treatment of theory and practical implementation in Register-based statistics. It begins by defining the area, before explaining how to structure such systems, as well as detailing alternative approaches. It explains how to create statistical registers, how to implement quality assurance, and the use of IT systems for register-based statistics. Further to this, clear details are given about the practicalities of implementing such statistical methods, such as protection of privacy and the coordination and coherence of such an undertaking. Thi
Using Facebook Data to Turn Introductory Statistics Students into Consultants

Science.gov (United States)

Childers, Adam F.

2017-01-01

Facebook provides businesses and organizations with copious data that describe how users are interacting with their page. This data affords an excellent opportunity to turn introductory statistics students into consultants to analyze the Facebook data using descriptive and inferential statistics. This paper details a semester-long project that…
Simple statistical methods for software engineering data and patterns

CERN Document Server

Pandian, C Ravindranath

2015-01-01

Although there are countless books on statistics, few are dedicated to the application of statistical methods to software engineering. Simple Statistical Methods for Software Engineering: Data and Patterns fills that void. Instead of delving into overly complex statistics, the book details simpler solutions that are just as effective and connect with the intuition of problem solvers. Sharing valuable insights into software engineering problems and solutions, the book not only explains the required statistical methods, but also provides many examples, review questions, and case studies that prov
Statistical Data Processing with R – Metadata Driven Approach

Directory of Open Access Journals (Sweden)

Rudi SELJAK

2016-06-01

In recent years the Statistical Office of the Republic of Slovenia has put a lot of effort into re-designing its statistical process. We replaced the classical stove-pipe oriented production system with general software solutions, based on the metadata driven approach. This means that one general program code, which is parametrized with process metadata, is used for data processing for a particular survey. Currently, the general program code is entirely based on SAS macros, but in the future we would like to explore how successfully the statistical software R can be used for this approach. The paper describes the metadata driven principle for data validation, the generic software solution, and the main issues connected with the use of the statistical software R for this approach.
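The metadata driven principle (one general program, parametrized per survey) can be sketched in a few lines. The example is written in Python for illustration only; it is not the office's SAS or R code, and the rule keys (`required`, `min`, `max`, `allowed`), field names and region codes are all hypothetical.

```python
def validate(records, rules):
    """Check records against metadata-defined rules.

    `rules` maps a field name to a dict of constraints. The same function
    serves any survey; only the metadata passed in changes.
    Returns a list of (record_index, field, message) tuples.
    """
    errors = []
    for index, record in enumerate(records):
        for field, rule in rules.items():
            value = record.get(field)
            if value is None:
                if rule.get("required", False):
                    errors.append((index, field, "missing"))
                continue  # nothing further to check on an absent value
            if "min" in rule and value < rule["min"]:
                errors.append((index, field, "below minimum"))
            if "max" in rule and value > rule["max"]:
                errors.append((index, field, "above maximum"))
            if "allowed" in rule and value not in rule["allowed"]:
                errors.append((index, field, "not an allowed value"))
    return errors


# Hypothetical process metadata for one survey.
SURVEY_RULES = {
    "employees": {"required": True, "min": 0},
    "region": {"allowed": {"R1", "R2"}},
}

# Hypothetical survey records.
SURVEY_RECORDS = [
    {"employees": 12, "region": "R1"},
    {"employees": -3, "region": "XX"},
    {"region": "R2"},
]
```

Adding a new survey then means writing a new rules table, not new validation code, which is the practical appeal of the approach.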

Statistical Data Analyses of Trace Chemical, Biochemical, and Physical Analytical Signatures

Energy Technology Data Exchange (ETDEWEB)

Udey, Ruth Norma [Michigan State Univ., East Lansing, MI (United States)

2013-01-01

Analytical and bioanalytical chemistry measurement results are most meaningful when interpreted using rigorous statistical treatments of the data. The same data set may provide many dimensions of information depending on the questions asked through the applied statistical methods. Three principal projects illustrated the wealth of information gained through the application of statistical data analyses to diverse problems.
Longitudinal data analysis a handbook of modern statistical methods

CERN Document Server

Fitzmaurice, Garrett; Verbeke, Geert; Molenberghs, Geert

2008-01-01

Although many books currently available describe statistical models and methods for analyzing longitudinal data, they do not highlight connections between various research threads in the statistical literature. Responding to this void, Longitudinal Data Analysis provides a clear, comprehensive, and unified overview of state-of-the-art theory and applications. It also focuses on the assorted challenges that arise in analyzing longitudinal data. After discussing historical aspects, leading researchers explore four broad themes: parametric modeling, nonparametric and semiparametric methods, joint
Security of statistical data bases: invasion of privacy through attribute correlational modeling

Energy Technology Data Exchange (ETDEWEB)

Palley, M.A.

1985-01-01

This study develops, defines, and applies a statistical technique for the compromise of confidential information in a statistical data base. Attribute Correlational Modeling (ACM) recognizes that the information contained in a statistical data base represents real world statistical phenomena. As such, ACM assumes correlational behavior among the database attributes. ACM proceeds to compromise confidential information through creation of a regression model, where the confidential attribute is treated as the dependent variable. The typical statistical data base may preclude the direct application of regression. In this scenario, the research introduces the notion of a synthetic data base, created through legitimate queries of the actual data base, and through proportional random variation of responses to these queries. The synthetic data base is constructed to resemble the actual data base as closely as possible in a statistical sense. ACM then applies regression analysis to the synthetic data base, and utilizes the derived model to estimate confidential information in the actual database.
Robust statistics and geochemical data analysis

International Nuclear Information System (INIS)

Di, Z.

1987-01-01

The advantages of robust procedures over ordinary least-squares procedures in geochemical data analysis are demonstrated using NURE data from the Hot Springs Quadrangle, South Dakota, USA. Robust principal components analysis with 5% multivariate trimming successfully guarded the analysis against perturbations by outliers and increased the number of interpretable factors. Regression with SINE estimates significantly increased the goodness-of-fit of the regression and improved the correspondence of delineated anomalies with known uranium prospects. Because outliers are ubiquitous in geochemical data, robust statistical procedures are suggested as routine replacements for ordinary least-squares procedures.
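The contrast between least squares and a robust fit can be sketched with iteratively reweighted least squares. The example assumes that "SINE estimates" refers to Andrews' sine-wave M-estimator, which the abstract does not confirm. The tuning constant 1.339 is a commonly quoted value and is treated here as an assumption, and the data are synthetic. This is a minimal sketch, not the author's procedure.

```python
import math
from statistics import median


def andrews_weight(u, c=1.339):
    """Andrews sine weight: sin(u/c)/(u/c) inside |u| < c*pi, zero outside."""
    if u == 0:
        return 1.0
    if abs(u) >= c * math.pi:
        return 0.0
    return math.sin(u / c) / (u / c)


def weighted_line_fit(xs, ys, ws):
    """Weighted least-squares fit of y = a + b*x; returns (a, b)."""
    sw = sum(ws)
    mx = sum(w * x for w, x in zip(ws, xs)) / sw
    my = sum(w * y for w, y in zip(ws, ys)) / sw
    sxx = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope


def robust_line_fit(xs, ys, iterations=20):
    """Refit repeatedly, down-weighting points with large scaled residuals."""
    ws = [1.0] * len(xs)
    for _ in range(iterations):
        a, b = weighted_line_fit(xs, ys, ws)
        residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
        # Robust scale: median absolute residual, rescaled for normal data.
        scale = median(abs(r) for r in residuals) / 0.6745
        if scale == 0:
            break  # at least half the points are fitted exactly
        ws = [andrews_weight(r / scale) for r in residuals]
    return weighted_line_fit(xs, ys, ws)


# Synthetic data on the line y = 1 + 2x, with one gross outlier at x = 9.
XS = [float(x) for x in range(10)]
YS = [1.0 + 2.0 * x for x in XS]
YS[9] = 100.0
```

On this data the ordinary least-squares slope is pulled well above 2 by the single outlier, while the reweighted fit returns to the underlying line once the outlier's weight drops to zero. Redescending estimators like this one depend on a sensible starting fit, so real data call for more care than this sketch takes.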
Journal data sharing policies and statistical reporting inconsistencies in psychology.

NARCIS (Netherlands)

Nuijten, M.B.; Borghuis, J.; Veldkamp, C.L.S.; Dominguez Alvarez, L.; van Assen, M.A.L.M.; Wicherts, J.M.

2018-01-01

In this paper, we present three retrospective observational studies that investigate the relation between data sharing and statistical reporting inconsistencies. Previous research found that reluctance to share data was related to a higher prevalence of statistical errors, often in the direction of
Statistical Power Analysis with Missing Data A Structural Equation Modeling Approach

CERN Document Server

Davey, Adam

2009-01-01

Statistical power analysis has revolutionized the ways in which we conduct and evaluate research. Similar developments in the statistical analysis of incomplete (missing) data are gaining more widespread applications. This volume brings statistical power and incomplete data together under a common framework, in a way that is readily accessible to those with only an introductory familiarity with structural equation modeling. It answers many practical questions such as: How missing data affects the statistical power in a study How much power is likely with different amounts and types
IGESS: a statistical approach to integrating individual-level genotype data and summary statistics in genome-wide association studies.

Science.gov (United States)

Dai, Mingwei; Ming, Jingsi; Cai, Mingxuan; Liu, Jin; Yang, Can; Wan, Xiang; Xu, Zongben

2017-09-15

Results from genome-wide association studies (GWAS) suggest that a complex phenotype is often affected by many variants with small effects, known as 'polygenicity'. Tens of thousands of samples are often required to ensure statistical power of identifying these variants with small effects. However, it is often the case that a research group can only get approval for the access to individual-level genotype data with a limited sample size (e.g. a few hundreds or thousands). Meanwhile, summary statistics generated using single-variant-based analysis are becoming publicly available. The sample sizes associated with the summary statistics datasets are usually quite large. How to make the most efficient use of existing abundant data resources largely remains an open question. In this study, we propose a statistical approach, IGESS, to increasing statistical power of identifying risk variants and improving accuracy of risk prediction by integrating individual-level genotype data and summary statistics. An efficient algorithm based on variational inference is developed to handle the genome-wide analysis. Through comprehensive simulation studies, we demonstrated the advantages of IGESS over the methods which take either individual-level data or summary statistics data as input. We applied IGESS to perform integrative analysis of Crohn's Disease from WTCCC and summary statistics from other studies. IGESS was able to significantly increase the statistical power of identifying risk variants and improve the risk prediction accuracy from 63.2% (±0.4%) to 69.4% (±0.1%) using about 240 000 variants. The IGESS software is available at https://github.com/daviddaigithub/IGESS. Contact: zbxu@xjtu.edu.cn or xwan@comp.hkbu.edu.hk or eeyang@hkbu.edu.hk. Supplementary data are available at Bioinformatics online. © The Author (2017). Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com
Software development for statistical handling of dosimetric and epidemiological data base

International Nuclear Information System (INIS)

Amaro, M.

1990-01-01

The dose records from different groups of occupationally exposed workers are available in a computerized data base whose main purpose is the individual dose follow-up. Apart from this objective, such a dosimetric data base can be useful for statistical analysis. The statistical information that can be extracted from the data base serves two main objectives: individual and collective dose distributions and statistics, and epidemiological statistics. The report describes the software developed to obtain the statistical reports required by the Regulatory Body, as well as any other type of dose distributions or statistics to be included in epidemiological studies. A User's Guide for the operators who handle this software package, and the code listings, are also included in the report. (Author) 2 refs
Statistical Estimators Using Jointly Administrative and Survey Data to Produce French Structural Business Statistics

Directory of Open Access Journals (Sweden)

Brion Philippe

2015-12-01

Using as much administrative data as possible is a general trend among most national statistical institutes. Different kinds of administrative sources, from tax authorities or other administrative bodies, are very helpful material in the production of business statistics. However, these sources often have to be completed by information collected through statistical surveys. This article describes the way Insee has implemented such a strategy in order to produce French structural business statistics. The originality of the French procedure is that administrative and survey variables are used jointly for the same enterprises, unlike the majority of multisource systems, in which the two kinds of sources generally complement each other for different categories of units. The idea is to use, as much as possible, the richness of the administrative sources combined with the timeliness of a survey, even if the latter is conducted only on a sample of enterprises. One main issue is the classification of enterprises within the NACE nomenclature, which is a cornerstone variable in producing the breakdown of the results by industry. At a given date, two values of the corresponding code may coexist: the value of the register, not necessarily up to date, and the value resulting from the data collected via the survey, but only from a sample of enterprises. Using all this information together requires the implementation of specific statistical estimators combining some properties of the difference estimators with calibration techniques. This article presents these estimators, as well as their statistical properties, and compares them with those of other methods.
Statistical Approaches to Assess Biosimilarity from Analytical Data.

Science.gov (United States)

Burdick, Richard; Coffey, Todd; Gutka, Hiten; Gratzl, Gyöngyi; Conlon, Hugh D; Huang, Chi-Ting; Boyne, Michael; Kuehne, Henriette

2017-01-01

Protein therapeutics have unique critical quality attributes (CQAs) that define their purity, potency, and safety. The analytical methods used to assess CQAs must be able to distinguish clinically meaningful differences in comparator products, and the most important CQAs should be evaluated with the most statistical rigor. High-risk CQA measurements assess the most important attributes that directly impact the clinical mechanism of action or have known implications for safety, while the moderate- to low-risk characteristics may have a lower direct impact and thereby may have a broader range to establish similarity. Statistical equivalence testing is applied for high-risk CQA measurements to establish the degree of similarity (e.g., highly similar fingerprint, highly similar, or similar) of selected attributes. Notably, some high-risk CQAs (e.g., primary sequence or disulfide bonding) are qualitative (e.g., the same as the originator or not the same) and therefore not amenable to equivalence testing. For biosimilars, an important step is the acquisition of a sufficient number of unique originator drug product lots to measure the variability in the originator drug manufacturing process and provide sufficient statistical power for the analytical data comparisons. Together, these analytical evaluations, along with PK/PD and safety data (immunogenicity), provide the data necessary to determine if the totality of the evidence warrants a designation of biosimilarity and subsequent licensure for marketing in the USA. In this paper, a case study approach is used to provide examples of analytical similarity exercises and the appropriateness of statistical approaches for the example data.
Improved custom statistics visualization for CA Performance Center data

CERN Document Server

Talevi, Iacopo

2017-01-01

The main goal of my project is to understand and experiment with the possibilities that CA Performance Center (CA PC) offers for creating custom applications to display stored information through interesting visual means, such as maps. In particular, I have rewritten some of the network statistics web pages so that they fetch data from new statistics modules in CA PC, which has its own API, and no longer use the RRD data.
Statistical analysis of dragline monitoring data

Energy Technology Data Exchange (ETDEWEB)

Mirabediny, H.; Baafi, E.Y. [University of Tehran, Tehran (Iran)]

1998-07-01

Dragline monitoring systems are normally the best tool used to collect data on the machine performance and operational parameters of a dragline operation. This paper discusses results of a time study using data from a dragline monitoring system captured over a four month period. Statistical summaries of the time study in terms of average values, standard deviation and frequency distributions showed that the mode of operation and the geological conditions have a significant influence on the dragline performance parameters. 6 refs., 14 figs., 3 tabs.
Hearing Loss in Children: Data and Statistics

Science.gov (United States)

Statistical Analysis of Data for Timber Strengths

DEFF Research Database (Denmark)

Sørensen, John Dalsgaard

2003-01-01

Statistical analyses are performed for material strength parameters from a large number of specimens of structural timber. Non-parametric statistical analysis and fits have been investigated for the following distribution types: Normal, Lognormal, 2 parameter Weibull and 3-parameter Weibull...... fits to the data available, especially if tail fits are used whereas the Log Normal distribution generally gives a poor fit and larger coefficients of variation, especially if tail fits are used. The implications on the reliability level of typical structural elements and on partial safety factors...... for timber are investigated....
Statistics of meteorological data at Tokai Research Establishment in JAERI

International Nuclear Information System (INIS)

Sekita, Tsutomu; Tachibana, Haruo; Matsuura, Kenichi; Yamaguchi, Takenori

2003-12-01

The meteorological observation data at Tokai site were analyzed statistically based on a 'Guideline of meteorological statistics for the safety analysis of nuclear power reactor' (Nuclear Safety Commission on January 28, 1982; revised on March 29, 2001). This report shows the meteorological analysis of wind direction, wind velocity and atmospheric stability etc. to assess the public dose around the Tokai site caused by the released gaseous radioactivity. The statistical period of meteorological data is every 5 years from 1981 to 1995. (author)
Drug safety data mining with a tree-based scan statistic.

Science.gov (United States)

Kulldorff, Martin; Dashevsky, Inna; Avery, Taliser R; Chan, Arnold K; Davis, Robert L; Graham, David; Platt, Richard; Andrade, Susan E; Boudreau, Denise; Gunter, Margaret J; Herrinton, Lisa J; Pawloski, Pamala A; Raebel, Marsha A; Roblin, Douglas; Brown, Jeffrey S

2013-05-01

In post-marketing drug safety surveillance, data mining can potentially detect rare but serious adverse events. Assessing an entire collection of drug-event pairs is traditionally performed on a predefined level of granularity. It is unknown a priori whether a drug causes a very specific or a set of related adverse events, such as mitral valve disorders, all valve disorders, or different types of heart disease. This methodological paper evaluates the tree-based scan statistic data mining method to enhance drug safety surveillance. We use a three-million-member electronic health records database from the HMO Research Network. Using the tree-based scan statistic, we assess the safety of selected antifungal and diabetes drugs, simultaneously evaluating overlapping diagnosis groups at different granularity levels, adjusting for multiple testing. Expected and observed adverse event counts were adjusted for age, sex, and health plan, producing a log likelihood ratio test statistic. Out of 732 evaluated disease groupings, 24 were statistically significant, divided among 10 non-overlapping disease categories. Five of the 10 signals are known adverse effects, four are likely due to confounding by indication, while one may warrant further investigation. The tree-based scan statistic can be successfully applied as a data mining tool in drug safety surveillance using observational data. The total number of statistical signals was modest and does not imply a causal relationship. Rather, data mining results should be used to generate candidate drug-event pairs for rigorous epidemiological studies to evaluate the individual and comparative safety profiles of drugs. Copyright © 2013 John Wiley & Sons, Ltd.
Data and Statistics on New York's Mining Resources - NYS Dept. of

Science.gov (United States)

Statistics on New York's Mining Resources: Mines in New York - Information on active mines in New York State
Data Mining and Statistics for Decision Making

CERN Document Server

Tufféry, Stéphane

2011-01-01

Data mining is the process of automatically searching large volumes of data for models and patterns using computational techniques from statistics, machine learning and information theory; it is the ideal tool for such an extraction of knowledge. Data mining is usually associated with a business or an organization's need to identify trends and profiles, allowing, for example, retailers to discover patterns on which to base marketing objectives. This book looks at both classical and recent techniques of data mining, such as clustering, discriminant analysis, logistic regression, generalized lin
Data Warehousing: How To Make Your Statistics Meaningful.

Science.gov (United States)

Flaherty, William

2001-01-01

Examines how one school district found a way to turn data collection from a disparate mountain of statistics into more useful information by using their Instructional Decision Support System. System software is explained as is how the district solved some data management challenges. (GR)

Software for statistical data analysis used in Higgs searches

International Nuclear Information System (INIS)

Gumpert, Christian; Moneta, Lorenzo; Cranmer, Kyle; Kreiss, Sven; Verkerke, Wouter

2014-01-01

The analysis and interpretation of data collected by the Large Hadron Collider (LHC) requires advanced statistical tools in order to quantify the agreement between observation and theoretical models. RooStats is a project providing a statistical framework for data analysis with the focus on discoveries, confidence intervals and combination of different measurements in both Bayesian and frequentist approaches. It employs the RooFit data modelling language where mathematical concepts such as variables, (probability density) functions and integrals are represented as C++ objects. RooStats and RooFit rely on the persistency technology of the ROOT framework. The usage of a common data format enables the concept of digital publishing of complicated likelihood functions. The statistical tools have been developed in close collaboration with the LHC experiments to ensure their applicability to real-life use cases. Numerous physics results have been produced using the RooStats tools, with the discovery of the Higgs boson by the ATLAS and CMS experiments being certainly the most popular among them. We will discuss tools currently used by LHC experiments to set exclusion limits, to derive confidence intervals and to estimate discovery significances based on frequentist statistics and the asymptotic behaviour of likelihood functions. Furthermore, new developments in RooStats and performance optimisation necessary to cope with complex models depending on more than 1000 variables will be reviewed
A statistical study on fracture toughness data of Japanese RPVS

International Nuclear Information System (INIS)

Sakai, Y.; Ogura, N.

1987-01-01

In a cooperative study of the fracture toughness of pressure vessel steels produced in Japan, a number of heats of ASTM A533B cl.1 and A508 cl.3 steels have been studied. Approximately 3000 fracture toughness data and 8000 mechanical properties data were obtained and filed in a computer data bank. Statistical characterization of toughness data in the transition region has been carried out using the computer data bank. A curve fitting technique for toughness data has been examined, and an approach using a function to model the transition behaviour of each toughness has been applied. The aims of the curve fitting technique were as follows: (1) summarizing an enormous toughness data base to permit comparison of heats, materials and testing methods; (2) investigating the relationships among static, dynamic and arrest toughness; (3) examining the ASME K(IR) curve statistically. The methodology used in this study for analyzing a large quantity of fracture toughness data was found to be useful for formulating a statistically based K(IR) curve. (orig./HP)
Statistical methods to evaluate thermoluminescence ionizing radiation dosimetry data

International Nuclear Information System (INIS)

Segre, Nadia; Matoso, Erika; Fagundes, Rosane Correa

2011-01-01

Ionizing radiation levels, evaluated through the exposure of CaF2:Dy thermoluminescence dosimeters (TLD-200), have been monitored at Centro Experimental Aramar (CEA), located at Ipero in Sao Paulo state, Brazil, since 1991, resulting in a large number of measurements up to 2009 (more than 2,000). The volume of data, together with the dispersion of the measurements (every process has some deviation), calls for statistical tools to evaluate the results, a procedure also required by the Brazilian Standard CNEN-NN-3.01/PR-3.01-008, which regulates radiometric environmental monitoring. Thermoluminescence ionizing radiation dosimetry data are statistically compared in order to evaluate the potential environmental impact of CEA's activities. The statistical tools discussed in this work are box plots, control charts and analysis of variance. (author)
General statistical data structure for epidemiologic studies of DOE workers

International Nuclear Information System (INIS)

Frome, E.L.; Hudson, D.R.

1981-01-01

Epidemiologic studies to evaluate the occupational risks associated with employment in the nuclear industry are currently being conducted by the Department of Energy. Data that have potential value in evaluating any long-term health effects of occupational exposure to low levels of radiation are obtained for each individual at a given facility. We propose a general data structure for statistical analysis that is used to define transformations from the data management system into the data analysis system. Statistical methods of interest in epidemiologic studies include contingency table analysis and survival analysis procedures that can be used to evaluate potential associations between occupational radiation exposure and mortality. The purposes of this paper are to discuss (1) the adequacy of this data structure for single- and multiple-facility analysis and (2) the statistical computing problems encountered in dealing with large populations over extended periods of time
Statistical analysis of network data with R

CERN Document Server

Kolaczyk, Eric D

2014-01-01

Networks have permeated everyday life through everyday realities like the Internet, social networks, and viral marketing. As such, network analysis is an important growth area in the quantitative sciences, with roots in social network analysis going back to the 1930s and graph theory going back centuries. Measurement and analysis are integral components of network research. As a result, statistical methods play a critical role in network analysis. This book is the first of its kind in network research. It can be used as a stand-alone resource in which multiple R packages are used to illustrate how to conduct a wide range of network analyses, from basic manipulation and visualization, to summary and characterization, to modeling of network data. The central package is igraph, which provides extensive capabilities for studying network graphs in R. This text builds on Eric D. Kolaczyk’s book Statistical Analysis of Network Data (Springer, 2009).
Using statistical correlation to compare geomagnetic data sets

Science.gov (United States)

Stanton, T.

2009-04-01

The major features of data curves are often matched, to a first order, by bump and wiggle matching to arrive at an offset between data sets. This poster describes a simple statistical correlation program that has proved useful during this stage by determining the optimal correlation between geomagnetic curves using a variety of fixed and floating windows. Its utility is suggested by the fact that it is simple to run, yet generates meaningful data comparisons, often when data noise precludes the obvious matching of curve features. Data sets can be scaled, smoothed, normalised and standardised, before all possible correlations are carried out between selected overlapping portions of each curve. Best-fit offset curves can then be displayed graphically. The program was used to cross-correlate directional and palaeointensity data from Holocene lake sediments (Stanton et al., submitted) and Holocene lava flows. Some example curve matches are shown, including some that illustrate the potential of this technique when examining particularly sparse data sets. Stanton, T., Snowball, I., Zillén, L. and Wastegård, S., submitted. Detecting potential errors in varve chronology and 14C ages using palaeosecular variation curves, lead pollution history and statistical correlation. Quaternary Geochronology.
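The offset search described in this abstract can be illustrated with a minimal Python sketch. It is not the author's program: the function names (`pearson`, `best_offset`), the integer-lag search, and the `min_overlap` parameter are assumptions made for illustration, and the scaling, smoothing and floating-window options mentioned in the abstract are omitted.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def best_offset(ref, other, max_lag, min_overlap=3):
    """Try every integer lag in [-max_lag, max_lag] and return (lag, r) with the
    highest correlation, where other[i] is compared with ref[i + lag]."""
    best_lag, best_r = None, -2.0
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            a = ref[lag:]
            b = other
        else:
            a = ref
            b = other[-lag:]
        n = min(len(a), len(b))  # length of the overlapping portion
        if n < min_overlap:
            continue
        r = pearson(a[:n], b[:n])
        if r > best_r:
            best_lag, best_r = lag, r
    return best_lag, best_r


# Hypothetical curves: `shifted` is `reference` moved by two samples.
reference = [0, 1, 3, 7, 4, 2, 1, 0, 0, 0]
shifted = reference[2:] + [0, 0]
lag, r = best_offset(reference, shifted, max_lag=4)
print(lag, round(r, 3))
```

Restricting the search with `min_overlap` is a design choice of this sketch: very short overlaps can produce high correlations by chance, which matters for the sparse data sets the abstract mentions.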
A log-Weibull spatial scan statistic for time to event data.

Science.gov (United States)

Usman, Iram; Rosychuk, Rhonda J

2018-06-13

Spatial scan statistics have been used for the identification of geographic clusters of elevated numbers of cases of a condition such as disease outbreaks. These statistics accompanied by the appropriate distribution can also identify geographic areas with either longer or shorter time to events. Other authors have proposed the spatial scan statistics based on the exponential and Weibull distributions. We propose the log-Weibull as an alternative distribution for the spatial scan statistic for time to events data and compare and contrast the log-Weibull and Weibull distributions through simulation studies. The effect of type I differential censoring and power have been investigated through simulated data. Methods are also illustrated on time to specialist visit data for discharged patients presenting to emergency departments for atrial fibrillation and flutter in Alberta during 2010-2011. We found northern regions of Alberta had longer times to specialist visit than other areas. We proposed the spatial scan statistic for the log-Weibull distribution as a new approach for detecting spatial clusters for time to event data. The simulation studies suggest that the test performs well for log-Weibull data.
Statistics and data analysis for financial engineering with R examples

CERN Document Server

Ruppert, David

2015-01-01

The new edition of this influential textbook, geared towards graduate or advanced undergraduate students, teaches the statistics necessary for financial engineering. In doing so, it illustrates concepts using financial markets and economic data, R Labs with real-data exercises, and graphical and analytic methods for modeling and diagnosing modeling errors. Financial engineers now have access to enormous quantities of data. To make use of these data, the powerful methods in this book, particularly about volatility and risks, are essential. Strengths of this fully-revised edition include major additions to the R code and the advanced topics covered. Individual chapters cover, among other topics, multivariate distributions, copulas, Bayesian computations, risk management, multivariate volatility and cointegration. Suggested prerequisites are basic knowledge of statistics and probability, matrices and linear algebra, and calculus. There is an appendix on probability, statistics and linear algebra. Practicing fina...
Statistical distributions as applied to environmental surveillance data

International Nuclear Information System (INIS)

Speer, D.R.; Waite, D.A.

1975-09-01

Application of normal, log normal, and Weibull distributions to environmental surveillance data was investigated for approximately 300 nuclide-medium-year-location combinations. Corresponding W test calculations were made to determine the probability of a particular data set falling within the distribution of interest. Conclusions are drawn as to the fit of any data group to the various distributions. The significance of fitting statistical distributions to the data is discussed
Tips and Tricks for Successful Application of Statistical Methods to Biological Data.

Science.gov (United States)

Schlenker, Evelyn

2016-01-01

This chapter discusses experimental design and use of statistics to describe characteristics of data (descriptive statistics) and inferential statistics that test the hypothesis posed by the investigator. Inferential statistics, based on probability distributions, depend upon the type and distribution of the data. For data that are continuous, randomly and independently selected, as well as normally distributed, more powerful parametric tests such as Student's t test and analysis of variance (ANOVA) can be used. For non-normally distributed or skewed data, transformation of the data (using logarithms) may normalize the data, allowing use of parametric tests. Alternatively, with skewed data nonparametric tests can be utilized, some of which rely on data that are ranked prior to statistical analysis. Experimental designs and analyses need to balance between committing type 1 errors (false positives) and type 2 errors (false negatives). For a variety of clinical studies that determine risk or benefit, relative risk ratios (random clinical trials and cohort studies) or odds ratios (case-control studies) are utilized. Although both use 2 × 2 tables, their premise and calculations differ. Finally, special statistical methods are applied to microarray and proteomics data, since the large number of genes or proteins evaluated increases the likelihood of false discoveries. Additional studies in separate samples are used to verify microarray and proteomic data. Examples in this chapter and references are available to help continued investigation of experimental designs and appropriate data analysis.
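To make the difference between the two 2 × 2 table measures concrete, here is a minimal Python sketch using the standard textbook definitions. The cell labels `a`, `b`, `c`, `d` and the counts are hypothetical and do not come from the chapter.

```python
def relative_risk(a, b, c, d):
    """Relative risk from a 2 x 2 table.

    a: exposed with event      b: exposed without event
    c: unexposed with event    d: unexposed without event
    """
    risk_exposed = a / (a + b)
    risk_unexposed = c / (c + d)
    return risk_exposed / risk_unexposed


def odds_ratio(a, b, c, d):
    """Odds ratio from the same table: (a/b) / (c/d) = (a*d) / (b*c)."""
    return (a * d) / (b * c)


# Hypothetical counts: 20 of 100 exposed and 10 of 100 unexposed have the event.
rr = relative_risk(20, 80, 10, 90)
odds = odds_ratio(20, 80, 10, 90)
print(rr, odds)
```

The relative risk divides by group totals, which are only meaningful when the groups are followed forward (trials, cohorts). The odds ratio uses only the cell counts, which is why it remains usable in case-control designs where group totals are fixed by the sampling.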
Journal Data Sharing Policies and Statistical Reporting Inconsistencies in Psychology

Directory of Open Access Journals (Sweden)

Michèle B. Nuijten

2017-12-01

In this paper, we present three retrospective observational studies that investigate the relation between data sharing and statistical reporting inconsistencies. Previous research found that reluctance to share data was related to a higher prevalence of statistical errors, often in the direction of statistical significance (Wicherts, Bakker, & Molenaar, 2011). We therefore hypothesized that journal policies about data sharing and data sharing itself would reduce these inconsistencies. In Study 1, we compared the prevalence of reporting inconsistencies in two similar journals on decision making with different data sharing policies. In Study 2, we compared reporting inconsistencies in psychology articles published in PLOS journals (with a data sharing policy) and Frontiers in Psychology (without a stipulated data sharing policy). In Study 3, we looked at papers published in the journal Psychological Science to check whether papers with or without an Open Practice Badge differed in the prevalence of reporting errors. Overall, we found no relationship between data sharing and reporting inconsistencies. We did find that journal policies on data sharing seem extremely effective in promoting data sharing. We argue that open data is essential in improving the quality of psychological science, and we discuss ways to detect and reduce reporting inconsistencies in the literature.
Solar radiation data - statistical analysis and simulation models

Energy Technology Data Exchange (ETDEWEB)

Mustacchi, C; Cena, V; Rocchi, M; Haghigat, F

1984-01-01

The activities consisted of collecting meteorological data on magnetic tape for ten European locations (with latitudes ranging from 42° to 56° N), analysing the multi-year sequences, developing mathematical models to generate synthetic sequences having the same statistical properties as the original data sets, and producing one or more Short Reference Years (SRYs) for each location. The meteorological parameters examined were (for all the locations) global + diffuse radiation on horizontal surface, dry bulb temperature, sunshine duration. For some of the locations additional parameters were available, namely, global, beam and diffuse radiation on surfaces other than horizontal, wet bulb temperature, wind velocity, cloud type, cloud cover. The statistical properties investigated were mean, variance, autocorrelation, crosscorrelation with selected parameters, probability density function. For all the meteorological parameters, various mathematical models were built: linear regression, stochastic models of the AR and the DAR type. In each case, the model with the best statistical behaviour was selected for the production of an SRY for the relevant parameter/location.
Testing independence of bivariate interval-censored data using modified Kendall's tau statistic.

Science.gov (United States)

Kim, Yuneung; Lim, Johan; Park, DoHwan

2015-11-01

In this paper, we study a nonparametric procedure to test independence of bivariate interval-censored data, for both current status data (case 1 interval-censored data) and case 2 interval-censored data. To do so, we propose a score-based modification of the Kendall's tau statistic for bivariate interval-censored data. Our modification defines the Kendall's tau statistic with expected numbers of concordant and discordant pairs of data. The performance of the modified approach is illustrated by simulation studies and application to the AIDS study. We compare our method to alternative approaches such as the two-stage estimation method by Sun et al. (Scandinavian Journal of Statistics, 2006) and the multiple imputation method by Betensky and Finkelstein (Statistics in Medicine, 1999b). © 2015 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
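For orientation, the ordinary Kendall's tau for fully observed data counts concordant and discordant pairs directly, as in the minimal Python sketch below. This is the classical tau-a statistic, not the authors' modified version, which (per the abstract) replaces these counts with expected numbers under interval censoring. The function name and sample data are hypothetical.

```python
from itertools import combinations


def kendall_tau_a(pairs):
    """Classical Kendall's tau-a for a list of fully observed (x, y) pairs.

    A pair of observations is concordant when x and y differ in the same
    direction, discordant when they differ in opposite directions; ties
    count as neither.
    """
    concordant = 0
    discordant = 0
    for (x1, y1), (x2, y2) in combinations(pairs, 2):
        s = (x1 - x2) * (y1 - y2)
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n = len(pairs)
    total_pairs = n * (n - 1) // 2
    return (concordant - discordant) / total_pairs


# Hypothetical data: two concordant pairs and one discordant pair.
tau = kendall_tau_a([(1, 2), (2, 1), (3, 3)])
print(tau)
```

With interval-censored observations the ordering of two data points is not always known, so a pair cannot simply be labelled concordant or discordant; that is the gap the modified statistic in the abstract is meant to fill.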
Analyzing sickness absence with statistical models for survival data

DEFF Research Database (Denmark)

Christensen, Karl Bang; Andersen, Per Kragh; Smith-Hansen, Lars

2007-01-01

OBJECTIVES: Sickness absence is the outcome in many epidemiologic studies and is often based on summary measures such as the number of sickness absences per year. In this study the use of modern statistical methods was examined by making better use of the available information. Since sickness...... absence data deal with events occurring over time, the use of statistical models for survival data has been reviewed, and the use of frailty models has been proposed for the analysis of such data. METHODS: Three methods for analyzing data on sickness absences were compared using a simulation study...... involving the following: (i) Poisson regression using a single outcome variable (number of sickness absences), (ii) analysis of time to first event using the Cox proportional hazards model, and (iii) frailty models, which are random effects proportional hazards models. Data from a study of the relation...
Statistical yearbook 2005. Data available as of March 2006. 50 ed

International Nuclear Information System (INIS)

2006-08-01

The Statistical Yearbook is an annual compilation of a wide range of international economic, social and environmental statistics on over 200 countries and areas, compiled from sources including UN agencies and other international, national and specialized organizations. The 50th issue contains data available to the Statistics Division as of March 2006 and presents them in 76 tables. The number of years of data shown in the tables varies from one to ten, with the ten-year tables covering 1994 to 2003 or 1995 to 2004. Accompanying the tables are technical notes providing brief descriptions of major statistical concepts, definitions and classifications
Performing Inferential Statistics Prior to Data Collection

Science.gov (United States)

Trafimow, David; MacDonald, Justin A.

2017-01-01

Typically, in education and psychology research, the investigator collects data and subsequently performs descriptive and inferential statistics. For example, a researcher might compute group means and use the null hypothesis significance testing procedure to draw conclusions about the populations from which the groups were drawn. We propose an…
Statistical yearbook. 2000. Data available as of 31 January 2003. 47 ed

International Nuclear Information System (INIS)

2003-01-01

This is the forty-seventh issue of the United Nations Statistical Yearbook, prepared by the Statistics Division, Department of Economic and Social Affairs of the United Nations Secretariat, since 1948. The present issue contains series covering, in general, 1989-1998 or 1990-1999, using statistics available to the Statistics Division up to 30 November 2000. The Yearbook is based on data compiled by the Statistics Division from over 40 different international and national sources. These include the United Nations Statistics Division in the fields of national accounts, industry, energy, transport and international trade; the United Nations Statistics Division and Population Division in the field of demographic statistics; and data provided by over 20 offices of the United Nations system and international organizations in other specialized fields. United Nations agencies and other international organizations which furnished data are listed under 'Statistical sources and references' at the end of the Yearbook. Acknowledgement is gratefully made for their generous cooperation in providing data. The Statistics Division also publishes the Monthly Bulletin of Statistics, which provides a valuable complement to the Yearbook covering current international economic statistics for most countries and areas of the world and quarterly world and regional aggregates. Subscribers to the Monthly Bulletin of Statistics may also access the Bulletin on-line via the World Wide Web on Internet. MBS On-line allows time-sensitive statistics to reach users much faster than the traditional print publication. For further information see . The present issue of the Yearbook reflects a phased programme of major changes in its organization and presentation undertaken in 1990 which until then was relatively unchanged since the first issue was released in 1948. The Yearbook has also been published on CD-ROM for IBM-compatible microcomputers, since the thirty-eighth issue
Citizen Data and Official Statistics: Background Document to a Collaborative Workshop

DEFF Research Database (Denmark)

Grommé, Francisca; Ustek, Funda; Ruppert, Evelyn

2017-01-01

This working paper was written in preparation for a collaborative workshop organised for statisticians, social scientists, information and app designers and other participants inside and outside academia. The autumn 2017 workshop aimed to develop the main principles for a citizen data app...... for official statistics. Through this work we sought to conceive of a new regime of data collection in official statistics through different devices. How can we capture citizens’ meanings and intentions when they produce data? Can we develop ‘smart’ methods that do not rely on cooperating with, and data...... generated by, large tech companies, but by developing methods and data co-produced with citizens? Towards addressing these issues we developed four key concepts outlined in this document: experimentalism, citizen data, smart statistics and privacy by design. We introduced these concepts to facilitate shared...
Statistical multistep direct and statistical multistep compound models for calculations of nuclear data for applications

International Nuclear Information System (INIS)

Seeliger, D.

1993-01-01

This contribution contains a brief presentation and comparison of the different Statistical Multistep Approaches, presently available for practical nuclear data calculations. (author). 46 refs, 5 figs
Small Sample Statistics for Incomplete Nonnormal Data: Extensions of Complete Data Formulae and a Monte Carlo Comparison

Science.gov (United States)

Savalei, Victoria

2010-01-01

Incomplete nonnormal data are common occurrences in applied research. Although these 2 problems are often dealt with separately by methodologists, they often cooccur. Very little has been written about statistics appropriate for evaluating models with such data. This article extends several existing statistics for complete nonnormal data to…

Using Data Mining to Teach Applied Statistics and Correlation

Science.gov (United States)

Hartnett, Jessica L.

2016-01-01

This article describes two class activities that introduce the concept of data mining and very basic data mining analyses. Assessment data suggest that students learned some of the conceptual basics of data mining, understood some of the ethical concerns related to the practice, and were able to perform correlations via the Statistical Package for…
LSD Dimensions: Use and Reuse of Linked Statistical Data

NARCIS (Netherlands)

Meroño-Peñuela, Albert

2014-01-01

RDF Data Cube (QB) has boosted the publication of Linked Statistical Data (LSD) on the Web, making them linkable to other related datasets and concepts following the Linked Data paradigm. In this demo we present LSD Dimensions, a web based application that monitors the usage of dimensions and codes
Developing Statistical Literacy Using Real-World Data: Investigating Socioeconomic Secondary Data Resources Used in Research and Teaching

Science.gov (United States)

Carter, Jackie; Noble, Susan; Russell, Andrew; Swanson, Eric

2011-01-01

Increasing volumes of statistical data are being made available on the open web, including from the World Bank. This "data deluge" provides both opportunities and challenges. Good use of these data requires statistical literacy. This paper presents results from a project that set out to better understand how socioeconomic secondary data…
Statistical summaries of selected Iowa streamflow data through September 2013

Science.gov (United States)

Eash, David A.; O'Shea, Padraic S.; Weber, Jared R.; Nguyen, Kevin T.; Montgomery, Nicholas L.; Simonson, Adrian J.

2016-01-04

Statistical summaries of streamflow data collected at 184 streamgages in Iowa are presented in this report. All streamgages included for analysis have at least 10 years of continuous record collected before or through September 2013. This report is an update to two previously published reports that presented statistical summaries of selected Iowa streamflow data through September 1988 and September 1996. The statistical summaries include (1) monthly and annual flow durations, (2) annual exceedance probabilities of instantaneous peak discharges (flood frequencies), (3) annual exceedance probabilities of high discharges, and (4) annual nonexceedance probabilities of low discharges and seasonal low discharges. Also presented for each streamgage are graphs of the annual mean discharges, mean annual mean discharges, 50-percent annual flow-duration discharges (median flows), harmonic mean flows, mean daily mean discharges, and flow-duration curves. Two sets of statistical summaries are presented for each streamgage, which include (1) long-term statistics for the entire period of streamflow record and (2) recent-term statistics for or during the 30-year period of record from 1984 to 2013. The recent-term statistics are only calculated for streamgages with streamflow records pre-dating the 1984 water year and with at least 10 years of record during 1984–2013. The streamflow statistics in this report are not adjusted for the effects of water use; although some of this water is used consumptively, most of it is returned to the streams.
Outpatient health care statistics data warehouse--implementation.

Science.gov (United States)

Zilli, D

1999-01-01

Data warehouse implementation is assumed to be a very knowledge-demanding, expensive and long-lasting process. As such it requires senior management sponsorship, involvement of experts, a big budget and probably years of development time. The Outpatient Health Care Statistics Data Warehouse implementation research presented here provides ample evidence against the infallibility of the above statements. New, inexpensive, but powerful technology, which provides an outstanding platform for On-Line Analytical Processing (OLAP), has emerged recently. Presumably, it will be the basis for the estimated future growth of the data warehouse market, both in the medical and in other business fields. Methods and tools for building, maintaining and exploiting data warehouses are also briefly discussed in the paper.
42 CFR 417.806 - Financial records, statistical data, and cost finding.

Science.gov (United States)

2010-10-01

... 42 Public Health 3 2010-10-01 2010-10-01 false Financial records, statistical data, and cost... MEDICAL PLANS, AND HEALTH CARE PREPAYMENT PLANS Health Care Prepayment Plans § 417.806 Financial records, statistical data, and cost finding. (a) The principles specified in § 417.568 apply to HCPPs, except those in...
Statistical analysis and interpolation of compositional data in materials science.

Science.gov (United States)

Pesenson, Misha Z; Suram, Santosh K; Gregoire, John M

2015-02-09

Compositional data are ubiquitous in chemistry and materials science: analysis of elements in multicomponent systems, combinatorial problems, etc., lead to data that are non-negative and sum to a constant (for example, atomic concentrations). The constant sum constraint restricts the sampling space to a simplex instead of the usual Euclidean space. Since statistical measures such as mean and standard deviation are defined for the Euclidean space, traditional correlation studies, multivariate analysis, and hypothesis testing may lead to erroneous dependencies and incorrect inferences when applied to compositional data. Furthermore, composition measurements that are used for data analytics may not include all of the elements contained in the material; that is, the measurements may be subcompositions of a higher-dimensional parent composition. Physically meaningful statistical analysis must yield results that are invariant under the number of composition elements, requiring the application of specialized statistical tools. We present specifics and subtleties of compositional data processing through discussion of illustrative examples. We introduce basic concepts, terminology, and methods required for the analysis of compositional data and utilize them for the spatial interpolation of composition in a sputtered thin film. The results demonstrate the importance of this mathematical framework for compositional data analysis (CDA) in the fields of materials science and chemistry.
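The constant-sum constraint described above can be illustrated with a log-ratio transform. The abstract does not say which transforms the authors use; the centered log-ratio (clr) shown here is one standard tool from compositional data analysis and is an assumption of this sketch, as are the function names `closure` and `clr`. All parts are assumed strictly positive.

```python
import math


def closure(parts):
    """Rescale positive parts so they sum to 1 (a point on the simplex)."""
    total = sum(parts)
    return [p / total for p in parts]


def clr(parts):
    """Centered log-ratio transform: log of each part minus the mean log.

    Assumes every part is strictly positive; zeros need separate handling
    that this sketch does not attempt.
    """
    logs = [math.log(p) for p in closure(parts)]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]


if __name__ == "__main__":
    # Hypothetical atomic concentrations of a three-component system.
    print(clr([0.2, 0.3, 0.5]))
```

The transformed values sum to zero and do not change if the raw measurements are rescaled by a common factor. The difference between two clr coordinates equals the log of the ratio of the corresponding parts, so it does not depend on which other elements were measured.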
A note on the kappa statistic for clustered dichotomous data.

Science.gov (United States)

Zhou, Ming; Yang, Zhao

2014-06-30

The kappa statistic is widely used to assess the agreement between two raters. Motivated by a simulation-based cluster bootstrap method to calculate the variance of the kappa statistic for clustered physician-patients dichotomous data, we investigate its special correlation structure and develop a new simple and efficient data generation algorithm. For the clustered physician-patients dichotomous data, based on the delta method and its special covariance structure, we propose a semi-parametric variance estimator for the kappa statistic. An extensive Monte Carlo simulation study is performed to evaluate the performance of the new proposal and five existing methods with respect to the empirical coverage probability, root-mean-square error, and average width of the 95% confidence interval for the kappa statistic. The variance estimator ignoring the dependence within a cluster is generally inappropriate, and the variance estimators from the new proposal, bootstrap-based methods, and the sampling-based delta method perform reasonably well for at least a moderately large number of clusters (e.g., the number of clusters K ⩾50). The new proposal and sampling-based delta method provide convenient tools for efficient computations and non-simulation-based alternatives to the existing bootstrap-based methods. Moreover, the new proposal has acceptable performance even when the number of clusters is as small as K = 25. To illustrate the practical application of all the methods, one psychiatric research data and two simulated clustered physician-patients dichotomous data are analyzed. Copyright © 2014 John Wiley & Sons, Ltd.
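For orientation, the point estimate of the kappa statistic for two raters and a dichotomous rating can be sketched as follows. This is the textbook Cohen's kappa only. It does not implement the clustered variance estimators compared in the abstract, and the function name `cohen_kappa` and the 0/1 coding are assumptions of this example.

```python
def cohen_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters giving 0/1 ratings to the same subjects."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two non-empty rating lists of equal length")
    n = len(ratings_a)
    # Observed agreement: share of subjects on which the raters agree.
    p_observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement from each rater's marginal rate of rating 1.
    p_a = sum(ratings_a) / n
    p_b = sum(ratings_b) / n
    p_expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    if p_expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_observed - p_expected) / (1 - p_expected)


if __name__ == "__main__":
    # Hypothetical ratings for eight patients.
    print(cohen_kappa([1, 1, 0, 0, 1, 0, 1, 0], [1, 0, 0, 0, 1, 0, 1, 1]))
```

When the patients are clustered within physicians, this point estimate is computed the same way, but its variance can no longer be obtained as if the observations were independent, which is the problem the abstract addresses.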
Using Carbon Emissions Data to "Heat Up" Descriptive Statistics

Science.gov (United States)

Brooks, Robert

2012-01-01

This article illustrates using carbon emissions data in an introductory statistics assignment. The carbon emissions data has desirable characteristics including: choice of measure; skewness; and outliers. These complexities allow research and public policy debate to be introduced. (Contains 4 figures and 2 tables.)
Statistical methods for longitudinal data with agricultural applications

DEFF Research Database (Denmark)

Anantharama Ankinakatte, Smitha

The PhD study focuses on modeling two kinds of longitudinal data arising in agricultural applications: continuous time series data and discrete longitudinal data. Firstly, two statistical methods, neural networks and generalized additive models, are applied to predict mastitis using multivariate...... algorithm. This was found to compare favourably with the algorithm implemented in the well-known Beagle software. Finally, an R package to apply APFA models developed as part of the PhD project is described...
Applied systems ecology: models, data, and statistical methods

Energy Technology Data Exchange (ETDEWEB)

Eberhardt, L L

1976-01-01

In this report, systems ecology is largely equated to mathematical or computer simulation modelling. The need for models in ecology stems from the necessity to have an integrative device for the diversity of ecological data, much of which is observational, rather than experimental, as well as from the present lack of a theoretical structure for ecology. Different objectives in applied studies require specialized methods. The best predictive devices may be regression equations, often non-linear in form, extracted from much more detailed models. A variety of statistical aspects of modelling, including sampling, are discussed. Several aspects of population dynamics and food-chain kinetics are described, and it is suggested that the two presently separated approaches should be combined into a single theoretical framework. It is concluded that future efforts in systems ecology should emphasize actual data and statistical methods, as well as modelling.
Data base of accident and agricultural statistics for transportation risk assessment

Energy Technology Data Exchange (ETDEWEB)

Saricks, C.L.; Williams, R.G.; Hopf, M.R.

1989-11-01

A state-level data base of accident and agricultural statistics has been developed to support risk assessment for transportation of spent nuclear fuels and high-level radioactive wastes. This data base will enhance the modeling capabilities for more route-specific analyses of potential risks associated with transportation of these wastes to a disposal site. The data base and methodology used to develop state-specific accident and agricultural data bases are described, and summaries of accident and agricultural statistics are provided. 27 refs., 9 tabs.
Data base of accident and agricultural statistics for transportation risk assessment

International Nuclear Information System (INIS)

Saricks, C.L.; Williams, R.G.; Hopf, M.R.

1989-11-01

A state-level data base of accident and agricultural statistics has been developed to support risk assessment for transportation of spent nuclear fuels and high-level radioactive wastes. This data base will enhance the modeling capabilities for more route-specific analyses of potential risks associated with transportation of these wastes to a disposal site. The data base and methodology used to develop state-specific accident and agricultural data bases are described, and summaries of accident and agricultural statistics are provided. 27 refs., 9 tabs
Innovative statistical methods for public health data

CERN Document Server

Wilson, Jeffrey

2015-01-01

The book brings together experts working in public health and multi-disciplinary areas to present recent issues in statistical methodological development and their applications. This timely book will impact model development and data analyses of public health research across a wide spectrum of analysis. Data and software used in the studies are available for the reader to replicate the models and outcomes. The fifteen chapters range in focus from techniques for dealing with missing data with Bayesian estimation, health surveillance and population definition and implications in applied latent class analysis, to multiple comparison and meta-analysis in public health data. Researchers in biomedical and public health research will find this book to be a useful reference, and it can be used in graduate level classes.
Statistical application of groundwater monitoring data at the Hanford Site

International Nuclear Information System (INIS)

Chou, C.J.; Johnson, V.G.; Hodges, F.N.

1993-09-01

Effective use of groundwater monitoring data requires both statistical and geohydrologic interpretations. At the Hanford Site in south-central Washington state such interpretations are used for (1) detection monitoring, assessment monitoring, and/or corrective action at Resource Conservation and Recovery Act sites; (2) compliance testing for operational groundwater surveillance; (3) impact assessments at active liquid-waste disposal sites; and (4) cleanup decisions at Comprehensive Environmental Response Compensation and Liability Act sites. Statistical tests such as the Kolmogorov-Smirnov two-sample test are used to test the hypothesis that chemical concentrations from spatially distinct subsets or populations are identical within the uppermost unconfined aquifer. Experience at the Hanford Site in applying groundwater background data indicates that background must be considered as a statistical distribution of concentrations, rather than a single value or threshold. The use of a single numerical value as a background-based standard ignores important information and may result in excessive or unnecessary remediation. Appropriate statistical evaluation techniques include Wilcoxon rank sum test, Quantile test, "hot spot" comparisons, and Kolmogorov-Smirnov types of tests. Application of such tests is illustrated with several case studies derived from Hanford groundwater monitoring programs. To avoid possible misuse of such data, an understanding of the limitations is needed. In addition to statistical test procedures, geochemical, and hydrologic considerations are integral parts of the decision process. For this purpose a phased approach is recommended that proceeds from simple to the more complex, and from an overview to detailed analysis.
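The Kolmogorov-Smirnov two-sample comparison mentioned in the abstract rests on the largest gap between two empirical distribution functions. The sketch below computes only that statistic, not a p-value or critical value. The function name and the sample concentrations are hypothetical and are not taken from the Hanford studies.

```python
from bisect import bisect_right


def ks_two_sample_statistic(sample_x, sample_y):
    """Largest absolute gap between the empirical CDFs of two samples."""
    xs, ys = sorted(sample_x), sorted(sample_y)
    if not xs or not ys:
        raise ValueError("both samples must be non-empty")
    n, m = len(xs), len(ys)
    d = 0.0
    # Both empirical CDFs are step functions that change only at observed
    # values, so it is enough to compare them at every observed value.
    for v in xs + ys:
        cdf_x = bisect_right(xs, v) / n
        cdf_y = bisect_right(ys, v) / m
        d = max(d, abs(cdf_x - cdf_y))
    return d


if __name__ == "__main__":
    # Hypothetical concentrations from two spatially distinct well groups.
    upgradient = [1.2, 1.5, 1.7, 2.0, 2.1]
    downgradient = [1.6, 2.2, 2.4, 2.9, 3.1]
    print(ks_two_sample_statistic(upgradient, downgradient))
```

Because the statistic compares whole distributions rather than single values, it fits the abstract's point that background should be treated as a distribution of concentrations and not as one threshold.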
Insights in Experimental Data : Interactive Statistics with the ILLMO Program

NARCIS (Netherlands)

Martens, J.B.O.S.

2017-01-01

Empirical researchers turn to statistics to assist them in drawing conclusions, also called inferences, from their collected data. Often, this data is experimental data, i.e., it consists of (repeated) measurements collected in one or more distinct conditions. The observed data can hence be
Dimensional enrichment of statistical linked open data

DEFF Research Database (Denmark)

Varga, Jovan; Vaisman, Alejandro; Romero, Oscar

2016-01-01

On-Line Analytical Processing (OLAP) is a data analysis technique typically used for local and well-prepared data. However, initiatives like Open Data and Open Government bring new and publicly available data on the web that are to be analyzed in the same way. The use of semantic web technologies...... for this context is especially encouraged by the Linked Data initiative. There is already a considerable amount of statistical linked open data sets published using the RDF Data Cube Vocabulary (QB) which is designed for these purposes. However, QB lacks some essential schema constructs (e.g., dimension levels......) to support OLAP. Thus, the QB4OLAP vocabulary has been proposed to extend QB with the necessary constructs and be fully compliant with OLAP. In this paper, we focus on the enrichment of an existing QB data set with QB4OLAP semantics. We first thoroughly compare the two vocabularies and outline the benefits...
Toward Global Comparability of Sexual Orientation Data in Official Statistics: A Conceptual Framework of Sexual Orientation for Health Data Collection in New Zealand's Official Statistics System

Science.gov (United States)

Gray, Alistair; Veale, Jaimie F.; Binson, Diane; Sell, Randell L.

2013-01-01

Objective. Effectively addressing health disparities experienced by sexual minority populations requires high-quality official data on sexual orientation. We developed a conceptual framework of sexual orientation to improve the quality of sexual orientation data in New Zealand's Official Statistics System. Methods. We reviewed conceptual and methodological literature, culminating in a draft framework. To improve the framework, we held focus groups and key-informant interviews with sexual minority stakeholders and producers and consumers of official statistics. An advisory board of experts provided additional guidance. Results. The framework proposes working definitions of the sexual orientation topic and measurement concepts, describes dimensions of the measurement concepts, discusses variables framing the measurement concepts, and outlines conceptual grey areas. Conclusion. The framework proposes standard definitions and concepts for the collection of official sexual orientation data in New Zealand. It presents a model for producers of official statistics in other countries, who wish to improve the quality of health data on their citizens. PMID:23840231
Common misconceptions about data analysis and statistics.

Science.gov (United States)

Motulsky, Harvey J

2014-11-01

Ideally, any experienced investigator with the right tools should be able to reproduce a finding published in a peer-reviewed biomedical science journal. In fact, the reproducibility of a large percentage of published findings has been questioned. Undoubtedly, there are many reasons for this, but one reason may be that investigators fool themselves due to a poor understanding of statistical concepts. In particular, investigators often make these mistakes: 1. P-Hacking. This is when you reanalyze a data set in many different ways, or perhaps reanalyze with additional replicates, until you get the result you want. 2. Overemphasis on P values rather than on the actual size of the observed effect. 3. Overuse of statistical hypothesis testing, and being seduced by the word "significant". 4. Overreliance on standard errors, which are often misunderstood.
Common misconceptions about data analysis and statistics.

Science.gov (United States)

Motulsky, Harvey J

2015-02-01

Ideally, any experienced investigator with the right tools should be able to reproduce a finding published in a peer-reviewed biomedical science journal. In fact, the reproducibility of a large percentage of published findings has been questioned. Undoubtedly, there are many reasons for this, but one reason may be that investigators fool themselves due to a poor understanding of statistical concepts. In particular, investigators often make these mistakes: (1) P-Hacking. This is when you reanalyze a data set in many different ways, or perhaps reanalyze with additional replicates, until you get the result you want. (2) Overemphasis on P values rather than on the actual size of the observed effect. (3) Overuse of statistical hypothesis testing, and being seduced by the word "significant". (4) Overreliance on standard errors, which are often misunderstood.

Statistical methods for data analysis in particle physics

CERN Document Server

AUTHOR|(CDS)2070643

2015-01-01

This concise set of course-based notes provides the reader with the main concepts and tools to perform statistical analysis of experimental data, in particular in the field of high-energy physics (HEP). First, an introduction to probability theory and basic statistics is given, mainly as reminder from advanced undergraduate studies, yet also in view to clearly distinguish the Frequentist versus Bayesian approaches and interpretations in subsequent applications. More advanced concepts and applications are gradually introduced, culminating in the chapter on upper limits as many applications in HEP concern hypothesis testing, where often the main goal is to provide better and better limits so as to be able to distinguish eventually between competing hypotheses or to rule out some of them altogether. Many worked examples will help newcomers to the field and graduate students to understand the pitfalls in applying theoretical concepts to actual data
Study of the effects of photoelectron statistics on Thomson scattering data

International Nuclear Information System (INIS)

Hart, G.W.; Levinton, F.M.; McNeill, D.H.

1986-01-01

A computer code has been developed which simulates a Thomson scattering measurement, from the counting statistics of the input channels through the mathematical analysis of the data. The scattered and background signals in each of the wavelength channels are assumed to obey Poisson statistics, and the spectral data are fitted to a Gaussian curve using a nonlinear least-squares fitting algorithm. This method goes beyond the usual calculation of the signal-to-noise ratio for the hardware and gives a quantitative measure of the effect of the noise on the final measurement. This method is applicable to Thomson scattering measurements in which the signal-to-noise ratio is low due to either low signal or high background. Thomson scattering data from the S-1 spheromak have been compared to this simulation, and they have been found to be in good agreement. This code has proven to be useful in assessing the effects of counting statistics relative to shot-to-shot variability in producing the observed spread in the data. It was also useful for designing improvements for the S-1 Thomson scattering system, and this method would be applicable to any measurement affected by counting statistics
Explorations in Statistics: The Analysis of Ratios and Normalized Data

Science.gov (United States)

Curran-Everett, Douglas

2013-01-01

Learning about statistics is a lot like learning about science: the learning is more meaningful if you can actively explore. This ninth installment of "Explorations in Statistics" explores the analysis of ratios and normalized--or standardized--data. As researchers, we compute a ratio--a numerator divided by a denominator--to compute a…
A new method to determine the number of experimental data using statistical modeling methods

Energy Technology Data Exchange (ETDEWEB)

Jung, Jung-Ho; Kang, Young-Jin; Lim, O-Kaung; Noh, Yoojeong [Pusan National University, Busan (Korea, Republic of)

2017-06-15

For analyzing the statistical performance of physical systems, statistical characteristics of physical parameters such as material properties need to be estimated by collecting experimental data. For accurate statistical modeling, many such experiments may be required, but data are usually quite limited owing to the cost and time constraints of experiments. In this study, a new method for determining a reasonable number of experimental data is proposed using an area metric, after obtaining statistical models using the information on the underlying distribution, the Sequential statistical modeling (SSM) approach, and the Kernel density estimation (KDE) approach. The area metric is used as a convergence criterion to determine the necessary and sufficient number of experimental data to be acquired. The proposed method is validated in simulations, using different statistical modeling methods, different true models, and different convergence criteria. An example data set with 29 data describing the fatigue strength coefficient of SAE 950X is used for demonstrating the performance of the obtained statistical models that use a pre-determined number of experimental data in predicting the probability of failure for a target fatigue life.
Data management in large-scale collaborative toxicity studies: how to file experimental data for automated statistical analysis.

Science.gov (United States)

Stanzel, Sven; Weimer, Marc; Kopp-Schneider, Annette

2013-06-01

High-throughput screening approaches are carried out for the toxicity assessment of a large number of chemical compounds. In such large-scale in vitro toxicity studies several hundred or thousand concentration-response experiments are conducted. The automated evaluation of concentration-response data using statistical analysis scripts saves time and yields more consistent results in comparison to data analysis performed by the use of menu-driven statistical software. Automated statistical analysis requires that concentration-response data are available in a standardised data format across all compounds. To obtain consistent data formats, a standardised data management workflow must be established, including guidelines for data storage, data handling and data extraction. In this paper two procedures for data management within large-scale toxicological projects are proposed. Both procedures are based on Microsoft Excel files as the researcher's primary data format and use a computer programme to automate the handling of data files. The first procedure assumes that data collection has not yet started whereas the second procedure can be used when data files already exist. Successful implementation of the two approaches into the European project ACuteTox is illustrated. Copyright © 2012 Elsevier Ltd. All rights reserved.
Statistical processing of experimental data

OpenAIRE

NAVRÁTIL, Pavel

2012-01-01

This thesis covers the theory of probability and statistical sets. It contains solved and unsolved problems on probability, random variables and their distributions, random vectors, statistical sets, and regression and correlation analysis. The unsolved problems are accompanied by solutions.
Statistical yearbook. 1998. Data available as of 30 November 2000. 45 ed

International Nuclear Information System (INIS)

2001-01-01

This is the forty-fifth issue of the United Nations Statistical Yearbook, prepared by the Statistics Division, Department of Economic and Social Affairs of the United Nations Secretariat, since 1948. The present issue contains series covering, in general, 1989-1998 or 1990-1999, using statistics available to the Statistics Division up to 30 November 2000. The Yearbook is based on data compiled by the Statistics Division from over 40 different international and national sources. These include the United Nations Statistics Division in the fields of national accounts, industry, energy, transport and international trade; the United Nations Statistics Division and Population Division in the field of demographic statistics; and data provided by over 20 offices of the United Nations system and international organizations in other specialized fields. United Nations agencies and other international organizations which furnished data are listed under 'Statistical sources and references' at the end of the Yearbook. Acknowledgement is gratefully made for their generous cooperation in providing data. The Statistics Division also publishes the Monthly Bulletin of Statistics, which provides a valuable complement to the Yearbook covering current international economic statistics for most countries and areas of the world and quarterly world and regional aggregates. Subscribers to the Monthly Bulletin of Statistics may also access the Bulletin on-line via the World Wide Web on Internet. MBS On-line allows time-sensitive statistics to reach users much faster than the traditional print publication. For further information see . The present issue of the Yearbook reflects a phased programme of major changes in its organization and presentation undertaken in 1990 which until then was relatively unchanged since the first issue was released in 1948. One result of this process has been to reduce the total number of tables from 140 in the 37th issue to 80 in the present issue and to include
National Vital Statistics System (NVSS) - National Cardiovascular Disease Surveillance Data

Data.gov (United States)

U.S. Department of Health & Human Services — 2000 forward. NVSS is a secure, web-based data management system that collects and disseminates the Nation's official vital statistics. Indicators from this data...
Statistical methods of combining information: Applications to sensor data fusion

Energy Technology Data Exchange (ETDEWEB)

Burr, T.

1996-12-31

This paper reviews some statistical approaches to combining information from multiple sources. Promising new approaches will be described, and potential applications to combining not-so-different data sources such as sensor data will be discussed. Experiences with one real data set are described.
Development of computer-assisted instruction application for statistical data analysis android platform as learning resource

Science.gov (United States)

Hendikawati, P.; Arifudin, R.; Zahid, M. Z.

2018-03-01

This study aims to design an Android Statistics Data Analysis application that can be accessed through mobile devices, making it easier for users to access. The application covers various topics in basic statistics along with parametric statistical data analysis. The output of the application is a parametric statistical data analysis that can be used by students, lecturers, and other users who need the results of statistical calculations quickly and in an easily understood form. The Android application is developed in the Java programming language. The server side uses PHP with the CodeIgniter framework, and the database is MySQL. The system development methodology is the Waterfall methodology, with the stages of analysis, design, coding, testing, implementation, and system maintenance. This statistical data analysis application is expected to support statistics lecturing activities and make it easier for students to understand statistical analysis on mobile devices.
Some statistical properties of gene expression clustering for array data

DEFF Research Database (Denmark)

Abreu, G C G; Pinheiro, A; Drummond, R D

2010-01-01

DNA array data without a corresponding statistical error measure. We propose an easy-to-implement and simple-to-use technique that uses bootstrap re-sampling to evaluate the statistical error of the nodes provided by SOM-based clustering. Comparisons between SOM and parametric clustering are presented...... for simulated as well as for two real data sets. We also implement a bootstrap-based pre-processing procedure for SOM, that improves the false discovery ratio of differentially expressed genes. Code in Matlab is freely available, as well as some supplementary material, at the following address: https...
Procedure for statistical analysis of one-parameter discrepant experimental data

International Nuclear Information System (INIS)

Badikov, Sergey A.; Chechev, Valery P.

2012-01-01

A new, Mandel–Paule-type procedure for statistical processing of one-parameter discrepant experimental data is described. The procedure enables one to estimate the contribution of unrecognized experimental errors to the total experimental uncertainty as well as to include it in analysis. A definition of discrepant experimental data for an arbitrary number of measurements is introduced as an accompanying result. In the case of negligible unrecognized experimental errors, the procedure simply reduces to the calculation of the weighted average and its internal uncertainty. The procedure was applied to the statistical analysis of half-life experimental data: mean half-lives for 20 actinides were calculated and results were compared to the ENSDF and DDEP evaluations. On the whole, the calculated half-lives are consistent with the ENSDF and DDEP evaluations. However, the uncertainties calculated in this work essentially exceed the ENSDF and DDEP evaluations for discrepant experimental data. This effect can be explained by adequately taking into account unrecognized experimental errors. - Highlights: ► A new statistical procedure for processing one-parametric discrepant experimental data has been presented. ► Procedure estimates a contribution of unrecognized errors in the total experimental uncertainty. ► Procedure was applied for processing half-life discrepant experimental data. ► Results of the calculations are compared to the ENSDF and DDEP evaluations.
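The limiting case named in the abstract, the weighted average and its internal uncertainty, can be sketched as follows. This shows only the standard inverse-variance weighting. The Mandel–Paule-type step that estimates an additional variance for unrecognized errors is not reproduced, and the function name and the sample values are hypothetical.

```python
import math


def weighted_average(values, sigmas):
    """Inverse-variance weighted mean and its internal uncertainty.

    values: measured results; sigmas: their stated standard uncertainties.
    """
    if len(values) != len(sigmas) or not values:
        raise ValueError("need matching, non-empty values and sigmas")
    weights = [1.0 / s ** 2 for s in sigmas]
    weight_sum = sum(weights)
    mean = sum(w * v for w, v in zip(weights, values)) / weight_sum
    # Internal uncertainty depends only on the stated uncertainties,
    # not on how much the measured values scatter.
    internal = 1.0 / math.sqrt(weight_sum)
    return mean, internal


if __name__ == "__main__":
    # Hypothetical half-life measurements (arbitrary units).
    print(weighted_average([10.0, 20.0], [1.0, 2.0]))
```

Because the internal uncertainty ignores the scatter of the values themselves, it can understate the total uncertainty when the data are discrepant, which is the situation the procedure in the abstract is designed for.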
AutoBayes: A System for Generating Data Analysis Programs from Statistical Models

OpenAIRE

Fischer, Bernd; Schumann, Johann

2003-01-01

Data analysis is an important scientific task which is required whenever information needs to be extracted from raw data. Statistical approaches to data analysis, which use methods from probability theory and numerical analysis, are well-founded but difficult to implement: the development of a statistical data analysis program for any given application is time-consuming and requires substantial knowledge and experience in several areas. In this paper, we describe AutoBayes, a program synthesis...
A statistical test for outlier identification in data envelopment analysis

Directory of Open Access Journals (Sweden)

Morteza Khodabin

2010-09-01

In the use of peer group data to assess individual, typical or best practice performance, the effective detection of outliers is critical for achieving useful results. In these “deterministic” frontier models, statistical theory is now mostly available. This paper deals with the statistical pared sample method and its capability of detecting outliers in data envelopment analysis. In the presented method, each observation is deleted from the sample once and the resulting linear program is solved, leading to a distribution of efficiency estimates. Based on the achieved distribution, a pared test is designed to identify the potential outlier(s). We illustrate the method through a real data set. The method could be used in a first step, as an exploratory data analysis, before using any frontier estimation.
Association testing for next-generation sequencing data using score statistics

DEFF Research Database (Denmark)

Skotte, Line; Korneliussen, Thorfinn Sand; Albrechtsen, Anders

2012-01-01

computationally feasible due to the use of score statistics. As part of the joint likelihood, we model the distribution of the phenotypes using a generalized linear model framework, which works for both quantitative and discrete phenotypes. Thus, the method presented here is applicable to case-control studies...... of genotype calls into account have been proposed; most require numerical optimization which for large-scale data is not always computationally feasible. We show that using a score statistic for the joint likelihood of observed phenotypes and observed sequencing data provides an attractive approach...... to association testing for next-generation sequencing data. The joint model accounts for the genotype classification uncertainty via the posterior probabilities of the genotypes given the observed sequencing data, which gives the approach higher power than methods based on called genotypes. This strategy remains...
Statistical methods and computing for big data

Science.gov (United States)

Wang, Chun; Chen, Ming-Hui; Schifano, Elizabeth; Wu, Jing; Yan, Jun

2016-01-01

Big data are data on a massive scale in terms of volume, intensity, and complexity that exceed the capacity of standard analytic tools. They present opportunities as well as challenges to statisticians. The role of computational statisticians in scientific discovery from big data analyses has been under-recognized even by peer statisticians. This article summarizes recent methodological and software developments in statistics that address the big data challenges. Methodologies are grouped into three classes: subsampling-based, divide and conquer, and online updating for stream data. As a new contribution, the online updating approach is extended to variable selection with commonly used criteria, and their performances are assessed in a simulation study with stream data. Software packages are summarized with focuses on the open source R and R packages, covering recent tools that help break the barriers of computer memory and computing power. Some of the tools are illustrated in a case study with a logistic regression for the chance of airline delay. PMID:27695593
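To make the idea of online updating for stream data concrete, here is a minimal Python sketch of its simplest instance: a straight-line least squares fit that keeps only cumulative sums, so each block of the stream can be discarded after it is processed. This is an assumption-laden illustration, not the method of the article, which covers estimating-equation updates and variable selection. The class name `OnlineSimpleRegression` is hypothetical.

```python
class OnlineSimpleRegression:
    """Least squares fit of y = intercept + slope * x from streamed blocks.

    Only five running sums are stored, so memory use does not grow with the
    length of the stream.
    """

    def __init__(self):
        self.n = 0
        self.sum_x = 0.0
        self.sum_y = 0.0
        self.sum_xx = 0.0
        self.sum_xy = 0.0

    def update(self, xs, ys):
        """Fold one block of the stream into the running sums."""
        for x, y in zip(xs, ys):
            self.n += 1
            self.sum_x += x
            self.sum_y += y
            self.sum_xx += x * x
            self.sum_xy += x * y

    def coefficients(self):
        """Return (intercept, slope) based on all blocks seen so far."""
        denom = self.n * self.sum_xx - self.sum_x ** 2
        if denom == 0:
            raise ValueError("need at least two distinct x values")
        slope = (self.n * self.sum_xy - self.sum_x * self.sum_y) / denom
        intercept = (self.sum_y - slope * self.sum_x) / self.n
        return intercept, slope
```

Calling `update` once per incoming block and `coefficients` whenever an estimate is needed gives the same fit as a single pass over the pooled data, which is the property that makes this kind of updating attractive when the full data set does not fit in memory.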
Statistical data for the tensile properties of natural fibre composites

Directory of Open Access Journals (Sweden)

J.P. Torres

2017-06-01

This article features a large statistical database on the tensile properties of natural fibre reinforced composite laminates. The data presented here correspond to a comprehensive experimental testing program of several composite systems including: different material constituents (epoxy and vinyl ester resins; flax, jute and carbon fibres), different fibre configurations (short-fibre mats, unidirectional, and plain, twill and satin woven fabrics) and different fibre orientations (0°, 90°, and [0,90] angle plies). For each material, ~50 specimens were tested under uniaxial tensile loading. Here, we provide the complete set of stress–strain curves together with the statistical distributions of their calculated elastic modulus, strength and failure strain. The data are also provided as support material for the research article: “The mechanical properties of natural fibre composite laminates: A statistical study” [1].
Radar Derived Spatial Statistics of Summer Rain. Volume 2; Data Reduction and Analysis

Science.gov (United States)

Konrad, T. G.; Kropfli, R. A.

1975-01-01

Data reduction and analysis procedures are discussed along with the physical and statistical descriptors used. The statistical modeling techniques are outlined and examples of the derived statistical characterization of rain cells in terms of the several physical descriptors are presented. Recommendations concerning analyses which can be pursued using the data base collected during the experiment are included.
A method for statistical comparison of data sets and its uses in analysis of nuclear physics data

International Nuclear Information System (INIS)

Bityukov, S.I.; Smirnova, V.V.; Krasnikov, N.V.; Maksimushkina, A.V.; Nikitenko, A.N.

2014-01-01

The authors propose a method for the statistical comparison of two data sets, based on the method of statistical comparison of histograms. As an estimator of the quality of the decision made, they propose a value that can be called the probability that the decision (that the data sets are different) is correct.
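A rough Python sketch of one way a bin-by-bin histogram comparison can be set up is given below. It rests on assumptions that the abstract does not confirm: independent Poisson bin contents, equal normalisation of the two histograms, and a per-bin significance (n1 - n2) / sqrt(n1 + n2). The function names are hypothetical, and the sketch does not compute the decision-quality probability mentioned in the abstract. Under these assumptions, if both histograms come from the same distribution the significances behave roughly like standard normal values, so their mean should be near 0 and their RMS near 1.

```python
import math


def bin_significances(hist1, hist2):
    """Per-bin significance of the difference between two histograms.

    Assumes independent Poisson bin contents and equal normalisation.
    Bins that are empty in both histograms carry no information and are skipped.
    """
    result = []
    for n1, n2 in zip(hist1, hist2):
        if n1 + n2 > 0:
            result.append((n1 - n2) / math.sqrt(n1 + n2))
    return result


def mean_and_rms(values):
    """Mean and root-mean-square spread about the mean of a list of numbers."""
    if not values:
        raise ValueError("no non-empty bins to compare")
    mean = sum(values) / len(values)
    rms = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return mean, rms
```

A mean far from 0 points to an overall shift between the data sets, while an RMS well above 1 points to bin-to-bin differences larger than counting fluctuations would explain.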

Statistical yearbook. 1995 Data available as of 30 June 1997. 42. ed.

International Nuclear Information System (INIS)

1997-01-01

This is the forty-second issue of the United Nations Statistical Yearbook, prepared by the Statistics Division, Department of Economic and Social Affairs of the United Nations Secretariat, since 1948. The present issue contains series covering, in general, 1985-1994 or 1986-1995, using statistics available to the Statistics Division up to 30 June 1997. The Yearbook is based on data compiled by the Statistics Division from over 40 different international and national sources
Statistical yearbook. 1996. Data available as of 30 September 1998. 43 ed.

International Nuclear Information System (INIS)

1999-01-01

This is the forty-third issue of the United Nations Statistical Yearbook, prepared by the Statistics Division, Department of Economic and Social Affairs of the United Nations Secretariat, since 1948. The present issue contains series covering, in general, 1986-1995 or 1987-1996, using statistics available to the Statistics Division up to 30 September 1998. The Yearbook is based on data compiled by the Statistics Division from over 40 different international and national sources
Toward Global Comparability of Sexual Orientation Data in Official Statistics: A Conceptual Framework of Sexual Orientation for Health Data Collection in New Zealand’s Official Statistics System

Directory of Open Access Journals (Sweden)

Frank Pega

2013-01-01

Objective. Effectively addressing health disparities experienced by sexual minority populations requires high-quality official data on sexual orientation. We developed a conceptual framework of sexual orientation to improve the quality of sexual orientation data in New Zealand’s Official Statistics System. Methods. We reviewed conceptual and methodological literature, culminating in a draft framework. To improve the framework, we held focus groups and key-informant interviews with sexual minority stakeholders and producers and consumers of official statistics. An advisory board of experts provided additional guidance. Results. The framework proposes working definitions of the sexual orientation topic and measurement concepts, describes dimensions of the measurement concepts, discusses variables framing the measurement concepts, and outlines conceptual grey areas. Conclusion. The framework proposes standard definitions and concepts for the collection of official sexual orientation data in New Zealand. It presents a model for producers of official statistics in other countries who wish to improve the quality of health data on their citizens.
Reducing bias in the analysis of counting statistics data

International Nuclear Information System (INIS)

Hammersley, A.P.; Antoniadis, A.

1997-01-01

In the analysis of counting statistics data it is common practice to estimate the variance of the measured data points as the data points themselves. This practice introduces a bias into the results of further analysis which may be significant, and under certain circumstances lead to false conclusions. In the case of normal weighted least squares fitting this bias is quantified and methods to avoid it are proposed. (orig.)
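The bias is easy to see in the simplest weighted least squares fit, a constant fitted to repeated counts. The Python sketch below is an illustration under stated assumptions, not the authors' analysis: the function names and the three sample counts are hypothetical. With the variance of each point set equal to the count itself, the weights are 1/n_i and the fitted constant becomes the harmonic mean of the counts, which is never larger than the arithmetic mean because low counts receive the largest weights.

```python
def weighted_mean_data_variance(counts):
    """Constant fitted by weighted least squares with variance_i = count_i.

    Zero counts must be excluded beforehand, because their weight 1/0 is
    undefined; this is one of the practical problems of the approach.
    """
    weights = [1.0 / n for n in counts]
    return sum(w * n for w, n in zip(weights, counts)) / sum(weights)


def unweighted_mean(counts):
    """Arithmetic mean, the maximum likelihood estimate of a constant Poisson rate."""
    return sum(counts) / len(counts)
```

For the counts 90, 100 and 110 the weighted result is 29700/299, roughly 99.33, against an arithmetic mean of 100; the downward pull grows as the counts get smaller.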
Quick Access: Find Statistical Data on the Internet.

Science.gov (United States)

Su, Di

1999-01-01

Provides an annotated list of Internet sources (World Wide Web, ftp, and gopher sites) for current and historical statistical business data, including selected interest rates, the Consumer Price Index, the Producer Price Index, foreign currency exchange rates, noon buying rates, per diem rates, the special drawing right, stock quotes, and mutual…
Statistical yearbook 1993. Data available as of 31 December 1994. 40 ed.

International Nuclear Information System (INIS)

1995-01-01

This is the fortieth issue of the United Nations Statistical Yearbook, prepared by the Statistical Division, Department for Economic and Social Information and Policy Analysis of the United Nations Secretariat, since 1948. The present issue contains series covering, in general, 1983-1992 or 1984-1993, using statistics available to the Statistical Division up to 31 December 1994. The Yearbook is based on data compiled by the Statistical Division from over 40 different international and national sources
Statistical yearbook 1994. Data available as of 31 March 1996. 41 ed.

International Nuclear Information System (INIS)

1996-01-01

This is the forty-first issue of the United Nations Statistical Yearbook, prepared by the Statistics Division, Department for Economic and Social Information and Policy Analysis of the United Nations Secretariat, since 1948. The present issue contains series covering, in general, 1984-1993 or 1985-1994, using statistics available to the Statistics Division up to 31 December 1995. The Yearbook is based on data compiled by the Statistics Division from over 40 different international and national sources
Bayesian statistics applied to neutron activation data for reactor flux spectrum analysis

International Nuclear Information System (INIS)

Chiesa, Davide; Previtali, Ezio; Sisti, Monica

2014-01-01

Highlights: • Bayesian statistics to analyze the neutron flux spectrum from activation data. • Rigorous statistical approach for accurate evaluation of the neutron flux groups. • Cross section and activation data uncertainties included for the problem solution. • Flexible methodology applied to analyze different nuclear reactor flux spectra. • The results are in good agreement with the MCNP simulations of neutron fluxes. - Abstract: In this paper, we present a statistical method, based on Bayesian statistics, to analyze the neutron flux spectrum from the activation data of different isotopes. The experimental data were acquired during a neutron activation experiment performed at the TRIGA Mark II reactor of Pavia University (Italy) in four irradiation positions characterized by different neutron spectra. In order to evaluate the neutron flux spectrum, subdivided into energy groups, a system of linear equations containing the group effective cross sections and the activation rate data has to be solved. However, since the system’s coefficients are experimental data affected by uncertainties, a rigorous statistical approach is fundamental for an accurate evaluation of the neutron flux groups. For this purpose, we applied Bayesian statistical analysis, which allows the uncertainties of the coefficients and the a priori information about the neutron flux to be included. A program for the analysis of Bayesian hierarchical models, based on Markov Chain Monte Carlo (MCMC) simulations, was used to define the statistical model of the problem and solve it. The first analysis involved the determination of the thermal, resonance-intermediate and fast flux components, and the dependence of the results on the choice of prior distribution was investigated to confirm the reliability of the Bayesian analysis. After that, the main resonances of the activation cross sections were analyzed to implement multi-group models with finer energy subdivisions that would make it possible to determine the
Statistical data on butane and kerosene in West Africa

International Nuclear Information System (INIS)

Masse, R.

1990-01-01

This book gives statistical, technical and economic information on butane and kerosene used in West Africa in 1990. The first part gives information on gas and gas use: market, energy efficiency, performance, safety, distribution, storage, transport and commercialization. Statistical data on petroleum and natural gas production and consumption are also described, and natural gas and petroleum reserves in Africa are studied. In the second part, thirty country entries give an economic analysis of each African country. 21 figs., 19 tabs., 5 maps
Statistical analysis of hydrologic data for Yucca Mountain

International Nuclear Information System (INIS)

Rutherford, B.M.; Hall, I.J.; Peters, R.R.; Easterling, R.G.; Klavetter, E.A.

1992-02-01

The geologic formations in the unsaturated zone at Yucca Mountain are currently being studied as the host rock for a potential radioactive waste repository. Data from several drill holes have been collected to provide the preliminary information needed for planning site characterization for the Yucca Mountain Project. Hydrologic properties have been measured on the core samples and the variables analyzed here are thought to be important in the determination of groundwater travel times. This report presents a statistical analysis of four hydrologic variables: saturated-matrix hydraulic conductivity, maximum moisture content, suction head, and calculated groundwater travel time. It is important to modelers to have as much information about the distribution of values of these variables as can be obtained from the data. The approach taken in this investigation is to (1) identify regions at the Yucca Mountain site that, according to the data, are distinctly different; (2) estimate the means and variances within these regions; (3) examine the relationships among the variables; and (4) investigate alternative statistical methods that might be applicable when more data become available. The five different functional stratigraphic units at three different locations are compared and grouped into relatively homogeneous regions. Within these regions, the expected values and variances associated with core samples of different sizes are estimated. The results provide a rough estimate of the distribution of hydrologic variables for small core sections within each region
The application of bayesian statistic in data fit processing

International Nuclear Information System (INIS)

Guan Xingyin; Li Zhenfu; Song Zhaohui

2010-01-01

The rationale and disadvantages of least squares fitting, which is commonly used in data processing, are analyzed, and the theory and common methods of applying Bayesian statistics to data processing are presented in detail. The analysis shows that the Bayesian approach avoids the restrictive hypotheses that least squares fitting requires, that its results are more scientific and more easily understood, and that it may replace least squares fitting in data processing. (authors)
The German Birth Order Register - order-specific data generated from perinatal statistics and statistics on out-of-hospital births 2001-2008

OpenAIRE

Michaela Kreyenfeld; Rembrandt D. Scholz; Frederik Peters; Ines Wlosnewski

2010-01-01

Until 2008, Germany’s vital statistics did not include information on the biological order of each birth. This resulted in a dearth of important demographic indicators, such as the mean age at first birth and the level of childlessness. Researchers have tried to fill this gap by generating order-specific birth rates from survey data, and by combining survey data with vital statistics. This paper takes a different approach by using hospital statistics on births to generate birth order-specific...
75 FR 24718 - Guidance for Industry on Documenting Statistical Analysis Programs and Data Files; Availability

Science.gov (United States)

2010-05-05

...] Guidance for Industry on Documenting Statistical Analysis Programs and Data Files; Availability AGENCY... documenting statistical analyses and data files submitted to the Center for Veterinary Medicine (CVM) for the... on Documenting Statistical Analysis Programs and Data Files; Availability'' giving interested persons...
A spatial scan statistic for survival data based on Weibull distribution.

Science.gov (United States)

Bhatt, Vijaya; Tiwari, Neeraj

2014-05-20

The spatial scan statistic has been developed as a geographical cluster detection analysis tool for different types of data sets such as Bernoulli, Poisson, ordinal, normal and exponential. We propose a scan statistic for survival data based on Weibull distribution. It may also be used for other survival distributions, such as exponential, gamma, and log normal. The proposed method is applied on the survival data of tuberculosis patients for the years 2004-2005 in Nainital district of Uttarakhand, India. Simulation studies reveal that the proposed method performs well for different survival distribution functions. Copyright © 2013 John Wiley & Sons, Ltd.
Special study for the statistical evaluation of groundwater data trends. Final report

International Nuclear Information System (INIS)

1993-05-01

Analysis of trends over time in the concentrations of chemicals in groundwater at Uranium Mill Tailings Remedial Action (UMTRA) Project sites can provide valuable information for monitoring the performance of disposal cells and the effectiveness of groundwater restoration activities. Random variation in data may obscure real trends or may produce the illusion of a trend where none exists, so statistical methods are needed to reliably detect and estimate trends. Trend analysis includes both trend detection and estimation. Trend detection uses statistical hypothesis testing and provides a yes or no answer regarding the existence of a trend. Hypothesis tests try to reach a balance between false negative and false positive conclusions. To quantify the magnitude of a trend, estimation is required. This report presents the statistical concepts that are necessary for understanding trend analysis. The types of patterns most likely to occur in UMTRA data sets are emphasized. Two general approaches to analyzing data for trends are proposed and recommendations are given to assist UMTRA Project staff in selecting an appropriate method for their site data. Trend analysis is much more difficult when data contain values less than the reported laboratory detection limit. The complications that arise are explained. This report also discusses the impact of data collection procedures on statistical trend methods and offers recommendations to improve the efficiency of the methods and reduce sampling costs. Guidance for determining how many sampling rounds might be needed by statistical methods to detect trends of various magnitudes is presented. This information could be useful in planning site monitoring activities
Introduction to statistics and data analysis with exercises, solutions and applications in R

CERN Document Server

Heumann, Christian; Shalabh

2016-01-01

This introductory statistics textbook conveys the essential concepts and tools needed to develop and nurture statistical thinking. It presents descriptive, inductive and explorative statistical methods and guides the reader through the process of quantitative data analysis. In the experimental sciences and interdisciplinary research, data analysis has become an integral part of any scientific study. Issues such as judging the credibility of data, analyzing the data, evaluating the reliability of the obtained results and finally drawing the correct and appropriate conclusions from the results are vital. The text is primarily intended for undergraduate students in disciplines like business administration, the social sciences, medicine, politics, macroeconomics, etc. It features a wealth of examples, exercises and solutions with computer code in the statistical programming language R as well as supplementary material that will enable the reader to quickly adapt all methods to their own applications.
Computer processing of 14C data; statistical tests and corrections of data

International Nuclear Information System (INIS)

Obelic, B.; Planinic, J.

1977-01-01

The described computer program calculates the age of samples and performs statistical tests and corrections of data. Data are obtained from the proportional counter, which measures anticoincident pulses per 20-minute interval. After every 9th interval the counter measures the total number of counts per interval. Input data are punched on cards. The output list contains the input data schedule and the following results: mean CPM value, correction of CPM for normal pressure and temperature (NTP), sample age calculation based on 14C half-lives of 5570 and 5730 years, age correction for NTP, dendrochronological corrections and the relative radiocarbon concentration. All results are given with one standard deviation. An input data test (Chauvenet's criterion), a gas purity test, a standard deviation test and a test of the data processor are also included in the program. (author)
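Two of the steps named in the abstract can be sketched in Python under stated assumptions. The age follows from the decay law as t = (T_half / ln 2) * ln(A0 / A), where A and A0 are taken here to be net, background-corrected count rates of the sample and of the modern standard. Chauvenet's criterion rejects a reading when the expected number of equally extreme readings in a sample of size n is below 0.5. The function names and sample numbers are hypothetical, and the original program's exact formulas and corrections are not given in the abstract.

```python
import math


def radiocarbon_age(sample_cpm, modern_cpm, half_life_years):
    """Age in years from the decay law, using net count rates (an assumption)."""
    return half_life_years / math.log(2.0) * math.log(modern_cpm / sample_cpm)


def chauvenet_keep(values):
    """Return one flag per value: True to keep, False to reject.

    A value is rejected when n * P(|Z| >= z) < 0.5, with z its distance from
    the sample mean in units of the sample standard deviation.
    """
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    if sd == 0.0:
        return [True] * n  # identical readings: nothing to reject
    flags = []
    for x in values:
        z = abs(x - mean) / sd
        tail_probability = math.erfc(z / math.sqrt(2.0))  # two-sided normal tail
        flags.append(n * tail_probability >= 0.5)
    return flags
```

A sample counting at half the modern rate gives an age equal to whichever half-life is supplied, so the same measurement yields 5570 or 5730 years depending on the convention chosen.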
Statistical distributions as applied to environmental surveillance data

International Nuclear Information System (INIS)

Speer, D.R.; Waite, D.A.

1976-01-01

Application of normal, lognormal, and Weibull distributions to radiological environmental surveillance data was investigated for approximately 300 nuclide-medium-year-location combinations. The fit of data to distributions was compared through probability plotting (special graph paper provides a visual check) and W test calculations. Results show that 25% of the data fit the normal distribution, 50% fit the lognormal, and 90% fit the Weibull. Demonstration of how to plot each distribution shows that the normal and lognormal distributions are comparatively easy to use, while the Weibull distribution is complicated and difficult to use. Although current practice is to use normal distribution statistics, the normal distribution fit the fewest data groups considered in this study
Statistical Physics in the Era of Big Data

Science.gov (United States)

Wang, Dashun

2013-01-01

With the wealth of data provided by a wide range of high-throughput measurement tools and technologies, statistical physics of complex systems is entering a new phase, impacting in a meaningful fashion a wide range of fields, from cell biology to computer science to economics. In this dissertation, by applying tools and techniques developed in…
On the statistical comparison of climate model output and climate data

International Nuclear Information System (INIS)

Solow, A.R.

1991-01-01

Some broad issues arising in the statistical comparison of the output of climate models with the corresponding climate data are reviewed. Particular attention is paid to the question of detecting climate change. The purpose of this paper is to review some statistical approaches to the comparison of the output of climate models with climate data. There are many statistical issues arising in such a comparison. The author will focus on some of the broader issues, although some specific methodological questions will arise along the way. One important potential application of the approaches discussed in this paper is the detection of climate change. Although much of the discussion will be fairly general, he will try to point out the appropriate connections to the detection question. 9 refs

Strategies for improving utilization of computerized statistical data by the social science community.

OpenAIRE

Robbin, Alice

1981-01-01

In recent decades there has been a notable expansion of statistical data produced by the public and private sectors for administrative, research, policy and evaluation programs. This is due to advances in relatively inexpensive and efficient data collection and management of computer-readable statistical data. Corresponding changes have not occurred in the management of data collection, preservation, description and dissemination. As a result, the process by which data become accessible to so...
Statistical mechanics of learning: A variational approach for real data

International Nuclear Information System (INIS)

Malzahn, Doerthe; Opper, Manfred

2002-01-01

Using a variational technique, we generalize the statistical physics approach of learning from random examples to make it applicable to real data. We demonstrate the validity and relevance of our method by computing approximate estimators for generalization errors that are based on training data alone
Statistical Challenges of Big Data Analysis in Medicine

Czech Academy of Sciences Publication Activity Database

Kalina, Jan

2015-01-01

Vol. 3, No. 1 (2015), pp. 24-27 ISSN 1805-8698 R&D Projects: GA ČR GA13-23940S Grant - others: CESNET Development Fund (CZ) 494/2013 Institutional support: RVO:67985807 Keywords: big data * variable selection * classification * cluster analysis Subject RIV: BB - Applied Statistics, Operational Research http://www.ijbh.org/ijbh2015-1.pdf
On the statistical assessment of classifiers using DNA microarray data

Directory of Open Access Journals (Sweden)

Carella M

2006-08-01

Background: In this paper we present a method for the statistical assessment of cancer predictors which make use of gene expression profiles. The methodology is applied to a new data set of microarray gene expression data collected in Casa Sollievo della Sofferenza Hospital, Foggia, Italy. The data set is made up of normal (22) and tumor (25) specimens extracted from 25 patients affected by colon cancer. We propose to give answers to some questions which are relevant for the automatic diagnosis of cancer, such as: Is the size of the available data set sufficient to build accurate classifiers? What is the statistical significance of the associated error rates? In what ways can accuracy be considered dependent on the adopted classification scheme? How many genes are correlated with the pathology and how many are sufficient for an accurate colon cancer classification? The method we propose answers these questions whilst avoiding the potential pitfalls hidden in the analysis and interpretation of microarray data. Results: We estimate the generalization error, evaluated through the Leave-K-Out Cross Validation error, for three different classification schemes by varying the number of training examples and the number of the genes used. The statistical significance of the error rate is measured by using a permutation test. We provide a statistical analysis in terms of the frequencies of the genes involved in the classification. Using the whole set of genes, we found that the Weighted Voting Algorithm (WVA) classifier learns the distinction between normal and tumor specimens with 25 training examples, providing e = 21% (p = 0.045) as an error rate. This remains constant even when the number of examples increases. Moreover, Regularized Least Squares (RLS) and Support Vector Machines (SVM) classifiers can learn with only 15 training examples, with an error rate of e = 19% (p = 0.035) and e = 18% (p = 0.037) respectively. Moreover, the error rate
Statistical analysis of field data for aircraft warranties

Science.gov (United States)

Lakey, Mary J.

Air Force and Navy maintenance data collection systems were researched to determine their scientific applicability to the warranty process. New and unique algorithms were developed to extract failure distributions, which were then used to characterize how selected families of equipment typically fail. Families of similar equipment were identified in terms of function, technology and failure patterns. Statistical analyses and applications such as goodness-of-fit tests, maximum likelihood estimation and derivation of confidence intervals for the probability density function parameters were applied to characterize the distributions and their failure patterns. Statistical and reliability theory, with relevance to equipment design and operational failures, were also determining factors in characterizing the failure patterns of the equipment families. Inferences about the families with relevance to warranty needs were then made.
Uncertainty analysis of reactor safety systems with statistically correlated failure data

International Nuclear Information System (INIS)

Dezfuli, H.; Modarres, M.

1985-01-01

The probability of occurrence of the top event of a fault tree is estimated from failure probability of components that constitute the fault tree. Component failure probabilities are subject to statistical uncertainties. In addition, there are cases where the failure data are statistically correlated. Most fault tree evaluations have so far been based on uncorrelated component failure data. The subject of this paper is the description of a method of assessing the probability intervals for the top event failure probability of fault trees when component failure data are statistically correlated. To estimate the mean and variance of the top event, a second-order system moment method is presented through Taylor series expansion, which provides an alternative to the normally used Monte-Carlo method. For cases where component failure probabilities are statistically correlated, the Taylor expansion terms are treated properly. A moment matching technique is used to obtain the probability distribution function of the top event through fitting a Johnson S_B distribution. The computer program (CORRELATE) was developed to perform the calculations necessary for the implementation of the method developed. The CORRELATE code is very efficient and consumes minimal computer time. This is primarily because it does not employ the time-consuming Monte-Carlo method. (author)
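As an illustration of the Taylor-expansion idea in this abstract, the following is a minimal Python sketch for the simplest possible case: an OR gate over two components whose failure probabilities are correlated. It is not the CORRELATE program; the function name and the numbers are hypothetical, and the variance is kept to first-order terms only, which is a simplification of the second-order method the abstract describes.

```python
def or_gate_moments(mu1, mu2, var1, var2, cov12):
    """Approximate mean and variance of T = p1 + p2 - p1*p2.

    mu1, mu2   : mean failure probabilities of the two components
    var1, var2 : their variances
    cov12      : their covariance (0.0 for uncorrelated data)
    """
    # First derivatives of T with respect to p1 and p2, taken at the means.
    d1 = 1.0 - mu2
    d2 = 1.0 - mu1
    # The only non-zero second derivative is the mixed one, equal to -1,
    # so the second-order correction to the mean is simply -cov12.
    mean = (mu1 + mu2 - mu1 * mu2) - cov12
    # First-order variance, including the cross term for correlated data.
    var = d1 * d1 * var1 + d2 * d2 * var2 + 2.0 * d1 * d2 * cov12
    return mean, var


# Hypothetical inputs: two components with a correlation of 0.5.
mean_corr, var_corr = or_gate_moments(0.01, 0.02, 1e-6, 4e-6, 1e-6)
mean_ind, var_ind = or_gate_moments(0.01, 0.02, 1e-6, 4e-6, 0.0)
```

With these inputs the positive covariance raises the top-event variance relative to the uncorrelated case, which is why ignoring correlation can understate the width of the probability interval.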
Theoretical, analytical, and statistical interpretation of environmental data

International Nuclear Information System (INIS)

Lombard, S.M.

1974-01-01

The reliability of data from radiochemical analyses of environmental samples cannot be determined from nuclear counting statistics alone. The rigorous application of the principles of propagation of errors, an understanding of the physics and chemistry of the species of interest in the environment, and the application of information from research on the analytical procedure are all necessary for a valid estimation of the errors associated with analytical results. The specific case of the determination of plutonium in soil is considered in terms of analytical problems and data reliability. (U.S.)
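The point about counting statistics being only one error source can be sketched with standard propagation of errors. The Python below is an illustrative sketch, not the author's procedure: the function names are hypothetical, and the 5% chemical-yield and 10% sampling uncertainties are assumed values chosen only for the example.

```python
import math


def net_count_rate(gross_counts, t_gross, bkg_counts, t_bkg):
    """Net count rate and its standard deviation from Poisson counting."""
    rate = gross_counts / t_gross - bkg_counts / t_bkg
    # Variance of a Poisson count equals the count itself.
    sigma = math.sqrt(gross_counts / t_gross ** 2 + bkg_counts / t_bkg ** 2)
    return rate, sigma


def combine_relative(*relative_sigmas):
    """Combine independent relative uncertainties in quadrature."""
    return math.sqrt(sum(r * r for r in relative_sigmas))


# Hypothetical measurement: 400 gross and 100 background counts, 100 s each.
rate, sigma = net_count_rate(400, 100.0, 100, 100.0)
counting_only = sigma / rate
# Assumed extra contributions: 5% chemical yield, 10% sampling variability.
total = combine_relative(counting_only, 0.05, 0.10)
```

In this example the combined relative uncertainty is noticeably larger than the counting term alone, in line with the abstract's argument.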
Statistics in experimental design, preprocessing, and analysis of proteomics data.

Science.gov (United States)

Jung, Klaus

2011-01-01

High-throughput experiments in proteomics, such as 2-dimensional gel electrophoresis (2-DE) and mass spectrometry (MS), usually yield high-dimensional data sets of expression values for hundreds or thousands of proteins which are, however, observed on only a relatively small number of biological samples. Statistical methods for the planning and analysis of experiments are important to avoid false conclusions and to obtain tenable results. In this chapter, the most frequent experimental designs for proteomics experiments are illustrated. In particular, focus is put on studies for the detection of differentially regulated proteins. Furthermore, issues of sample size planning, statistical analysis of expression levels as well as methods for data preprocessing are covered.
Data analysis using the Gnu R system for statistical computation

Energy Technology Data Exchange (ETDEWEB)

Simone, James; /Fermilab

2011-07-01

R is a language system for statistical computation. It is widely used in statistics, bioinformatics, machine learning, data mining, quantitative finance, and the analysis of clinical drug trials. Among the advantages of R are: it has become the standard language for developing statistical techniques, it is being actively developed by a large and growing global user community, it is open source software, it is highly portable (Linux, OS-X and Windows), it has a built-in documentation system, it produces high quality graphics and it is easily extensible with over four thousand extension library packages available covering statistics and applications. This report gives a very brief introduction to R with some examples using lattice QCD simulation results. It then discusses the development of R packages designed for chi-square minimization fits for lattice n-pt correlation functions.
A weighted U-statistic for genetic association analyses of sequencing data.

Science.gov (United States)

Wei, Changshuai; Li, Ming; He, Zihuai; Vsevolozhskaya, Olga; Schaid, Daniel J; Lu, Qing

2014-12-01

With advancements in next-generation sequencing technology, a massive amount of sequencing data is generated, which offers a great opportunity to comprehensively investigate the role of rare variants in the genetic etiology of complex diseases. Nevertheless, the high-dimensional sequencing data poses a great challenge for statistical analysis. The association analyses based on traditional statistical methods suffer substantial power loss because of the low frequency of genetic variants and the extremely high dimensionality of the data. We developed a Weighted U Sequencing test, referred to as WU-SEQ, for the high-dimensional association analysis of sequencing data. Based on a nonparametric U-statistic, WU-SEQ makes no assumption of the underlying disease model and phenotype distribution, and can be applied to a variety of phenotypes. Through simulation studies and an empirical study, we showed that WU-SEQ outperformed a commonly used sequence kernel association test (SKAT) method when the underlying assumptions were violated (e.g., the phenotype followed a heavy-tailed distribution). Even when the assumptions were satisfied, WU-SEQ still attained comparable performance to SKAT. Finally, we applied WU-SEQ to sequencing data from the Dallas Heart Study (DHS), and detected an association between ANGPTL4 and very low density lipoprotein cholesterol. © 2014 WILEY PERIODICALS, INC.
Statistical Redundancy Testing for Improved Gene Selection in Cancer Classification Using Microarray Data

Directory of Open Access Journals (Sweden)

J. Sunil Rao

2007-01-01

In gene selection for cancer classification using microarray data, we define an eigenvalue-ratio statistic to measure a gene’s contribution to the joint discriminability when this gene is included into a set of genes. Based on this eigenvalue-ratio statistic, we define a novel hypothesis test for gene statistical redundancy and propose two gene selection methods. Simulation studies illustrate the agreement between statistical redundancy testing and gene selection methods. Real data examples show the proposed gene selection methods can select a compact gene subset which can not only be used to build high quality cancer classifiers but also show biological relevance.
Statistical analysis of longitudinal quality of life data with missing measurements

NARCIS (Netherlands)

Zwinderman, A. H.

1992-01-01

The statistical analysis of longitudinal quality of life data in the presence of missing data is discussed. In cancer trials missing data are generated due to the fact that patients die, drop out, or are censored. These missing data are problematic in the monitoring of the quality of life during the
Powerful Inference With the D-Statistic on Low-Coverage Whole-Genome Data

DEFF Research Database (Denmark)

Soraggi, Samuele; Wiuf, Carsten; Albrechtsen, Anders

2018-01-01

The detection of ancient gene flow between human populations is an important issue in population genetics. A common tool for detecting ancient admixture events is the D-statistic. The D-statistic is based on the hypothesis of a genetic relationship that involves four populations, whose correctness...... is assessed by evaluating specific coincidences of alleles between the groups. When working with high throughput sequencing data calling genotypes accurately is not always possible, therefore the D-statistic currently samples a single base from the reads of one individual per population. This implies ignoring...... much of the information in the data, an issue especially striking in the case of ancient genomes. We provide a significant improvement to overcome the problems of the D-statistic by considering all reads from multiple individuals in each population. We also apply type-specific error correction...
A scan statistic for continuous data based on the normal probability model

Directory of Open Access Journals (Sweden)

Huang Lan

2009-10-01

Temporal, spatial and space-time scan statistics are commonly used to detect and evaluate the statistical significance of temporal and/or geographical disease clusters, without any prior assumptions on the location, time period or size of those clusters. Scan statistics are mostly used for count data, such as disease incidence or mortality. Sometimes there is an interest in looking for clusters with respect to a continuous variable, such as lead levels in children or low birth weight. For such continuous data, we present a scan statistic where the likelihood is calculated using the normal probability model. It may also be used for other distributions, while still maintaining the correct alpha level. In an application of the new method, we look for geographical clusters of low birth weight in New York City.
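A one-dimensional version of a normal-model scan statistic can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the published implementation: windows are contiguous index ranges rather than geographical circles, a common variance is assumed inside and outside the window, the function names are hypothetical, and the data are synthetic. Significance comes from random permutations, which is what keeps the alpha level valid even if the data are not truly normal.

```python
import math
import random


def normal_llr(values, start, length):
    """Log likelihood ratio for the window values[start:start + length]."""
    n = len(values)
    inside = values[start:start + length]
    outside = values[:start] + values[start + length:]
    mean_all = sum(values) / n
    mean_in = sum(inside) / len(inside)
    mean_out = sum(outside) / len(outside)
    # Maximum likelihood variances under the null and the alternative.
    var_null = sum((x - mean_all) ** 2 for x in values) / n
    var_alt = (sum((x - mean_in) ** 2 for x in inside)
               + sum((x - mean_out) ** 2 for x in outside)) / n
    return 0.5 * n * math.log(var_null / var_alt)


def scan(values, max_len):
    """Return (best LLR, start, length); max_len must be below len(values)."""
    best = (-1.0, 0, 0)
    for length in range(1, max_len + 1):
        for start in range(len(values) - length + 1):
            llr = normal_llr(values, start, length)
            if llr > best[0]:
                best = (llr, start, length)
    return best


def scan_p_value(values, max_len, n_perm=99, seed=0):
    """Monte Carlo p-value of the most likely cluster via permutation."""
    rng = random.Random(seed)
    observed = scan(values, max_len)
    shuffled = list(values)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if scan(shuffled, max_len)[0] >= observed[0]:
            exceed += 1
    return observed, (exceed + 1) / (n_perm + 1)


# Synthetic data: 30 standardized values with a low cluster at indices 10-14.
values = [((i * 7) % 5 - 2) * 0.5 for i in range(30)]
for i in range(10, 15):
    values[i] -= 5.0
(best_llr, best_start, best_length), p_value = scan_p_value(values, max_len=8)
```

Because the statistic is two-sided, it flags both unusually low and unusually high windows; restricting to low clusters would only require checking the sign of the difference between the inside and outside means.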
Steam Generator Group Project. Progress report on data acquisition/statistical analysis

International Nuclear Information System (INIS)

Doctor, P.G.; Buchanan, J.A.; McIntyre, J.M.; Hof, P.J.; Ercanbrack, S.S.

1984-01-01

A major task of the Steam Generator Group Project (SGGP) is to establish the reliability of the eddy current inservice inspections of PWR steam generator tubing, by comparing the eddy current data to the actual physical condition of the tubes via destructive analyses. This report describes the plans for the computer systems needed to acquire, store and analyze the diverse data to be collected during the project. The real-time acquisition of the baseline eddy current inspection data will be handled using a specially designed data acquisition computer system based on a Digital Equipment Corporation (DEC) PDP-11/44. The data will be archived in digital form for use after the project is completed. Data base management and statistical analyses will be done on a DEC VAX-11/780. Color graphics will be heavily used to summarize the data and the results of the analyses. The report describes the data that will be taken during the project and the statistical methods that will be used to analyze the data. 7 figures, 2 tables
Replicate This! Creating Individual-Level Data from Summary Statistics Using R

Science.gov (United States)

Morse, Brendan J.

2013-01-01

Incorporating realistic data and research examples into quantitative (e.g., statistics and research methods) courses has been widely recommended for enhancing student engagement and comprehension. One way to achieve these ends is to use a data generator to emulate the data in published research articles. "MorseGen" is a free data generator that…
Statistical Multipath Model Based on Experimental GNSS Data in Static Urban Canyon Environment

Directory of Open Access Journals (Sweden)

Yuze Wang

2018-04-01

A deep understanding of multipath characteristics is essential to design signal simulators and receivers in global navigation satellite system applications. As a new constellation is deployed and more applications occur in the urban environment, the statistical multipath models of navigation signals need further study. In this paper, we present statistical distribution models of multipath time delay, multipath power attenuation, and multipath fading frequency based on the experimental data in the urban canyon environment. The raw data of multipath characteristics are obtained by processing real navigation signals to study the statistical distribution. Fitting the statistical data shows that the probability distribution of time delay follows a gamma distribution, which is related to the waiting time of Poisson distributed events. The fading frequency follows an exponential distribution, and the mean of multipath power attenuation decreases linearly with an increasing time delay. In addition, the detailed statistical characteristics for satellites at different elevations and orbits are studied, and the parameters of each distribution are quite different. The research results give useful guidance for navigation simulator and receiver designers.
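The link between the gamma distribution and Poisson waiting times mentioned above can be checked numerically. The Python sketch below is illustrative only: it simulates waiting times as sums of exponential inter-arrival times and recovers the gamma parameters by the method of moments. The function names, the choice of three events, and the rate of 2.0 are assumptions for the example, not values from the paper, and the paper's own fitting procedure is not specified in the abstract.

```python
import random


def fit_gamma_moments(samples):
    """Method-of-moments estimates (shape, scale) for a gamma distribution."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((x - mean) ** 2 for x in samples) / (n - 1)
    # For a gamma distribution: mean = shape * scale, var = shape * scale**2.
    return mean * mean / var, var / mean


def simulate_waiting_times(n_events, rate, n_samples, seed=0):
    """Waiting time until the n-th event of a Poisson process with this rate."""
    rng = random.Random(seed)
    return [sum(rng.expovariate(rate) for _ in range(n_events))
            for _ in range(n_samples)]


# Hypothetical setting: wait for the 3rd event of a process with rate 2.0.
delays = simulate_waiting_times(n_events=3, rate=2.0, n_samples=20000)
shape, scale = fit_gamma_moments(delays)
```

The estimated shape should land close to the number of events and the scale close to the reciprocal of the rate, which is the sense in which a gamma-distributed delay corresponds to a Poisson waiting time.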
Securing cooperation from persons supplying statistical data.

Science.gov (United States)

AUBENQUE, M J; BLAIKLEY, R M; HARRIS, F F; LAL, R B; NEURDENBURG, M G; DE SHELLY HERNANDEZ, R

1954-01-01

Securing the co-operation of persons supplying information required for medical statistics is essentially a problem in human relations, and an understanding of the motivations, attitudes, and behaviour of the respondents is necessary. Before any new statistical survey is undertaken, it is suggested by Aubenque and Harris that a preliminary review be made so that the maximum use is made of existing information. Care should also be taken not to burden respondents with an overloaded questionnaire. Aubenque and Harris recommend simplified reporting. Complete population coverage is not necessary. Neurdenburg suggests that the co-operation and support of such organizations as medical associations and social security boards are important and that propaganda should be directed specifically to the groups whose co-operation is sought. Informal personal contacts are valuable and desirable, according to Blaikley, but may have adverse effects if the right kind of approach is not made. Financial payments as an incentive in securing co-operation are opposed by Neurdenburg, who proposes that only postage-free envelopes or similar small favours be granted. Blaikley and Harris, on the other hand, express the view that financial incentives may do much to gain the support of those required to furnish data; there are, however, other incentives, and full use should be made of the natural inclinations of respondents. Compulsion may be necessary in certain instances, but administrative rather than statutory measures should be adopted. Penalties, according to Aubenque, should be inflicted only when justified by imperative health requirements. The results of surveys should be made available as soon as possible to those who co-operated, and Aubenque and Harris point out that they should also be of practical value to the suppliers of the information. Greater co-operation can be secured from medical persons who have an understanding of the statistical principles involved; Aubenque and Neurdenburg
Patterns of ureteral motion: Data compression and statistics

International Nuclear Information System (INIS)

Mueller-Schauenburg, W.

1981-01-01

Images of ureteral peristaltics (ureteral kinetography) have been recorded at Tuebingen University Hospital since 1978. These images give a synoptical picture of ureteral motion in highly compressed form. Possibilities of data compression are discussed on the basis of functional path-time images, the ROI series, the in the path-time matrix, and the background subtraction. Particular attention is paid to problems of urethral activity statistics. (WU) [de

Right-sizing statistical models for longitudinal data.

Science.gov (United States)

Wood, Phillip K; Steinley, Douglas; Jackson, Kristina M

2015-12-01

Arguments are proposed that researchers using longitudinal data should consider more and less complex statistical model alternatives to their initially chosen techniques in an effort to "right-size" the model to the data at hand. Such model comparisons may alert researchers who use poorly fitting, overly parsimonious models to more complex, better-fitting alternatives and, alternatively, may identify more parsimonious alternatives to overly complex (and perhaps empirically underidentified and/or less powerful) statistical models. A general framework is proposed for considering (often nested) relationships between a variety of psychometric and growth curve models. A 3-step approach is proposed in which models are evaluated based on the number and patterning of variance components prior to selection of better-fitting growth models that explain both mean and variation-covariation patterns. The orthogonal free curve slope intercept (FCSI) growth model is considered a general model that includes, as special cases, many models, including the factor mean (FM) model (McArdle & Epstein, 1987), McDonald's (1967) linearly constrained factor model, hierarchical linear models (HLMs), repeated-measures multivariate analysis of variance (MANOVA), and the linear slope intercept (linearSI) growth model. The FCSI model, in turn, is nested within the Tuckerized factor model. The approach is illustrated by comparing alternative models in a longitudinal study of children's vocabulary and by comparing several candidate parametric growth and chronometric models in a Monte Carlo study. (c) 2015 APA, all rights reserved.
The art of data analysis how to answer almost any question using basic statistics

CERN Document Server

Jarman, Kristin H

2013-01-01

A friendly and accessible approach to applying statistics in the real world. With an emphasis on critical thinking, The Art of Data Analysis: How to Answer Almost Any Question Using Basic Statistics presents fun and unique examples, guides readers through the entire data collection and analysis process, and introduces basic statistical concepts along the way. Leaving proofs and complicated mathematics behind, the author portrays the more engaging side of statistics and emphasizes its role as a problem-solving tool. In addition, light-hearted case studies
Conjunction analysis and propositional logic in fMRI data analysis using Bayesian statistics.

Science.gov (United States)

Rudert, Thomas; Lohmann, Gabriele

2008-12-01

To evaluate logical expressions over different effects in data analyses using the general linear model (GLM) and to evaluate logical expressions over different posterior probability maps (PPMs). In functional magnetic resonance imaging (fMRI) data analysis, the GLM was applied to estimate unknown regression parameters. Based on the GLM, Bayesian statistics can be used to determine the probability of conjunction, disjunction, implication, or any other arbitrary logical expression over different effects or contrast. For second-level inferences, PPMs from individual sessions or subjects are utilized. These PPMs can be combined to a logical expression and its probability can be computed. The methods proposed in this article are applied to data from a STROOP experiment and the methods are compared to conjunction analysis approaches for test-statistics. The combination of Bayesian statistics with propositional logic provides a new approach for data analyses in fMRI. Two different methods are introduced for propositional logic: the first for analyses using the GLM and the second for common inferences about different probability maps. The methods introduced extend the idea of conjunction analysis to a full propositional logic and adapt it from test-statistics to Bayesian statistics. The new approaches allow inferences that are not possible with known standard methods in fMRI. (c) 2008 Wiley-Liss, Inc.
A Climate Statistics Tool and Data Repository

Science.gov (United States)

Wang, J.; Kotamarthi, V. R.; Kuiper, J. A.; Orr, A.

2017-12-01

Researchers at Argonne National Laboratory and collaborating organizations have generated regional scale, dynamically downscaled climate model output using Weather Research and Forecasting (WRF) version 3.3.1 at a 12km horizontal spatial resolution over much of North America. The WRF model is driven by boundary conditions obtained from three independent global scale climate models and two different future greenhouse gas emission scenarios, named representative concentration pathways (RCPs). The repository of results has a temporal resolution of three hours for all the simulations, includes more than 50 variables, is stored in Network Common Data Form (NetCDF) files, and the data volume is nearly 600Tb. A condensed 800Gb set of NetCDF files was made for selected variables most useful for climate-related planning, including daily precipitation, relative humidity, solar radiation, maximum temperature, minimum temperature, and wind. The WRF model simulations are conducted for three 10-year time periods (1995-2004, 2045-2054, and 2085-2094), and two future scenarios (RCP4.5 and RCP8.5). An open-source tool was coded using Python 2.7.8 and ESRI ArcGIS 10.3.1 programming libraries to parse the NetCDF files, compute summary statistics, and output results as GIS layers. Eight sets of summary statistics were generated as examples for the contiguous U.S. states and much of Alaska, including number of days over 90°F, number of days with a heat index over 90°F, heat waves, monthly and annual precipitation, drought, extreme precipitation, multi-model averages, and model bias. This paper will provide an overview of the project to generate the main and condensed data repositories, describe the Python tool and how to use it, present the GIS results of the computed examples, and discuss some of the ways they can be used for planning. The condensed climate data, Python tool, computed GIS results, and documentation of the work are shared on the Internet.
Application of statistical dynamical turbulence closures to data assimilation

International Nuclear Information System (INIS)

O'Kane, Terence J; Frederiksen, Jorgen S

2010-01-01

We describe the development of an accurate yet computationally tractable statistical dynamical closure theory for general inhomogeneous turbulent flows, coined the quasi-diagonal direct interaction approximation closure (QDIA), and its application to problems in data assimilation. The QDIA provides prognostic equations for evolving mean fields, covariances and higher-order non-Gaussian terms, all of which are also required in the formulation of data assimilation schemes for nonlinear geophysical flows. The QDIA is a generalization of the class of direct interaction approximation theories, initially developed by Kraichnan (1959 J. Fluid Mech. 5 497) for isotropic turbulence, to fully inhomogeneous flows and has been further generalized to allow for both inhomogeneous and non-Gaussian initial conditions and long integrations. A regularization procedure or empirical vertex renormalization that ensures correct inertial range spectra is also described. The aim of this paper is to provide a coherent mathematical description of the QDIA turbulence closure and closure-based data assimilation scheme we have labeled the statistical dynamical Kalman filter. The mathematical formalism presented has been synthesized from recent works of the authors with some additional material and is presented in sufficient detail that the paper is of a pedagogical nature.
Inferential Statistics from Black Hispanic Breast Cancer Survival Data

Directory of Open Access Journals (Sweden)

Hafiz M. R. Khan

2014-01-01

In this paper we test the statistical probability models for breast cancer survival data for race and ethnicity. Data was collected from breast cancer patients diagnosed in the United States during the years 1973–2009. We selected a stratified random sample of Black Hispanic female patients from the Surveillance Epidemiology and End Results (SEER) database to derive the statistical probability models. We used three common model building criteria, which include Akaike Information Criteria (AIC), Bayesian Information Criteria (BIC), and Deviance Information Criteria (DIC), to measure goodness of fit, and it was found that Black Hispanic female patients' survival data better fit the exponentiated exponential probability model. A novel Bayesian method was used to derive the posterior density function for the model parameters as well as to derive the predictive inference for future response. We specifically focused on Black Hispanic race. The Markov Chain Monte Carlo (MCMC) method was used for obtaining the summary results of posterior parameters. Additionally, we reported predictive intervals for future survival times. These findings would be of great significance in treatment planning and healthcare resource allocation.
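To make the model-selection step concrete, here is a small Python sketch that compares an exponential model with an exponentiated exponential model using AIC and BIC. It is a sketch under stated assumptions rather than the authors' analysis: the survival times are made-up numbers, not SEER data; the fit is by maximum likelihood with a simple grid search over the rate rather than the Bayesian approach in the paper; all function names are hypothetical; and DIC is omitted because it requires posterior samples.

```python
import math


def aic(loglik, n_params):
    return 2 * n_params - 2 * loglik


def bic(loglik, n_params, n_obs):
    return n_params * math.log(n_obs) - 2 * loglik


def fit_exponential(times):
    """Closed-form maximum likelihood fit; returns (loglik, rate)."""
    n = len(times)
    rate = n / sum(times)
    return n * math.log(rate) - rate * sum(times), rate


def ee_profile_loglik(times, rate):
    """Exponentiated exponential log-likelihood, maximized over alpha.

    Density: alpha * rate * exp(-rate*t) * (1 - exp(-rate*t)) ** (alpha - 1).
    """
    n = len(times)
    s = sum(math.log1p(-math.exp(-rate * t)) for t in times)  # always < 0
    alpha = -n / s  # closed-form maximizer for a fixed rate
    loglik = (n * math.log(alpha) + n * math.log(rate)
              - rate * sum(times) + (alpha - 1) * s)
    return loglik, alpha


def fit_exponentiated_exponential(times):
    """Grid search over the rate, centred on the exponential estimate."""
    _, base = fit_exponential(times)
    candidates = [base] + [base * (0.2 + 0.05 * i) for i in range(97)]
    return max(ee_profile_loglik(times, r) + (r,) for r in candidates)


# Hypothetical survival times in months (illustrative only).
times = [2.5, 4.1, 5.0, 6.3, 7.7, 8.2, 9.9, 11.4, 12.0, 14.6,
         15.3, 18.8, 21.0, 25.5, 30.2]
ll_exp, rate_exp = fit_exponential(times)
ll_ee, alpha_ee, rate_ee = fit_exponentiated_exponential(times)
scores = {
    "exponential": (aic(ll_exp, 1), bic(ll_exp, 1, len(times))),
    "exponentiated exponential": (aic(ll_ee, 2), bic(ll_ee, 2, len(times))),
}
```

Lower AIC or BIC indicates the preferred model. Since the exponential model is the special case alpha = 1, the two-parameter model always fits at least as well, and the criteria decide whether the improvement justifies the extra parameter.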
Critical Views of 8th Grade Students toward Statistical Data in Newspaper Articles: Analysis in Light of Statistical Literacy

Science.gov (United States)

Guler, Mustafa; Gursoy, Kadir; Guven, Bulent

2016-01-01

Understanding and interpreting biased data, decision-making in accordance with the data, and critically evaluating situations involving data are among the fundamental skills necessary in the modern world. To develop these required skills, emphasis on statistical literacy in school mathematics has been gradually increased in recent years. The…
Statistical transformation and the interpretation of inpatient glucose control data.

Science.gov (United States)

Saulnier, George E; Castro, Janna C; Cook, Curtiss B

2014-03-01

To introduce a statistical method of assessing hospital-based non-intensive care unit (non-ICU) inpatient glucose control. Point-of-care blood glucose (POC-BG) data from hospital non-ICUs were extracted for January 1 through December 31, 2011. Glucose data distribution was examined before and after Box-Cox transformations and compared to normality. Different subsets of data were used to establish upper and lower control limits, and exponentially weighted moving average (EWMA) control charts were constructed from June, July, and October data as examples to determine if out-of-control events were identified differently in nontransformed versus transformed data. A total of 36,381 POC-BG values were analyzed. In all 3 monthly test samples, glucose distributions in nontransformed data were skewed but approached a normal distribution once transformed. Interpretation of out-of-control events from EWMA control chart analyses also revealed differences. In the June test data, an out-of-control process was identified at sample 53 with nontransformed data, whereas the transformed data remained in control for the duration of the observed period. Analysis of July data demonstrated an out-of-control process sooner in the transformed (sample 55) than nontransformed (sample 111) data, whereas for October, transformed data remained in control longer than nontransformed data. Statistical transformations increase the normal behavior of inpatient non-ICU glycemic data sets. The decision to transform glucose data could influence the interpretation and conclusions about the status of inpatient glycemic control. Further study is required to determine whether transformed versus nontransformed data influence clinical decisions or evaluation of interventions.
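The two ingredients of this method, a Box-Cox transformation and an EWMA control chart, can be sketched in Python as follows. This is a minimal sketch, not the study's code: the smoothing weight of 0.2, the 3-sigma limit width, and the sample values are assumptions for illustration, and the abstract does not state which settings the authors used.

```python
import math


def box_cox(x, lam):
    """Box-Cox transform of a positive value x with parameter lam."""
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1.0) / lam


def ewma_signals(samples, center, sigma, weight=0.2, width=3.0):
    """Return 1-based sample numbers where the EWMA leaves its limits."""
    z = center  # the EWMA starts at the in-control mean
    signals = []
    for t, x in enumerate(samples, start=1):
        z = weight * x + (1.0 - weight) * z
        # Time-varying limits that widen toward their asymptotic value.
        half_width = width * sigma * math.sqrt(
            weight / (2.0 - weight) * (1.0 - (1.0 - weight) ** (2 * t)))
        if abs(z - center) > half_width:
            signals.append(t)
    return signals


# Hypothetical standardized process: in control for 5 samples, then a shift.
samples = [0.0] * 5 + [3.0] * 10
signals = ewma_signals(samples, center=0.0, sigma=1.0)
```

In practice the glucose values would first be passed through `box_cox` with a fitted parameter, and the chart's center and sigma would be estimated from the transformed in-control data; running the chart on raw and transformed values can then be compared as in the abstract.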
ROOT — A C++ framework for petabyte data storage, statistical analysis and visualization

CERN Document Server

Antcheva, I; Bellenot, B; Biskup, M; Brun, R; Buncic, N; Canal, Ph; Casadei, D; Couet, O; Fine, V; Franco, L; Ganis, G; Gheata, A; Gonzalez Maline, D; Goto, M; Iwaszkiewicz, J; Kreshuk, A; Marcos Segura, D; Maunder, R; Moneta, L; Naumann, A; Offermann, E; Onuchin, V; Panacek, S; Rademakers, F; Russo, P; Tadel, M

2009-01-01

ROOT is an object-oriented C++ framework conceived in the high-energy physics (HEP) community, designed for storing and analyzing petabytes of data in an efficient way. Any instance of a C++ class can be stored into a ROOT file in a machine-independent compressed binary format. In ROOT the TTree object container is optimized for statistical data analysis over very large data sets by using vertical data storage techniques. These containers can span a large number of files on local disks, the web, or a number of different shared file systems. In order to analyze this data, the user can choose from a wide set of mathematical and statistical functions, including linear algebra classes, numerical algorithms such as integration and minimization, and various methods for performing regression analysis (fitting). In particular, the RooFit package allows the user to perform complex data modeling and fitting while the RooStats library provides abstractions and implementations for advanced statistical tools. Multivariat...
PROSA: A computer program for statistical analysis of near-real-time-accountancy (NRTA) data

International Nuclear Information System (INIS)

Beedgen, R.; Bicking, U.

1987-04-01

The computer program PROSA (Program for Statistical Analysis of NRTA Data) is a tool to decide on the basis of statistical considerations if, in a given sequence of materials balance periods, a loss of material might have occurred or not. The evaluation of the material balance data is based on statistical test procedures. In PROSA three truncated sequential tests are applied to a sequence of material balances. The manual describes the statistical background of PROSA and how to use the computer program on an IBM-PC with DOS 3.1. (orig.) [de
Common misconceptions about data analysis and statistics

Science.gov (United States)

Motulsky, Harvey J

2015-01-01

Ideally, any experienced investigator with the right tools should be able to reproduce a finding published in a peer-reviewed biomedical science journal. In fact, the reproducibility of a large percentage of published findings has been questioned. Undoubtedly, there are many reasons for this, but one reason may be that investigators fool themselves due to a poor understanding of statistical concepts. In particular, investigators often make these mistakes: (1) P-Hacking. This is when you reanalyze a data set in many different ways, or perhaps reanalyze with additional replicates, until you get the result you want. (2) Overemphasis on P values rather than on the actual size of the observed effect. (3) Overuse of statistical hypothesis testing, and being seduced by the word “significant”. (4) Overreliance on standard errors, which are often misunderstood. PMID:25692012
Data-Mining Opportunities for Small and Medium Enterprises with Official Statistics in the UK

Directory of Open Access Journals (Sweden)

Coleman Shirley Y.

2016-12-01

There is a growing interest in data amongst small and medium enterprises (SMEs). This article looks at ways in which SMEs can combine their internal company data with open data, such as official statistics, and thereby enhance their business opportunities. Case studies are given as illustrations of the statistical and data-mining methods involved in such integrated data analytics. The article considers the barriers that prevent more SMEs from benefitting in this field and appraises some of the initiatives that are aimed at helping to overcome them. The discussion emphasizes the importance of bringing people together from the business, IT, and statistical worlds and suggests ways for statisticians to make a greater impact.
Applications of spatial statistical network models to stream data

Science.gov (United States)

Daniel J. Isaak; Erin E. Peterson; Jay M. Ver Hoef; Seth J. Wenger; Jeffrey A. Falke; Christian E. Torgersen; Colin Sowder; E. Ashley Steel; Marie-Josee Fortin; Chris E. Jordan; Aaron S. Ruesch; Nicholas Som; Pascal Monestiez

2014-01-01

Streams and rivers host a significant portion of Earth's biodiversity and provide important ecosystem services for human populations. Accurate information regarding the status and trends of stream resources is vital for their effective conservation and management. Most statistical techniques applied to data measured on stream networks were developed for...
Data analysis for radiological characterisation: Geostatistical and statistical complementarity

International Nuclear Information System (INIS)

Desnoyers, Yvon; Dubot, Didier

2012-01-01

Radiological characterisation may cover a large range of evaluation objectives during a decommissioning and dismantling (D&D) project: removal of doubt, delineation of contaminated materials, monitoring of the decontamination work and final survey. At each stage, collecting relevant data to be able to draw the conclusions needed is quite a big challenge. In particular, two radiological characterisation stages require an advanced sampling process and data analysis, namely the initial categorization and optimisation of the materials to be removed and the final survey to demonstrate compliance with clearance levels. On the one hand, the latter is widely used and well developed in national guides and norms, using random sampling designs and statistical data analysis. On the other hand, a more complex evaluation methodology has to be implemented for the initial radiological characterisation, both for sampling design and for data analysis. The geostatistical framework is an efficient way to satisfy the radiological characterisation requirements, providing a sound decision-making approach for the decommissioning and dismantling of nuclear premises. The relevance of the geostatistical methodology relies on the presence of a spatial continuity for radiological contamination. Thus geostatistics provides reliable methods for activity estimation, uncertainty quantification and risk analysis, leading to a sound classification of radiological waste (surfaces and volumes). This way, the radiological characterisation of contaminated premises can be divided into three steps. First, the most exhaustive facility analysis provides historical and qualitative information. Then, a systematic (exhaustive or not) surface survey of the contamination is implemented on a regular grid. Finally, in order to assess activity levels and contamination depths, destructive samples are collected at several locations within the premises (based on the surface survey results) and analysed. Combined with
Identification of reliable gridded reference data for statistical downscaling methods in Alberta

Science.gov (United States)

Eum, H. I.; Gupta, A.

2017-12-01

Climate models provide essential information to assess impacts of climate change at regional and global scales. However, statistical downscaling methods have been applied to prepare climate model data for various applications such as hydrologic and ecologic modelling at a watershed scale. As the reliability and (spatial and temporal) resolution of statistically downscaled climate data mainly depend on the reference data, identifying the most reliable reference data is crucial for statistical downscaling. A growing number of gridded climate products are available for key climate variables which are main input data to regional modelling systems. However, inconsistencies in these climate products, for example, different combinations of climate variables, varying data domains and data lengths and data accuracy varying with physiographic characteristics of the landscape, have caused significant challenges in selecting the most suitable reference climate data for various environmental studies and modelling. Employing various observation-based daily gridded climate products available in the public domain, i.e. thin plate spline regression products (ANUSPLIN and TPS), inverse distance method (Alberta Townships), and numerical climate model (North American Regional Reanalysis) and an optimum interpolation technique (Canadian Precipitation Analysis), this study evaluates the accuracy of the climate products at each grid point by comparing with the Adjusted and Homogenized Canadian Climate Data (AHCCD) observations for precipitation, minimum and maximum temperature over the province of Alberta. Based on the performance of climate products at AHCCD stations, we ranked the reliability of these publicly available climate products corresponding to the elevations of stations discretized into several classes. According to the rank of climate products for each elevation class, we identified the most reliable climate products based on the elevation of target points. A web-based system
Statistical yearbook 2002-2004. Data available as of February 2005. 49 ed

International Nuclear Information System (INIS)

2005-09-01

This is the forty-ninth issue of the United Nations Statistical Yearbook, prepared by the Statistics Division, Department of Economic and Social Affairs of the United Nations Secretariat. The data included generally cover the years between 1993 and 2003 and are, for the most part, those statistics which were available to the Statistics Division as of February 2005. The 81 tables of the Yearbook are based on data compiled by the Statistics Division from over 35 international and national sources. These sources include the United Nations Statistics Division in the fields of national accounts, industry, energy, transport and international trade, the United Nations Statistics Division and Population Division in the field of demographic statistics, and over 20 offices of the United Nations system and international organizations in other specialized fields. The Yearbook is organized in four parts. The first part, World and Region Summary, presents key world and regional aggregates and totals. In the other three parts, the subject matter is generally presented by countries or areas, with world and regional aggregates shown in some cases only. Parts two, three and four cover, respectively, population and social topics, national economic activity, and international economic relations. Each chapter ends with brief technical notes on statistical sources and methods for the tables it includes. References to sources and related methodological publications are provided at the end of the Yearbook in the section 'Statistical sources and references'. Annex I provides complete information on country and area nomenclature, and regional and other groupings used in the Yearbook. Annex II lists conversion coefficients and factors used in various tables. A list of tables added to or omitted from the last issue of the Yearbook is given in annex III. Symbols and conventions used in the Yearbook are shown in the section 'Explanatory notes', preceding the Introduction
A Statistical Toolbox For Mining And Modeling Spatial Data

Directory of Open Access Journals (Sweden)

D’Aubigny Gérard

2016-12-01

Most data mining projects in spatial economics start with an evaluation of a set of attribute variables on a sample of spatial entities, looking for the existence and strength of spatial autocorrelation, based on the Moran’s and the Geary’s coefficients, the adequacy of which is rarely challenged, despite the fact that when reporting on their properties, many users seem likely to make mistakes and to foster confusion. My paper begins with a critical appraisal of the classical definition and rationale of these indices. I argue that while intuitively founded, they are plagued by an inconsistency in their conception. Then, I propose a principled small change leading to corrected spatial autocorrelation coefficients, which strongly simplifies their relationship, and opens the way to an augmented toolbox of statistical methods of dimension reduction and data visualization, also useful for modeling purposes. A second section presents a formal framework, adapted from recent work in statistical learning, which gives theoretical support to our definition of corrected spatial autocorrelation coefficients. More specifically, the multivariate data mining methods presented here, are easily implementable on the existing (free) software, yield methods useful to exploit the proposed corrections in spatial data analysis practice, and, from a mathematical point of view, whose asymptotic behavior, already studied in a series of papers by Belkin & Niyogi, suggests that they own qualities of robustness and a limited sensitivity to the Modifiable Areal Unit Problem (MAUP), valuable in exploratory spatial data analysis.
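For orientation, the two classical coefficients that this abstract takes as its starting point can be sketched as follows. This is a minimal illustration of the textbook definitions of Moran's I and Geary's C, not of the corrected coefficients the paper proposes; the function names, the toy values and the binary neighbour weights are assumptions made for the example.

```python
def _deviations(values):
    """Return each value minus the sample mean."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def morans_i(values, weights):
    """Classical Moran's I for a square spatial weight matrix (list of lists)."""
    n = len(values)
    dev = _deviations(values)
    s0 = sum(sum(row) for row in weights)  # sum of all weights
    cross = sum(weights[i][j] * dev[i] * dev[j]
                for i in range(n) for j in range(n))
    return (n / s0) * cross / sum(d * d for d in dev)


def gearys_c(values, weights):
    """Classical Geary's C for the same inputs."""
    n = len(values)
    dev = _deviations(values)
    s0 = sum(sum(row) for row in weights)
    squared_diffs = sum(weights[i][j] * (values[i] - values[j]) ** 2
                        for i in range(n) for j in range(n))
    return ((n - 1) / (2 * s0)) * squared_diffs / sum(d * d for d in dev)


# Hypothetical example: four areas in a row, each a neighbour of the next.
example_values = [1.0, 2.0, 3.0, 4.0]
example_weights = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
example_i = morans_i(example_values, example_weights)
example_c = gearys_c(example_values, example_weights)
```

In this toy case the values rise smoothly along the row, so Moran's I comes out positive and Geary's C falls below 1, both of which are conventionally read as positive spatial autocorrelation.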
Explanation of the methods employed in the statistical evaluation of SALE program data

International Nuclear Information System (INIS)

Bracey, J.T.; Soriano, M.

1981-01-01

The analysis of Safeguards Analytical Laboratory Evaluation (SALE) bimonthly data is described. Statistical procedures are discussed in Section A, followed by the descriptions of tabular and graphic values in Section B. Calculation formulae for the various statistics in the reports are presented in Section C. SALE data reported to New Brunswick Laboratory (NBL) are entered into a computerized system through routine data processing procedures. Bimonthly and annual reports are generated from this data system. In the bimonthly data analysis, data from the six most recent reporting periods of each laboratory-material-analytical method combination are utilized. Analysis results in the bimonthly reports are only presented for those participants who have reported data at least once during the last 12-month period. Reported values are transformed to relative percent difference values calculated by [(reported value - reference value)/reference value] x 100. Analysis of data is performed on these transformed values. Accordingly, the results given in the bimonthly report are (relative) percent differences (% DIFF). Suspect, large variations are verified with individual participants to eliminate errors in the transcription process. Statistical extreme values are not excluded from bimonthly analysis; all data are used
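The transformation quoted in this abstract, [(reported value - reference value)/reference value] x 100, can be sketched in a few lines. The function name and the sample numbers below are hypothetical and serve only to illustrate the arithmetic; they are not SALE program data.

```python
def relative_percent_difference(reported, reference):
    """Return [(reported - reference) / reference] x 100."""
    if reference == 0:
        raise ValueError("reference value must be non-zero")
    return (reported - reference) / reference * 100.0


# Hypothetical reported values for a material with reference value 50.0.
reference_value = 50.0
reported_values = [50.5, 49.0, 50.0]
pct_diff = [relative_percent_difference(v, reference_value)
            for v in reported_values]
```

Working on these relative values rather than on the raw measurements puts materials with different reference values on a common scale.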
Data Acquisition and Preprocessing in Studies on Humans: What Is Not Taught in Statistics Classes?

Science.gov (United States)

Zhu, Yeyi; Hernandez, Ladia M; Mueller, Peter; Dong, Yongquan; Forman, Michele R

2013-01-01

The aim of this paper is to address issues in research that may be missing from statistics classes and important for (bio-)statistics students. In the context of a case study, we discuss data acquisition and preprocessing steps that fill the gap between research questions posed by subject matter scientists and statistical methodology for formal inference. Issues include participant recruitment, data collection training and standardization, variable coding, data review and verification, data cleaning and editing, and documentation. Despite the critical importance of these details in research, most of these issues are rarely discussed in an applied statistics program. One reason for the lack of more formal training is the difficulty in addressing the many challenges that can possibly arise in the course of a study in a systematic way. This article can help to bridge this gap between research questions and formal statistical inference by using an illustrative case study for a discussion. We hope that reading and discussing this paper and practicing data preprocessing exercises will sensitize statistics students to these important issues and achieve optimal conduct, quality control, analysis, and interpretation of a study.
Rule-based statistical data mining agents for an e-commerce application

Science.gov (United States)

Qin, Yi; Zhang, Yan-Qing; King, K. N.; Sunderraman, Rajshekhar

2003-03-01

Intelligent data mining techniques have useful e-Business applications. Because an e-Commerce application is related to multiple domains such as statistical analysis, market competition, price comparison, profit improvement and personal preferences, this paper presents a hybrid knowledge-based e-Commerce system fusing intelligent techniques, statistical data mining, and personal information to enhance QoS (Quality of Service) of e-Commerce. A Web-based e-Commerce application software system, eDVD Web Shopping Center, is successfully implemented using Java servlets and an Oracle8i database server. Simulation results have shown that the hybrid intelligent e-Commerce system is able to make smart decisions for different customers.

Powerful Inference with the D-Statistic on Low-Coverage Whole-Genome Data.

Science.gov (United States)

Soraggi, Samuele; Wiuf, Carsten; Albrechtsen, Anders

2018-02-02

The detection of ancient gene flow between human populations is an important issue in population genetics. A common tool for detecting ancient admixture events is the D-statistic. The D-statistic is based on the hypothesis of a genetic relationship that involves four populations, whose correctness is assessed by evaluating specific coincidences of alleles between the groups. When working with high-throughput sequencing data, calling genotypes accurately is not always possible; therefore, the D-statistic currently samples a single base from the reads of one individual per population. This implies ignoring much of the information in the data, an issue especially striking in the case of ancient genomes. We provide a significant improvement to overcome the problems of the D-statistic by considering all reads from multiple individuals in each population. We also apply type-specific error correction to combat the problems of sequencing errors, and show a way to correct for introgression from an external population that is not part of the supposed genetic relationship, and how this leads to an estimate of the admixture rate. We prove that the D-statistic is approximated by a standard normal distribution. Furthermore, we show that our method outperforms the traditional D-statistic in detecting admixtures. The power gain is most pronounced for low and medium sequencing depth (1-10×), and performances are as good as with perfectly called genotypes at a sequencing depth of 2×. We show the reliability of error correction in scenarios with simulated errors and ancient data, and correct for introgression in known scenarios to estimate the admixture rates. Copyright © 2018 Soraggi et al.
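As background, the traditional single-base form of the D-statistic that this abstract improves upon can be sketched as a count of "ABBA" and "BABA" allele patterns across sites. This is a minimal illustration under stated assumptions: one sampled allele per population per site, the fourth population used as the outgroup, and the sign convention D = (ABBA - BABA) / (ABBA + BABA). It is not the authors' extended method, which uses all reads from multiple individuals; the function name and the toy sites are hypothetical.

```python
def d_statistic(sites):
    """Classic ABBA-BABA D from (h1, h2, h3, h4) allele tuples, h4 = outgroup."""
    abba = 0
    baba = 0
    for h1, h2, h3, h4 in sites:
        if h3 == h4:
            continue  # H3 must differ from the outgroup to be informative
        if h1 == h4 and h2 == h3:
            abba += 1  # pattern A B B A
        elif h1 == h3 and h2 == h4:
            baba += 1  # pattern B A B A
    total = abba + baba
    if total == 0:
        raise ValueError("no informative ABBA/BABA sites")
    return (abba - baba) / total


# Hypothetical sites: two ABBA, one BABA, two uninformative.
example_sites = [
    ("A", "G", "G", "A"),
    ("A", "G", "G", "A"),
    ("G", "A", "G", "A"),
    ("A", "A", "A", "A"),
    ("A", "A", "G", "A"),
]
example_d = d_statistic(example_sites)
```

Under the hypothesised tree without gene flow, the two patterns are expected to occur equally often, so values of D far from zero are taken as a signal of admixture.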
Statistical Analysis for High-Dimensional Data : The Abel Symposium 2014

CERN Document Server

Bühlmann, Peter; Glad, Ingrid; Langaas, Mette; Richardson, Sylvia; Vannucci, Marina

2016-01-01

This book features research contributions from The Abel Symposium on Statistical Analysis for High Dimensional Data, held in Nyvågar, Lofoten, Norway, in May 2014. The focus of the symposium was on statistical and machine learning methodologies specifically developed for inference in “big data” situations, with particular reference to genomic applications. The contributors, who are among the most prominent researchers on the theory of statistics for high dimensional inference, present new theories and methods, as well as challenging applications and computational solutions. Specific themes include, among others, variable selection and screening, penalised regression, sparsity, thresholding, low dimensional structures, computational challenges, non-convex situations, learning graphical models, sparse covariance and precision matrices, semi- and non-parametric formulations, multiple testing, classification, factor models, clustering, and preselection. Highlighting cutting-edge research and casting light on...
A Scan Statistic for Continuous Data Based on the Normal Probability Model

OpenAIRE

Konty, Kevin; Kulldorff, Martin; Huang, Lan

2009-01-01

Temporal, spatial and space-time scan statistics are commonly used to detect and evaluate the statistical significance of temporal and/or geographical disease clusters, without any prior assumptions on the location, time period or size of those clusters. Scan statistics are mostly used for count data, such as disease incidence or mortality. Sometimes there is an interest in looking for clusters with respect to a continuous variable, such as lead levels in children or low birth weight...
mapDIA: Preprocessing and statistical analysis of quantitative proteomics data from data independent acquisition mass spectrometry.

Science.gov (United States)

Teo, Guoshou; Kim, Sinae; Tsou, Chih-Chiang; Collins, Ben; Gingras, Anne-Claude; Nesvizhskii, Alexey I; Choi, Hyungwon

2015-11-03

Data independent acquisition (DIA) mass spectrometry is an emerging technique that offers more complete detection and quantification of peptides and proteins across multiple samples. DIA allows fragment-level quantification, which can be considered as repeated measurements of the abundance of the corresponding peptides and proteins in the downstream statistical analysis. However, few statistical approaches are available for aggregating these complex fragment-level data into peptide- or protein-level statistical summaries. In this work, we describe a software package, mapDIA, for statistical analysis of differential protein expression using DIA fragment-level intensities. The workflow consists of three major steps: intensity normalization, peptide/fragment selection, and statistical analysis. First, mapDIA offers normalization of fragment-level intensities by total intensity sums as well as a novel alternative normalization by local intensity sums in retention time space. Second, mapDIA removes outlier observations and selects peptides/fragments that preserve the major quantitative patterns across all samples for each protein. Last, using the selected fragments and peptides, mapDIA performs model-based statistical significance analysis of protein-level differential expression between specified groups of samples. Using a comprehensive set of simulation datasets, we show that mapDIA detects differentially expressed proteins with accurate control of the false discovery rates. We also describe the analysis procedure in detail using two recently published DIA datasets generated for 14-3-3β dynamic interaction network and prostate cancer glycoproteome. The software was written in C++ language and the source code is available for free through SourceForge website http://sourceforge.net/projects/mapdia/. This article is part of a Special Issue entitled: Computational Proteomics. Copyright © 2015 Elsevier B.V. All rights reserved.
Statistical Approaches Accommodating Uncertainty in Modern Genomic Data

DEFF Research Database (Denmark)

Skotte, Line

...... the potential of the technological advances. The first of the four papers included in this thesis describes a new method for association mapping that accommodates uncertain genotypes from low-coverage re-sequencing data. The method allows uncertain genotypes using a score statistic based on the joint likelihood of the observed phenotypes and the observed sequencing data. This joint likelihood accounts for the genotype uncertainties via the posterior probabilities of each genotype given the observed sequencing data and the phenotype distributions are modelled using a generalised linear model framework which makes the contributed method applicable to case-control studies as well as mapping of quantitative traits. The contributed method provides a needed association test for quantitative traits in the presence of uncertain genotypes and it further allows correction for population structure in association tests for disease......
QB2OLAP : enabling OLAP on statistical linked open data

OpenAIRE

Varga, Jovan; Etcheverry, Lorena; Vaisman, Alejandro; Romero Moral, Óscar; Bach Pedersen, Torben; Thomsen, Christian

2016-01-01

Publication and sharing of multidimensional (MD) data on the Semantic Web (SW) opens new opportunities for the use of On-Line Analytical Processing (OLAP). The RDF Data Cube (QB) vocabulary, the current standard for statistical data publishing, however, lacks key MD concepts such as dimension hierarchies and aggregate functions. QB4OLAP was proposed to remedy this. However, QB4OLAP requires extensive manual annotation and users must still write queries in SPARQL, the standard query language f...
Calculation of Tajima's D and other neutrality test statistics from low depth next-generation sequencing data

DEFF Research Database (Denmark)

Korneliussen, Thorfinn Sand; Moltke, Ida; Albrechtsen, Anders

2013-01-01

A number of different statistics are used for detecting natural selection using DNA sequencing data, including statistics that are summaries of the frequency spectrum, such as Tajima's D. These statistics are now often being applied in the analysis of Next Generation Sequencing (NGS) data. However, estimates of frequency spectra from NGS data are strongly affected by low sequencing coverage; the inherent technology dependent variation in sequencing depth causes systematic differences in the value of the statistic among genomic regions....
Statistical analysis of solid waste composition data: Arithmetic mean, standard deviation and correlation coefficients.

Science.gov (United States)

Edjabou, Maklawe Essonanawe; Martín-Fernández, Josep Antoni; Scheutz, Charlotte; Astrup, Thomas Fruergaard

2017-11-01

Data for fractional solid waste composition provide relative magnitudes of individual waste fractions, the percentages of which always sum to 100, thereby connecting them intrinsically. Due to this sum constraint, waste composition data represent closed data, and their interpretation and analysis require statistical methods, other than classical statistics that are suitable only for non-constrained data such as absolute values. However, the closed characteristics of waste composition data are often ignored when analysed. The results of this study showed, for example, that unavoidable animal-derived food waste amounted to 2.21±3.12% with a confidence interval of (-4.03; 8.45), which highlights the problem of the biased negative proportions. A Pearson's correlation test, applied to waste fraction generation (kg mass), indicated a positive correlation between avoidable vegetable food waste and plastic packaging. However, correlation tests applied to waste fraction compositions (percentage values) showed a negative association in this regard, thus demonstrating that statistical analyses applied to compositional waste fraction data, without addressing the closed characteristics of these data, have the potential to generate spurious or misleading results. Therefore, compositional data should be transformed adequately prior to any statistical analysis, such as computing mean, standard deviation and correlation coefficients. Copyright © 2017 Elsevier Ltd. All rights reserved.
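The sum constraint described in this abstract, and one common way of transforming closed data before analysis, can be sketched as below. The abstract does not name a specific transformation in this excerpt, so the centred log-ratio (clr) used here is an assumption chosen for illustration; the function names and the three-part composition are hypothetical.

```python
import math


def closure(parts, total=100.0):
    """Rescale positive parts so that they sum to `total` (percentages by default)."""
    s = sum(parts)
    return [total * p / s for p in parts]


def clr(parts):
    """Centred log-ratio transform: log of each part minus the mean log."""
    if any(p <= 0 for p in parts):
        raise ValueError("clr requires strictly positive parts")
    logs = [math.log(p) for p in parts]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


# Hypothetical waste fractions in kg, then as percentages of the total.
masses_kg = [2.0, 3.0, 5.0]
percentages = closure(masses_kg)
clr_from_masses = clr(masses_kg)
clr_from_percentages = clr(percentages)
```

The clr values are identical whether they are computed from the masses or from the percentages, which is why log-ratio coordinates are unaffected by the closure step; strictly positive parts are required, so zero fractions would need separate treatment.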
Bayesian Sensitivity Analysis of Statistical Models with Missing Data.

Science.gov (United States)

Zhu, Hongtu; Ibrahim, Joseph G; Tang, Niansheng

2014-04-01

Methods for handling missing data depend strongly on the mechanism that generated the missing values, such as missing completely at random (MCAR) or missing at random (MAR), as well as other distributional and modeling assumptions at various stages. It is well known that the resulting estimates and tests may be sensitive to these assumptions as well as to outlying observations. In this paper, we introduce various perturbations to modeling assumptions and individual observations, and then develop a formal sensitivity analysis to assess these perturbations in the Bayesian analysis of statistical models with missing data. We develop a geometric framework, called the Bayesian perturbation manifold, to characterize the intrinsic structure of these perturbations. We propose several intrinsic influence measures to perform sensitivity analysis and quantify the effect of various perturbations to statistical models. We use the proposed sensitivity analysis procedure to systematically investigate the tenability of the non-ignorable missing at random (NMAR) assumption. Simulation studies are conducted to evaluate our methods, and a dataset is analyzed to illustrate the use of our diagnostic measures.
Misuse of statistics in the interpretation of data on low-level radiation

International Nuclear Information System (INIS)

Hamilton, L.D.

1982-01-01

Four misuses of statistics in the interpretation of data of low-level radiation are reviewed: (1) post-hoc analysis and aggregation of data leading to faulty conclusions in the reanalysis of genetic effects of the atomic bomb, and premature conclusions on the Portsmouth Naval Shipyard data; (2) inappropriate adjustment for age and ignoring differences between urban and rural areas leading to potentially spurious increase in incidence of cancer at Rocky Flats; (3) hazard of summary statistics based on ill-conditioned individual rates leading to spurious association between childhood leukemia and fallout in Utah; and (4) the danger of prematurely published preliminary work with inadequate consideration of epidemiological problems - censored data - leading to inappropriate conclusions, needless alarm at the Portsmouth Naval Shipyard, and diversion of scarce research funds
Misuse of statistics in the interpretation of data on low-level radiation

Energy Technology Data Exchange (ETDEWEB)

Hamilton, L.D.

1982-01-01

Four misuses of statistics in the interpretation of data of low-level radiation are reviewed: (1) post-hoc analysis and aggregation of data leading to faulty conclusions in the reanalysis of genetic effects of the atomic bomb, and premature conclusions on the Portsmouth Naval Shipyard data; (2) inappropriate adjustment for age and ignoring differences between urban and rural areas leading to potentially spurious increase in incidence of cancer at Rocky Flats; (3) hazard of summary statistics based on ill-conditioned individual rates leading to spurious association between childhood leukemia and fallout in Utah; and (4) the danger of prematurely published preliminary work with inadequate consideration of epidemiological problems - censored data - leading to inappropriate conclusions, needless alarm at the Portsmouth Naval Shipyard, and diversion of scarce research funds.
A framework for the economic analysis of data collection methods for vital statistics.

Science.gov (United States)

Jimenez-Soto, Eliana; Hodge, Andrew; Nguyen, Kim-Huong; Dettrick, Zoe; Lopez, Alan D

2014-01-01

Over recent years there has been a strong movement towards the improvement of vital statistics and other types of health data that inform evidence-based policies. Collecting such data is not cost free. To date there is no systematic framework to guide investment decisions on methods of data collection for vital statistics or health information in general. We developed a framework to systematically assess the comparative costs and outcomes/benefits of the various data collection methods (DCMs) for vital statistics. The proposed framework is four-pronged and utilises two major economic approaches to systematically assess the available data collection methods: cost-effectiveness analysis and efficiency analysis. We built a stylised example of a hypothetical low-income country to perform a simulation exercise in order to illustrate an application of the framework. Using simulated data, the results from the stylised example show that the rankings of the data collection methods are not affected by the use of either cost-effectiveness or efficiency analysis. However, the rankings are affected by how quantities are measured. There have been several calls for global improvements in collecting useable data, including vital statistics, from health information systems to inform public health policies. Ours is the first study that proposes a systematic framework to assist countries in undertaking an economic evaluation of DCMs. Despite numerous challenges, we demonstrate that a systematic assessment of outputs and costs of DCMs is not only necessary, but also feasible. The proposed framework is general enough to be easily extended to other areas of health information.
The Use of Advanced Transportation Monitoring Data for Official Statistics

NARCIS (Netherlands)

Y. Ma (Yinyi)

2016-01-01

Traffic and transportation statistics are mainly published as aggregated information, and are traditionally based on surveys or secondary data sources, like public registers and companies’ administrations. Nowadays, advanced monitoring systems are installed in the road network, offering
Maximum entropy prior uncertainty and correlation of statistical economic data

NARCIS (Netherlands)

Dias, Rodriques J.F.

2016-01-01

Empirical estimates of source statistical economic data such as trade flows, greenhouse gas emissions or employment figures are always subject to uncertainty (stemming from measurement errors or confidentiality) but information concerning that uncertainty is often missing. This paper uses concepts
Statistical yearbook 2001. Data available as of 15 December 2003. 48 ed

International Nuclear Information System (INIS)

2004-01-01

This is the forty-eighth issue of the United Nations Statistical Yearbook, prepared by the Statistics Division, Department of Economic and Social Affairs of the United Nations Secretariat. It contains series covering, in general, 1990-1999 or 1991-2000, based on statistics available to the Statistics Division up to 15 December 2003. The major purpose of the Statistical Yearbook is to provide in a single volume a comprehensive compilation of internationally available statistics on social and economic conditions and activities, at world, regional and national levels, covering roughly a ten-year period. Most of the statistics presented in the Yearbook are extracted from more detailed, specialized publications prepared by the Statistics Division and by many other international statistical services. Thus, while the specialized publications concentrate on monitoring topics and trends in particular social and economic fields, the Statistical Yearbook tables provide data for a more comprehensive, overall description of social and economic structures, conditions, changes and activities. The objective has been to collect, systematize and coordinate the most essential components of comparable statistical information which can give a broad and, to the extent feasible, a consistent picture of social and economic processes at world, regional and national levels. More specifically, the Statistical Yearbook provides systematic information on a wide range of social and economic issues which are of concern in the United Nations system and among the governments and peoples of the world. A particular value of the Yearbook, but also its greatest challenge, is that these issues are extensively interrelated. Meaningful analysis of these issues requires systematization and coordination of the data across many fields. These issues include: General economic growth and related economic conditions; economic situation in developing countries and progress towards the objectives adopted for the
Statistical issues in the parton distribution analysis of the Tevatron jet data

International Nuclear Information System (INIS)

Alekhin, S.; Bluemlein, J.; Moch, S.O.; Hamburg Univ.

2012-11-01

We analyse a tension between the D0 and CDF inclusive jet data and the perturbative QCD calculations, which are based on the ABKM09 and ABM11 parton distribution functions (PDFs) within the nuisance parameter framework. Particular attention is paid to the uncertainties in the nuisance parameters due to the data fluctuations and the PDF errors. We show that, when these uncertainties are taken into account, the nuisance parameters do not demonstrate a statistically significant excess. A statistical bias of the estimator based on the nuisance parameters is also discussed.
Statistical methods for data analysis in particle physics

CERN Document Server

Lista, Luca

2017-01-01

This concise set of course-based notes provides the reader with the main concepts and tools needed to perform statistical analyses of experimental data, in particular in the field of high-energy physics (HEP). First, the book provides an introduction to probability theory and basic statistics, mainly intended as a refresher from readers’ advanced undergraduate studies, but also to help them clearly distinguish between the Frequentist and Bayesian approaches and interpretations in subsequent applications. More advanced concepts and applications are gradually introduced, culminating in the chapter on both discoveries and upper limits, as many applications in HEP concern hypothesis testing, where the main goal is often to provide better and better limits so as to eventually be able to distinguish between competing hypotheses, or to rule out some of them altogether. Many worked-out examples will help newcomers to the field and graduate students alike understand the pitfalls involved in applying theoretical co...
A statistical method for evaluation of the experimental phase equilibrium data of simple clathrate hydrates

DEFF Research Database (Denmark)

Eslamimanesh, Ali; Gharagheizi, Farhad; Mohammadi, Amir H.

2012-01-01

We, herein, present a statistical method for diagnostics of the outliers in phase equilibrium data (dissociation data) of simple clathrate hydrates. The applied algorithm is performed on the basis of the Leverage mathematical approach, in which the statistical Hat matrix, Williams Plot, and the r...... in exponential form is used to represent/predict the hydrate dissociation pressures for three-phase equilibrium conditions (liquid water/ice–vapor-hydrate). The investigated hydrate formers are methane, ethane, propane, carbon dioxide, nitrogen, and hydrogen sulfide. It is interpreted from the obtained results...
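The leverage idea mentioned in this abstract can be illustrated for the simplest case, a straight-line model with one input, where the diagonal of the Hat matrix has a closed form. This is a sketch under stated assumptions rather than the authors' procedure: the single-input model, the function names, the sample data, and the warning threshold 3(p + 1)/n (a convention often used with Williams plots, with p the number of model inputs and n the number of points) are all assumptions introduced for the example.

```python
def leverages(x):
    """Hat-matrix diagonal for the model y = a + b*x: 1/n + (x_i - mean)^2 / Sxx."""
    n = len(x)
    mean = sum(x) / n
    sxx = sum((xi - mean) ** 2 for xi in x)
    return [1.0 / n + (xi - mean) ** 2 / sxx for xi in x]


def flag_high_leverage(x, n_inputs=1):
    """Return indices whose leverage exceeds the assumed threshold 3(p + 1)/n."""
    h = leverages(x)
    h_star = 3.0 * (n_inputs + 1) / len(x)
    return [i for i, hi in enumerate(h) if hi > h_star]


# Hypothetical input values (e.g. temperatures); the last point lies far from the rest.
example_x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 30.0]
example_h = leverages(example_x)
example_flagged = flag_high_leverage(example_x)
```

A Williams plot pairs these leverage values with standardized residuals, so a full outlier diagnosis would also need the fitted model's residuals, which this sketch leaves out.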
Cosmology constraints from shear peak statistics in Dark Energy Survey Science Verification data

International Nuclear Information System (INIS)

Kacprzak, T.; Kirk, D.; Friedrich, O.; Amara, A.; Refregier, A.

2016-01-01

Shear peak statistics has gained a lot of attention recently as a practical alternative to the two-point statistics for constraining cosmological parameters. We perform a shear peak statistics analysis of the Dark Energy Survey (DES) Science Verification (SV) data, using weak gravitational lensing measurements from a 139 deg² field. We measure the abundance of peaks identified in aperture mass maps, as a function of their signal-to-noise ratio, in the signal-to-noise range 0 ... 4 would require significant corrections, which is why we do not include them in our analysis. We compare our results to the cosmological constraints from the two-point analysis on the SV field and find them to be in good agreement in both the central value and its uncertainty. Lastly, we discuss prospects for future peak statistics analysis with upcoming DES data.
SAS and R data management, statistical analysis, and graphics

CERN Document Server

Kleinman, Ken

2009-01-01

An All-in-One Resource for Using SAS and R to Carry out Common Tasks. Provides a path between languages that is easier than reading complete documentation. SAS and R: Data Management, Statistical Analysis, and Graphics presents an easy way to learn how to perform an analytical task in both SAS and R, without having to navigate through the extensive, idiosyncratic, and sometimes unwieldy software documentation. The book covers many common tasks, such as data management, descriptive summaries, inferential procedures, regression analysis, and the creation of graphics, along with more complex applicat

Using R for Data Management, Statistical Analysis, and Graphics

CERN Document Server

Horton, Nicholas J

2010-01-01

This title offers quick and easy access to key elements of documentation. It includes worked examples across a wide variety of applications, tasks, and graphics. "Using R for Data Management, Statistical Analysis, and Graphics" presents an easy way to learn how to perform an analytical task in R, without having to navigate through the extensive, idiosyncratic, and sometimes unwieldy software documentation and vast number of add-on packages. Organized by short, clear descriptive entries, the book covers many common tasks, such as data management, descriptive summaries, inferential proc
A flexible statistics web processing service--added value for information systems for experiment data.

Science.gov (United States)

Heimann, Dennis; Nieschulze, Jens; König-Ries, Birgitta

2010-04-20

Data management in the life sciences has evolved from simple storage of data to complex information systems providing additional functionalities like analysis and visualization capabilities, demanding the integration of statistical tools. In many cases the statistical tools used are hard-coded within the system. That makes integrating, substituting, or extending tools expensive, because all changes have to be made in program code. Other systems use generic solutions for tool integration, but adapting them to another system usually requires extensive work. This paper shows a way to provide statistical functionality over a statistics web service, which can be easily integrated in any information system and set up using XML configuration files. The statistical functionality is extendable by simply adding the description of a new application to a configuration file. The service architecture as well as the data exchange process between client and service and the adding of analysis applications to the underlying service provider are described. Furthermore, a practical example demonstrates the functionality of the service.
Statistical Analysis of Data with Non-Detectable Values

Energy Technology Data Exchange (ETDEWEB)

Frome, E.L.

2004-08-26

Environmental exposure measurements are, in general, positive and may be subject to left censoring, i.e. the measured value is less than a "limit of detection". In occupational monitoring, strategies for assessing workplace exposures typically focus on the mean exposure level or the probability that any measurement exceeds a limit. A basic problem of interest in environmental risk assessment is to determine if the mean concentration of an analyte is less than a prescribed action level. Parametric methods, used to determine acceptable levels of exposure, are often based on a two-parameter lognormal distribution. The mean exposure level and/or an upper percentile (e.g. the 95th percentile) are used to characterize exposure levels, and upper confidence limits are needed to describe the uncertainty in these estimates. In certain situations it is of interest to estimate the probability of observing a future (or "missed") value of a lognormal variable. Statistical methods for random samples (without non-detects) from the lognormal distribution are well known for each of these situations. In this report, methods for estimating these quantities based on the maximum likelihood method for randomly left censored lognormal data are described and graphical methods are used to evaluate the lognormal assumption. If the lognormal model is in doubt and an alternative distribution for the exposure profile of a similar exposure group is not available, then nonparametric methods for left censored data are used. The mean exposure level, along with the upper confidence limit, is obtained using the product limit estimate, and the upper confidence limit on the 95th percentile (i.e. the upper tolerance limit) is obtained using a nonparametric approach. All of these methods are well known but computational complexity has limited their use in routine data analysis with left censored data. The recent development of the R environment for statistical
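As a rough illustration of the maximum likelihood idea described in this abstract, the sketch below fits a two-parameter lognormal model to left-censored data. It is written in Python with SciPy rather than in R, it is not the report's code, and the function name `fit_censored_lognormal` and its argument layout are assumptions made for the example. Detected values contribute the normal density of their logarithm; non-detects contribute the probability of falling below their detection limit.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def fit_censored_lognormal(values, detected):
    """MLE of (mu, sigma) on the log scale for left-censored lognormal data.

    values:   measured value where detected, otherwise the detection limit
    detected: boolean array, False marks a non-detect
    """
    logs = np.log(np.asarray(values, dtype=float))
    detected = np.asarray(detected, dtype=bool)

    def neg_log_lik(theta):
        mu, log_sigma = theta
        sigma = np.exp(log_sigma)              # keeps sigma positive
        z = (logs - mu) / sigma
        # Detected: log-density of log(x); the 1/x Jacobian term does not
        # depend on the parameters and is left out.
        ll_det = norm.logpdf(z[detected]) - np.log(sigma)
        # Non-detects: log P(X < limit of detection).
        ll_cen = norm.logcdf(z[~detected])
        return -(ll_det.sum() + ll_cen.sum())

    start = np.array([logs.mean(), np.log(logs.std() + 1e-6)])
    res = minimize(neg_log_lik, start, method="Nelder-Mead")
    return res.x[0], float(np.exp(res.x[1]))
```

Under the lognormal model the mean exposure level follows from the fitted parameters as `exp(mu + sigma**2 / 2)`, and the 95th percentile as `exp(mu + 1.645 * sigma)`; confidence limits for either require additional steps that this sketch does not cover.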
Bias in iterative reconstruction of low-statistics PET data: benefits of a resolution model

Energy Technology Data Exchange (ETDEWEB)

Walker, M D; Asselin, M-C; Julyan, P J; Feldmann, M; Matthews, J C [School of Cancer and Enabling Sciences, Wolfson Molecular Imaging Centre, MAHSC, University of Manchester, Manchester M20 3LJ (United Kingdom); Talbot, P S [Mental Health and Neurodegeneration Research Group, Wolfson Molecular Imaging Centre, MAHSC, University of Manchester, Manchester M20 3LJ (United Kingdom); Jones, T, E-mail: matthew.walker@manchester.ac.uk [Academic Department of Radiation Oncology, Christie Hospital, University of Manchester, Manchester M20 4BX (United Kingdom)

2011-02-21

Iterative image reconstruction methods such as ordered-subset expectation maximization (OSEM) are widely used in PET. Reconstructions via OSEM are however reported to be biased for low-count data. We investigated this and considered the impact for dynamic PET. Patient listmode data were acquired in [¹¹C]DASB and [¹⁵O]H₂O scans on the HRRT brain PET scanner. These data were subsampled to create many independent, low-count replicates. The data were reconstructed and the images from low-count data were compared to the high-count originals (from the same reconstruction method). This comparison enabled low-statistics bias to be calculated for the given reconstruction, as a function of the noise-equivalent counts (NEC). Two iterative reconstruction methods were tested, one with and one without an image-based resolution model (RM). Significant bias was observed when reconstructing data of low statistical quality, for both subsampled human and simulated data. For human data, this bias was substantially reduced by including an RM. For [¹¹C]DASB the low-statistics bias in the caudate head at 1.7 M NEC (approx. 30 s) was -5.5% and -13% with and without RM, respectively. We predicted biases in the binding potential of -4% and -10%. For quantification of cerebral blood flow for the whole-brain grey- or white-matter, using [¹⁵O]H₂O and the PET autoradiographic method, a low-statistics bias of <2.5% and <4% was predicted for reconstruction with and without the RM. The use of a resolution model reduces low-statistics bias and can hence be beneficial for quantitative dynamic PET.
Study on loss detection algorithms for tank monitoring data using multivariate statistical analysis

International Nuclear Information System (INIS)

Suzuki, Mitsutoshi; Burr, Tom

2009-01-01

Evaluation of solution monitoring data to support material balance evaluation was proposed about a decade ago because of concerns regarding the large throughput planned at Rokkasho Reprocessing Plant (RRP). A numerical study using the simulation code (FACSIM) was done and significant increases in the detection probabilities (DP) for certain types of losses were shown. To be accepted internationally, it is very important to verify such claims using real solution monitoring data. However, a demonstrative study with real tank data has not been carried out due to the confidentiality of the tank data. This paper describes an experimental study that has been started using actual data from the Solution Measurement and Monitoring System (SMMS) in the Tokai Reprocessing Plant (TRP) and the Savannah River Site (SRS). Multivariate statistical methods, such as a vector cumulative sum and a multi-scale statistical analysis, have been applied to the real tank data that have superimposed simulated loss. Although quantitative conclusions have not been derived for the moment due to the difficulty of baseline evaluation, the multivariate statistical methods remain promising for abrupt and some types of protracted loss detection. (author)
Analysis of spectral data with rare events statistics

International Nuclear Information System (INIS)

Ilyushchenko, V.I.; Chernov, N.I.

1990-01-01

The case is considered of analyzing experimental data when the results of individual experimental runs cannot be summed due to large systematic errors. A statistical analysis of the hypothesis about the persistent peaks in the spectra has been performed by means of the Neyman-Pearson test. The computations demonstrate that the confidence level for the hypothesis about the presence of a persistent peak in the spectrum is proportional to the square root of the number of independent experimental runs, K. 5 refs
Data on education: from population statistics to epidemiological research

DEFF Research Database (Denmark)

Pallesen, Palle Bo; Tverborgvik, Torill; Rasmussen, Hanna Barbara

2010-01-01

BACKGROUND: Level of education is in many fields of research used as an indicator of social status. METHODS: Using Statistics Denmark's register for education and employment of the population, we examined highest completed education with a birth-cohort perspective focusing on people born between...... of population trends by use of extrapolated values, solutions are less obvious in epidemiological research using individual level data....
Novel Kalman filter algorithm for statistical monitoring of extensive landscapes with synoptic sensor data

Science.gov (United States)

Raymond L. Czaplewski

2015-01-01

Wall-to-wall remotely sensed data are increasingly available to monitor landscape dynamics over large geographic areas. However, statistical monitoring programs that use post-stratification cannot fully utilize those sensor data. The Kalman filter (KF) is an alternative statistical estimator. I develop a new KF algorithm that is numerically robust with large numbers of...
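For readers unfamiliar with the estimator, the sketch below shows the textbook scalar Kalman measurement update, in which a prior estimate and a new observation are combined with weights set by their variances. It is only the standard one-dimensional form, not the numerically robust algorithm developed in the paper, and the function name `kalman_update` is hypothetical.

```python
def kalman_update(x_prior, p_prior, z, r):
    """Textbook scalar Kalman measurement update.

    x_prior, p_prior: prior state estimate and its variance
    z, r:             new observation and its error variance
    """
    k = p_prior / (p_prior + r)            # Kalman gain, between 0 and 1
    x_post = x_prior + k * (z - x_prior)   # estimate pulled toward the observation
    p_post = (1.0 - k) * p_prior           # variance shrinks after the update
    return x_post, p_post, k
```

When the prior and the observation are equally uncertain the gain is 0.5 and the updated estimate lies midway between them; a noisier observation receives proportionally less weight.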
The United Nations recommendations and data efforts: international migration statistics.

Science.gov (United States)

Simmons, A B

1987-01-01

This article reviews the UN's efforts to improve international migration statistics. The review addresses the challenges faced by the UN, the direction in which this effort is going, gaps in the current approach, and priorities for future action. The content of the UN recommendations has changed in the past and seems to be moving toward further changes. At each stage, the direction of change corresponds broadly to earlier shifts in the overall context of world social-economic affairs and related transformations in international travel and migration patterns. Early (1953) objectives were vaguely stated in terms of social, economic, and demographic impacts of long term settlement. 1976 recommendations continued the focus on long term resettlement and, at the same time, gave more attention to at least 1 kind of short term (work-related) movement. Most recent recommendations have given more attention to other classes of short term travellers, such as refugees and contract workers. Recommendations on the measures and data sources have changed over time, also. The 1953 recommendations were limited to flow data from international border statistics. 1976 recommendations drew attention to stock data and the use of civil registration data to supplement border crossing data. Recent UN reflections recognize that the volume of border crossings has now reached the point where many countries simply refuse to gather data on all travellers, choosing instead to make estimates. It is implied that either sample surveys at border points and/or visas and entry permits may be the best way of counting various specific kinds of migrants. Future recommendations corresponding to contemporary and emerging concerns will require that the guidelines be restructured: 1) to give more explicit attention in international migration statistics to citizenship and access to political and welfare benefits; 2) to distinguish more carefully various sub-classes of movers; 3) to expand objectives of data
ROOT - A C++ Framework for Petabyte Data Storage, Statistical Analysis and Visualization

CERN Document Server

Naumann, Axel; Ballintijn, Maarten; Bellenot, Bertrand; Biskup, Marek; Brun, Rene; Buncic, Nenad; Canal, Philippe; Casadei, Diego; Couet, Olivier; Fine, Valery; Franco, Leandro; Ganis, Gerardo; Gheata, Andrei; Gonzalez Maline, David; Goto, Masaharu; Iwaszkiewicz, Jan; Kreshuk, Anna; Marcos Segura, Diego; Maunder, Richard; Moneta, Lorenzo; Offermann, Eddy; Onuchin, Valeriy; Panacek, Suzanne; Rademakers, Fons; Russo, Paul; Tadel, Matevz

2009-01-01

ROOT is an object-oriented C++ framework conceived in the high-energy physics (HEP) community, designed for storing and analyzing petabytes of data in an efficient way. Any instance of a C++ class can be stored into a ROOT file in a machine-independent compressed binary format. In ROOT the TTree object container is optimized for statistical data analysis over very large data sets by using vertical data storage techniques. These containers can span a large number of files on local disks, the web, or a number of different shared file systems. In order to analyze this data, the user can choose from a wide set of mathematical and statistical functions, including linear algebra classes, numerical algorithms such as integration and minimization, and various methods for performing regression analysis (fitting). In particular, the RooFit package allows the user to perform complex data modeling and fitting while the RooStats library provides abstractions and implementations for advance...
Statistical analysis of proteomics, metabolomics, and lipidomics data using mass spectrometry

CERN Document Server

Mertens, Bart

2017-01-01

This book presents an overview of computational and statistical design and analysis of mass spectrometry-based proteomics, metabolomics, and lipidomics data. This contributed volume provides an introduction to the special aspects of statistical design and analysis with mass spectrometry data for the new omic sciences. The text discusses common aspects of design and analysis between and across all (or most) forms of mass spectrometry, while also providing special examples of application with the most common forms of mass spectrometry. Also covered are applications of computational mass spectrometry not only in clinical study but also in the interpretation of omics data in plant biology studies. Omics research fields are expected to revolutionize biomolecular research by the ability to simultaneously profile many compounds within either patient blood, urine, tissue, or other biological samples. Mass spectrometry is one of the key analytical techniques used in these new omic sciences. Liquid chromatography mass ...
A statistically self-consistent type Ia supernova data analysis

International Nuclear Information System (INIS)

Lago, B.L.; Calvao, M.O.; Joras, S.E.; Reis, R.R.R.; Waga, I.; Giostri, R.

2011-01-01

Full text: The type Ia supernovae are one of the main cosmological probes nowadays and are used as standardized candles in distance measurements. The standardization processes, among which SALT2 and MLCS2k2 are the most used ones, are based on empirical relations and leave room for a residual dispersion in the light curves of the supernovae. This dispersion is introduced in the chi squared used to fit the parameters of the model in the expression for the variance of the data, as an attempt to quantify our ignorance in modeling the supernovae properly. The procedure used to assign a value to this dispersion is statistically inconsistent and excludes the possibility of comparing different cosmological models. In addition, the SALT2 light curve fitter introduces parameters on the model for the variance that are also used in the model for the data. In the chi squared statistics context the minimization of such a quantity yields, in the best case scenario, a bias. An iterative method has been developed in order to perform the minimization of this chi squared but it is not well grounded, although it is used by several groups. We propose an analysis of the type Ia supernovae data that is based on the likelihood itself and makes it possible to address both inconsistencies mentioned above in a straightforward way. (author)
Encoding Dissimilarity Data for Statistical Model Building.

Science.gov (United States)

Wahba, Grace

2010-12-01

We summarize, review and comment upon three papers which discuss the use of discrete, noisy, incomplete, scattered pairwise dissimilarity data in statistical model building. Convex cone optimization codes are used to embed the objects into a Euclidean space which respects the dissimilarity information while controlling the dimension of the space. A "newbie" algorithm is provided for embedding new objects into this space. This allows the dissimilarity information to be incorporated into a Smoothing Spline ANOVA penalized likelihood model, a Support Vector Machine, or any model that will admit Reproducing Kernel Hilbert Space components, for nonparametric regression, supervised learning, or semi-supervised learning. Future work and open questions are discussed. The papers are: F. Lu, S. Keles, S. Wright and G. Wahba 2005. A framework for kernel regularization with application to protein clustering. Proceedings of the National Academy of Sciences 102, 12332-1233. G. Corrada Bravo, G. Wahba, K. Lee, B. Klein, R. Klein and S. Iyengar 2009. Examining the relative influence of familial, genetic and environmental covariate information in flexible risk models. Proceedings of the National Academy of Sciences 106, 8128-8133. F. Lu, Y. Lin and G. Wahba. Robust manifold unfolding with kernel regularization. TR 1008, Department of Statistics, University of Wisconsin-Madison.
China Dimensions Data Collection: Agricultural Statistics of the People's Republic of China: 1949-1990

Data.gov (United States)

National Aeronautics and Space Administration — Agricultural Statistics of the People's Republic of China, 1949-1990 is an historical collection of agricultural statistical data compiled by China's State...
STATISTICS. The reusable holdout: Preserving validity in adaptive data analysis.

Science.gov (United States)

Dwork, Cynthia; Feldman, Vitaly; Hardt, Moritz; Pitassi, Toniann; Reingold, Omer; Roth, Aaron

2015-08-07

Misapplication of statistical data analysis is a common cause of spurious discoveries in scientific research. Existing approaches to ensuring the validity of inferences drawn from data assume a fixed procedure to be performed, selected before the data are examined. In common practice, however, data analysis is an intrinsically adaptive process, with new analyses generated on the basis of data exploration, as well as the results of previous analyses on the same data. We demonstrate a new approach for addressing the challenges of adaptivity based on insights from privacy-preserving data analysis. As an application, we show how to safely reuse a holdout data set many times to validate the results of adaptively chosen analyses. Copyright © 2015, American Association for the Advancement of Science.
Network Data: Statistical Theory and New Models

Science.gov (United States)

2016-02-17

and with environmental scientists at JPL and Emory University to retrieval from NASA MISR remote sensing images aerosol index AOD for air pollution ... Beijing, May, 2013 Beijing Statistics Forum, Beijing, May, 2013 Statistics Seminar, CREST-ENSAE, Paris, March, 2013 Statistics Seminar, University ... to retrieval from NASA MISR remote sensing images aerosol index AOD for air pollution monitoring and management. Satellite-retrieved Aerosol Optical
Statistical mechanics of complex neural systems and high dimensional data

International Nuclear Information System (INIS)

Advani, Madhu; Lahiri, Subhaneil; Ganguli, Surya

2013-01-01

Recent experimental advances in neuroscience have opened new vistas into the immense complexity of neuronal networks. This proliferation of data challenges us on two parallel fronts. First, how can we form adequate theoretical frameworks for understanding how dynamical network processes cooperate across widely disparate spatiotemporal scales to solve important computational problems? Second, how can we extract meaningful models of neuronal systems from high dimensional datasets? To aid in these challenges, we give a pedagogical review of a collection of ideas and theoretical methods arising at the intersection of statistical physics, computer science and neurobiology. We introduce the interrelated replica and cavity methods, which originated in statistical physics as powerful ways to quantitatively analyze large highly heterogeneous systems of many interacting degrees of freedom. We also introduce the closely related notion of message passing in graphical models, which originated in computer science as a distributed algorithm capable of solving large inference and optimization problems involving many coupled variables. We then show how both the statistical physics and computer science perspectives can be applied in a wide diversity of contexts to problems arising in theoretical neuroscience and data analysis. Along the way we discuss spin glasses, learning theory, illusions of structure in noise, random matrices, dimensionality reduction and compressed sensing, all within the unified formalism of the replica method. Moreover, we review recent conceptual connections between message passing in graphical models, and neural computation and learning. Overall, these ideas illustrate how statistical physics and computer science might provide a lens through which we can uncover emergent computational functions buried deep within the dynamical complexities of neuronal networks. (paper)
Statistical Analysis of Clinical Data on a Pocket Calculator, Part 2: Statistics on a Pocket Calculator, Part 2

CERN Document Server

Cleophas, Ton J

2012-01-01

The first part of this title contained all statistical tests relevant to starting clinical investigations, and included tests for continuous and binary data, power, sample size, multiple testing, variability, confounding, interaction, and reliability. The current part 2 of this title reviews methods for handling missing data, manipulated data, multiple confounders, predictions beyond observation, uncertainty of diagnostic tests, and the problems of outliers. Also robust tests, non-linear modeling, goodness of fit testing, Bhattacharya models, item response modeling, superiority testing, variab
Statistical analysis of management data

CERN Document Server

Gatignon, Hubert

2013-01-01

This book offers a comprehensive approach to multivariate statistical analyses. It provides theoretical knowledge of the concepts underlying the most important multivariate techniques and an overview of actual applications.
Securing co-operation from persons supplying statistical data

Science.gov (United States)

Aubenque, M. J.; Blaikley, R. M.; Harris, F. Fraser; Lal, R. B.; Neurdenburg, M. G.; Hernández, R. de Shelly

1954-01-01

Securing the co-operation of persons supplying information required for medical statistics is essentially a problem in human relations, and an understanding of the motivations, attitudes, and behaviour of the respondents is necessary. Before any new statistical survey is undertaken, it is suggested by Aubenque and Harris that a preliminary review be made so that the maximum use is made of existing information. Care should also be taken not to burden respondents with an overloaded questionnaire. Aubenque and Harris recommend simplified reporting. Complete population coverage is not necessary. Neurdenburg suggests that the co-operation and support of such organizations as medical associations and social security boards are important and that propaganda should be directed specifically to the groups whose co-operation is sought. Informal personal contacts are valuable and desirable, according to Blaikley, but may have adverse effects if the right kind of approach is not made. Financial payments as an incentive in securing co-operation are opposed by Neurdenburg, who proposes that only postage-free envelopes or similar small favours be granted. Blaikley and Harris, on the other hand, express the view that financial incentives may do much to gain the support of those required to furnish data; there are, however, other incentives, and full use should be made of the natural inclinations of respondents. Compulsion may be necessary in certain instances, but administrative rather than statutory measures should be adopted. Penalties, according to Aubenque, should be inflicted only when justified by imperative health requirements. The results of surveys should be made available as soon as possible to those who co-operated, and Aubenque and Harris point out that they should also be of practical value to the suppliers of the information. Greater co-operation can be secured from medical persons who have an understanding of the statistical principles involved; Aubenque and

A test statistic in the complex Wishart distribution and its application to change detection in polarimetric SAR data

DEFF Research Database (Denmark)

Conradsen, Knut; Nielsen, Allan Aasbjerg; Schou, Jesper

2003-01-01

. Based on this distribution, a test statistic for equality of two such matrices and an associated asymptotic probability for obtaining a smaller value of the test statistic are derived and applied successfully to change detection in polarimetric SAR data. In a case study, EMISAR L-band data from April 17...... to HH, VV, or HV data alone, the derived test statistic reduces to the well-known gamma likelihood-ratio test statistic. The derived test statistic and the associated significance value can be applied as a line or edge detector in fully polarimetric SAR data also....
A new statistic for the analysis of circular data in gamma-ray astronomy

Science.gov (United States)

Protheroe, R. J.

1985-01-01

A new statistic is proposed for the analysis of circular data. The statistic is designed specifically for situations where a test of uniformity is required which is powerful against alternatives in which a small fraction of the observations is grouped in a small range of directions, or phases.
Statistical analyses of the data on occupational radiation exposure at JPDR

International Nuclear Information System (INIS)

Kato, Shohei; Anazawa, Yutaka; Matsuno, Kenji; Furuta, Toshishiro; Akiyama, Isamu

1980-01-01

In the statistical analyses of the data on occupational radiation exposure at JPDR, statistical features were obtained as follows. (1) The individual doses followed a log-normal distribution. (2) In the distribution of doses from one job in the controlled area, the logarithm of the mean (μ) depended on the exposure rate r (mR/h), and σ correlated with the nature of the job and was normally distributed. These relations were as follows: μ = 0.48 ln r − 0.24, σ = 1.2 ± 0.58. (3) For the data containing different groups, the distribution of doses showed a polygonal line on log-normal probability paper. (4) Under the dose limitation, the distribution of the doses showed an asymptotic curve along the limit on log-normal probability paper. (author)
Quantile regression for the statistical analysis of immunological data with many non-detects.

Science.gov (United States)

Eilers, Paul H C; Röder, Esther; Savelkoul, Huub F J; van Wijk, Roy Gerth

2012-07-07

Immunological parameters are hard to measure. A well-known problem is the occurrence of values below the detection limit, the non-detects. Non-detects are a nuisance, because classical statistical analyses, like ANOVA and regression, cannot be applied. The more advanced statistical techniques currently available for the analysis of datasets with non-detects can only be used if a small percentage of the data are non-detects. Quantile regression, a generalization of percentiles to regression models, models the median or higher percentiles and tolerates very high numbers of non-detects. We present a non-technical introduction and illustrate it with an application to real data from a clinical trial. We show that by using quantile regression, groups can be compared and that meaningful linear trends can be computed, even if more than half of the data consists of non-detects. Quantile regression is a valuable addition to the statistical methods that can be used for the analysis of immunological datasets with non-detects.
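The reason percentiles tolerate non-detects can be seen in the simplest one-sample case, sketched below. This is not the authors' quantile regression implementation: it only computes a sample quantile, it assumes a single detection limit with every detected value at or above it, and the function name `quantile_with_nondetects` is hypothetical. As long as the requested quantile lies above the block of non-detects, its value does not depend on what the non-detects really were.

```python
import numpy as np

def quantile_with_nondetects(values, detected, lod, q=0.5):
    """Sample quantile when non-detects are only known to lie below `lod`.

    Returns None when the requested quantile falls inside the block of
    non-detects and therefore cannot be determined from the data.
    """
    values = np.asarray(values, dtype=float)
    detected = np.asarray(detected, dtype=bool)
    n = values.size
    k = int((~detected).sum())                 # number of non-detects
    idx = int(np.ceil(q * n)) - 1              # order statistic for quantile q
    if idx < k:
        return None                            # quantile is below the detection limit
    filled = np.sort(np.where(detected, values, lod))  # placeholder value is irrelevant
    return float(filled[idx])
```

With four non-detects among seven observations the median is not identified, but the 75th percentile still is, which mirrors the abstract's point that higher percentiles remain usable even when more than half of the data are non-detects.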
BIG-DATA and the Challenges for Statistical Inference and Economics Teaching and Learning

Directory of Open Access Journals (Sweden)

J.L. Peñaloza Figueroa

2017-04-01

The increasing automation in data collection, either in structured or unstructured formats, as well as the development of reading, concatenation and comparison algorithms and the growing analytical skills which characterize the era of Big Data, cannot be considered only a technological achievement; it is also an organizational, methodological and analytical challenge for knowledge, which is necessary to generate opportunities and added value. In fact, exploiting the potential of Big Data includes all fields of community activity; and given its ability to extract behaviour patterns, we are interested in the challenges for the field of teaching and learning, particularly in the field of statistical inference and economic theory. Big Data can improve the understanding of concepts, models and techniques used in both statistical inference and economic theory, and it can also generate reliable and robust short and long term predictions. These facts have led to the demand for analytical capabilities, which in turn encourages teachers and students to demand access to massive information produced by individuals, companies and public and private organizations in their transactions and interrelationships. Mass data (Big Data) is changing the way people access, understand and organize knowledge, which in turn is causing a shift in the approach to statistics and economics teaching, considering them as a real way of thinking rather than just operational and technical disciplines. Hence, the question is how teachers can use automated collection and analytical skills to their advantage when teaching statistics and economics; and whether it will lead to a change in what is taught and how it is taught.
Multivariate statistical analysis of major and trace element data for ...

African Journals Online (AJOL)

Multivariate statistical analysis of major and trace element data for niobium exploration in the peralkaline granites of the anorogenic ring-complex province of Nigeria. PO Ogunleye, EC Ike, I Garba. No abstract available. Journal of Mining and Geology Vol. 40(2) 2004: 107-117.
Exploring Foundation Concepts in Introductory Statistics Using Dynamic Data Points

Science.gov (United States)

Ekol, George

2015-01-01

This paper analyses introductory statistics students' verbal and gestural expressions as they interacted with a dynamic sketch (DS) designed using "Sketchpad" software. The DS involved numeric data points built on the number line whose values changed as the points were dragged along the number line. The study is framed on aggregate…
Multivariate statistical analysis of atom probe tomography data

International Nuclear Information System (INIS)

Parish, Chad M.; Miller, Michael K.

2010-01-01

The application of spectrum imaging multivariate statistical analysis methods, specifically principal component analysis (PCA), to atom probe tomography (APT) data has been investigated. The mathematical method of analysis is described and the results for two example datasets are analyzed and presented. The first dataset is from the analysis of a PM 2000 Fe-Cr-Al-Ti steel containing two different ultrafine precipitate populations. PCA properly describes the matrix and precipitate phases in a simple and intuitive manner. A second APT example is from the analysis of an irradiated reactor pressure vessel steel. Fine, nm-scale Cu-enriched precipitates having a core-shell structure were identified and qualitatively described by PCA. Advantages, disadvantages, and future prospects for implementing these data analysis methodologies for APT datasets, particularly with regard to quantitative analysis, are also discussed.
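As a rough illustration of the PCA step mentioned in the abstract above, the sketch below extracts the first principal component of a small data matrix by power iteration on the sample covariance matrix. It is a generic, minimal PCA in plain Python, not the authors' procedure (which may involve scaling or other preprocessing not described here); the function name `first_principal_component` and the two-channel `spectra` values are hypothetical.

```python
import math

def first_principal_component(rows, iterations=200):
    """Return (unit direction, explained-variance fraction) of the first PC."""
    n = len(rows)
    d = len(rows[0])
    # Center each column so the covariance is computed about the mean.
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration converges to the eigenvector with the largest eigenvalue.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient gives the eigenvalue for the converged unit vector.
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d))
                     for i in range(d))
    total_variance = sum(cov[i][i] for i in range(d))
    return v, eigenvalue / total_variance

# Hypothetical two-channel data: channel 2 is roughly twice channel 1.
spectra = [[1.0, 2.1], [2.0, 3.9], [3.0, 6.2], [4.0, 7.8], [5.0, 10.1]]
direction, explained = first_principal_component(spectra)
```

With these made-up values nearly all of the variance falls on a single component pointing along the "channel 2 is about twice channel 1" direction, which is the kind of compact description PCA is meant to provide.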
RADSS: an integration of GIS, spatial statistics, and network service for regional data mining

Science.gov (United States)

Hu, Haitang; Bao, Shuming; Lin, Hui; Zhu, Qing

2005-10-01

Regional data mining, which aims at the discovery of knowledge about spatial patterns, clusters or association between regions, has wide applications nowadays in the social sciences, such as sociology, economics, epidemiology, crime, and so on. Many applications in the regional or other social sciences are more concerned with the spatial relationship than with the precise geographical location. Based on the spatial continuity rule derived from Tobler's first law of geography (observations at two sites tend to be more similar to each other if the sites are close together than if far apart), spatial statistics, as an important means for spatial data mining, allow users to extract interesting and useful information such as spatial pattern, spatial structure, spatial association, spatial outlier and spatial interaction from the vast amount of spatial or non-spatial data. Therefore, by integrating spatial statistical methods, geographical information systems will become more powerful in gaining further insights into the nature of the spatial structure of regional systems, and help researchers to be more careful when selecting appropriate models. However, the lack of such tools holds back the application of spatial data analysis techniques and the development of new methods and models (e.g., spatio-temporal models). Herein, we make an attempt to develop such integrated software and apply it to the complex system analysis for the Poyang Lake Basin. This paper presents a framework for integrating GIS, spatial statistics and network service in regional data mining, as well as their implementation. After discussing the spatial statistics methods involved in regional complex system analysis, we introduce RADSS (Regional Analysis and Decision Support System), our new regional data mining tool, by integrating GIS, spatial statistics and network service. RADSS includes the functions of spatial data visualization, exploratory spatial data analysis, and
Statistical means to enhance the comparability of data within a pooled analysis of individual data in neurobehavioral toxicology

DEFF Research Database (Denmark)

Meyer-Baron, Monika; Schäper, Michael; Knapp, Guido

2011-01-01

Meta-analyses of individual participant data (IPD) provide important contributions to toxicological risk assessments. However, comparability of individual data cannot be taken for granted when information from different studies has to be summarized. By means of statistical standardization...
An improved method for statistical analysis of raw accelerator mass spectrometry data

International Nuclear Information System (INIS)

Gutjahr, A.; Phillips, F.; Kubik, P.W.; Elmore, D.

1987-01-01

Hierarchical statistical analysis is an appropriate method for statistical treatment of raw accelerator mass spectrometry (AMS) data. Using Monte Carlo simulations we show that this method yields more accurate estimates of isotope ratios and analytical uncertainty than the generally used propagation of errors approach. The hierarchical analysis is also useful in design of experiments because it can be used to identify sources of variability. 8 refs., 2 figs
Statistical interpretation of geochemical data

International Nuclear Information System (INIS)

Carambula, M.

1990-01-01

Statistical results were obtained from geochemical research covering the following four aerial photographs: Zapican, Carape, Las Canias, Alferez. In total, 3020 samples were analyzed for 22 chemical elements using plasma emission spectrometry methods.
The Development Data Book: A Guide to Social and Economic Statistics. Second Edition.

Science.gov (United States)

Sheram, Katherine

This data book presents statistics on countries with populations of more than one million. The statistics relate to economic development and the changes it is bringing about in the world. These statistics are measures of social and economic conditions in developing and industrial countries. Five indicators of economic development are presented,…
Using assemblage data in ecological indicators: A comparison and evaluation of commonly available statistical tools

Science.gov (United States)

Smith, Joseph M.; Mather, Martha E.

2012-01-01

Ecological indicators are science-based tools used to assess how human activities have impacted environmental resources. For monitoring and environmental assessment, existing species assemblage data can be used to make these comparisons through time or across sites. An impediment to using assemblage data, however, is that these data are complex and need to be simplified in an ecologically meaningful way. Because multivariate statistics are mathematical relationships, statistical groupings may not make ecological sense and will not have utility as indicators. Our goal was to define a process to select defensible and ecologically interpretable statistical simplifications of assemblage data in which researchers and managers can have confidence. For this, we chose a suite of statistical methods, compared the groupings that resulted from these analyses, identified convergence among groupings, then we interpreted the groupings using species and ecological guilds. When we tested this approach using a statewide stream fish dataset, not all statistical methods worked equally well. For our dataset, logistic regression (Log), detrended correspondence analysis (DCA), cluster analysis (CL), and non-metric multidimensional scaling (NMDS) provided consistent, simplified output. Specifically, the Log, DCA, CL-1, and NMDS-1 groupings were ≥60% similar to each other, overlapped with the fluvial-specialist ecological guild, and contained a common subset of species. Groupings based on number of species (e.g., Log, DCA, CL and NMDS) outperformed groupings based on abundance [e.g., principal components analysis (PCA) and Poisson regression]. Although the specific methods that worked on our test dataset have generality, here we are advocating a process (e.g., identifying convergent groupings with redundant species composition that are ecologically interpretable) rather than the automatic use of any single statistical tool. We summarize this process in step-by-step guidance for the
Statistical Methods for Unusual Count Data: Examples From Studies of Microchimerism

Science.gov (United States)

Guthrie, Katherine A.; Gammill, Hilary S.; Kamper-Jørgensen, Mads; Tjønneland, Anne; Gadi, Vijayakrishna K.; Nelson, J. Lee; Leisenring, Wendy

2016-01-01

Natural acquisition of small amounts of foreign cells or DNA, referred to as microchimerism, occurs primarily through maternal-fetal exchange during pregnancy. Microchimerism can persist long-term and has been associated with both beneficial and adverse human health outcomes. Quantitative microchimerism data present challenges for statistical analysis, including a skewed distribution, excess zero values, and occasional large values. Methods for comparing microchimerism levels across groups while controlling for covariates are not well established. We compared statistical models for quantitative microchimerism values, applied to simulated data sets and 2 observed data sets, to make recommendations for analytic practice. Modeling the level of quantitative microchimerism as a rate via Poisson or negative binomial model with the rate of detection defined as a count of microchimerism genome equivalents per total cell equivalents tested utilizes all available data and facilitates a comparison of rates between groups. We found that both the marginalized zero-inflated Poisson model and the negative binomial model can provide unbiased and consistent estimates of the overall association of exposure or study group with microchimerism detection rates. The negative binomial model remains the more accessible of these 2 approaches; thus, we conclude that the negative binomial model may be most appropriate for analyzing quantitative microchimerism data. PMID:27769989
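The rate formulation described in the abstract above (detections per total cell equivalents tested) can be sketched in a few lines. The example below uses invented counts and exposures, computes a pooled detection rate per group and a rate ratio, and adds a crude variance-to-mean check for overdispersion. It does not fit a negative binomial or zero-inflated model, and the variance-to-mean ratio ignores unequal exposures, so it is only a first look; all names and numbers are hypothetical.

```python
from statistics import mean, variance

# Hypothetical samples: (genome equivalents detected, total cell equivalents tested)
group_a = [(0, 50000), (0, 80000), (3, 60000), (0, 40000), (25, 70000)]
group_b = [(0, 50000), (1, 90000), (0, 60000), (0, 30000), (2, 70000)]

def pooled_rate(samples):
    """Detections per cell equivalent tested, pooling all samples in a group."""
    return sum(count for count, _ in samples) / sum(total for _, total in samples)

def dispersion_index(samples):
    """Variance-to-mean ratio of the raw counts; close to 1 under a Poisson model."""
    counts = [count for count, _ in samples]
    return variance(counts) / mean(counts)

rate_ratio = pooled_rate(group_a) / pooled_rate(group_b)
overdispersed = dispersion_index(group_a) > 1
```

A dispersion index well above 1, as produced by the many zeros and one large value in `group_a`, is the usual informal signal that a plain Poisson model is too restrictive and that a negative binomial model deserves consideration.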
Evaluation of Solid Rocket Motor Component Data Using a Commercially Available Statistical Software Package

Science.gov (United States)

Stefanski, Philip L.

2015-01-01

Commercially available software packages today allow users to quickly perform the routine evaluations of (1) descriptive statistics to numerically and graphically summarize both sample and population data, (2) inferential statistics that draws conclusions about a given population from samples taken of it, (3) probability determinations that can be used to generate estimates of reliability allowables, and finally (4) the setup of designed experiments and analysis of their data to identify significant material and process characteristics for application in both product manufacturing and performance enhancement. This paper presents examples of analysis and experimental design work that has been conducted using Statgraphics® statistical software to obtain useful information with regard to solid rocket motor propellants and internal insulation material. Data were obtained from a number of programs (Shuttle, Constellation, and Space Launch System) and sources that include solid propellant burn rate strands, tensile specimens, sub-scale test motors, full-scale operational motors, rubber insulation specimens, and sub-scale rubber insulation analog samples. Besides facilitating the experimental design process to yield meaningful results, statistical software has demonstrated its ability to quickly perform complex data analyses and yield significant findings that might otherwise have gone unnoticed. One caveat to these successes is that useful results not only derive from the inherent power of the software package, but also from the skill and understanding of the data analyst.
Time-Dependent Statistical Analysis of Wide-Area Time-Synchronized Data

Directory of Open Access Journals (Sweden)

A. R. Messina

2010-01-01

Characterization of spatial and temporal changes in the dynamic patterns of a nonstationary process is a problem of great theoretical and practical importance. On-line monitoring of large-scale power systems by means of time-synchronized Phasor Measurement Units (PMUs) provides the opportunity to analyze and characterize inter-system oscillations. Wide-area measurement sets, however, are often relatively large, and may contain phenomena with differing temporal scales. Extracting the relevant dynamics from these measurements is a difficult problem. As the number of observations of real events continues to increase, statistical techniques are needed to help distinguish relevant temporal dynamics from noise or random effects in measured data. In this paper, a statistically based, data-driven framework that integrates the use of wavelet-based EOF analysis and a sliding window-based method is proposed to identify and extract, in near-real-time, dynamically independent spatiotemporal patterns from time-synchronized data. The method deals with the information in space and time simultaneously, and allows direct tracking and characterization of the nonstationary time-frequency dynamics of oscillatory processes. The efficiency and accuracy of the developed procedures for extracting localized information on power system behavior from time-synchronized phasor measurements of a real event in Mexico are assessed.
Statistical Analysis of Data for Timber Strengths

DEFF Research Database (Denmark)

Sørensen, John Dalsgaard; Hoffmeyer, P.

Statistical analyses are performed for material strength parameters from approximately 6700 specimens of structural timber. Non-parametric statistical analyses and fits to the following distributions types have been investigated: Normal, Lognormal, 2 parameter Weibull and 3-parameter Weibull...
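To make the idea of comparing distribution fits concrete, the sketch below fits a Normal and a Lognormal distribution by maximum likelihood and compares their maximized log-likelihoods. The Weibull fits named in the abstract are omitted because they need numerical optimization. The strength values are invented and deliberately right-skewed; the abstract does not say which fitting or comparison method the authors used, so this is only one plausible approach.

```python
import math
from statistics import fmean

def normal_loglik(xs):
    """Maximized Normal log-likelihood (mean and variance set to their MLEs)."""
    n = len(xs)
    mu = fmean(xs)
    var = sum((x - mu) ** 2 for x in xs) / n  # MLE variance divides by n
    return -0.5 * n * (math.log(2 * math.pi * var) + 1)

def lognormal_loglik(xs):
    """Maximized Lognormal log-likelihood for strictly positive data."""
    logs = [math.log(x) for x in xs]
    # Lognormal log-density = Normal log-density of log(x) minus log(x).
    return normal_loglik(logs) - sum(logs)

# Hypothetical strength values, chosen to be right-skewed.
strengths = [10.0, 20.0, 40.0, 80.0, 160.0]
ll_normal = normal_loglik(strengths)
ll_lognormal = lognormal_loglik(strengths)
better_fit = "lognormal" if ll_lognormal > ll_normal else "normal"
```

Both models have two parameters here, so the log-likelihoods can be compared directly; comparing against a 3-parameter Weibull would call for a penalized criterion such as AIC.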
Statistical analyses of the magnet data for the advanced photon source storage ring magnets

International Nuclear Information System (INIS)

Kim, S.H.; Carnegie, D.W.; Doose, C.; Hogrefe, R.; Kim, K.; Merl, R.

1995-01-01

The statistics of the measured magnetic data of 80 dipole, 400 quadrupole, and 280 sextupole magnets of conventional resistive designs for the APS storage ring is summarized. In order to accommodate the vacuum chamber, the curved dipole has a C-type cross section and the quadrupole and sextupole cross sections have 180 degrees and 120 degrees symmetries, respectively. The data statistics include the integrated main fields, multipole coefficients, magnetic and mechanical axes, and roll angles of the main fields. The average and rms values of the measured magnet data meet the storage ring requirements
Statistical Projections for Multi-resolution, Multi-dimensional Visual Data Exploration and Analysis

Energy Technology Data Exchange (ETDEWEB)

Nguyen, Hoa T. [Univ. of Utah, Salt Lake City, UT (United States); Stone, Daithi [Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States); Bethel, E. Wes [Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)

2016-01-01

An ongoing challenge in visual exploration and analysis of large, multi-dimensional datasets is how to present useful, concise information to a user for some specific visualization tasks. Typical approaches to this problem have proposed either reduced-resolution versions of data, or projections of data, or both. These approaches still have some limitations such as consuming high computation or suffering from errors. In this work, we explore the use of a statistical metric as the basis for both projections and reduced-resolution versions of data, with a particular focus on preserving one key trait in data, namely variation. We use two different case studies to explore this idea, one that uses a synthetic dataset, and another that uses a large ensemble collection produced by an atmospheric modeling code to study long-term changes in global precipitation. The primary findings of our work are that in terms of preserving the variation signal inherent in data, that using a statistical measure more faithfully preserves this key characteristic across both multi-dimensional projections and multi-resolution representations than a methodology based upon averaging.
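The contrast drawn in the abstract above, between reduction by averaging and reduction by a statistic that tracks variation, can be shown on a toy signal. In the sketch below the standard deviation is an assumed stand-in for the authors' statistical metric, which the abstract does not name; the `signal` values and the helper `reduce_resolution` are hypothetical.

```python
from statistics import fmean, pstdev

def reduce_resolution(values, block, statistic):
    """Collapse each consecutive block of samples to a single value."""
    return [statistic(values[i:i + block]) for i in range(0, len(values), block)]

# Hypothetical signal: a flat first half and a strongly oscillating second half.
signal = [5, 5, 5, 5, 1, 9, 1, 9]

means = reduce_resolution(signal, 4, fmean)     # averaging-based reduction
spreads = reduce_resolution(signal, 4, pstdev)  # variation-based reduction
```

Both halves average to the same value, so the averaged version cannot tell them apart, while the per-block standard deviation separates the flat half from the oscillating one.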

Linear regression models and k-means clustering for statistical analysis of fNIRS data.

Science.gov (United States)

Bonomini, Viola; Zucchelli, Lucia; Re, Rebecca; Ieva, Francesca; Spinelli, Lorenzo; Contini, Davide; Paganoni, Anna; Torricelli, Alessandro

2015-02-01

We propose a new algorithm, based on a linear regression model, to statistically estimate the hemodynamic activations in fNIRS data sets. The main concern guiding the algorithm development was the minimization of assumptions and approximations made on the data set for the application of statistical tests. Further, we propose a K-means method to cluster fNIRS data (i.e. channels) as activated or not activated. The methods were validated both on simulated and in vivo fNIRS data. A time domain (TD) fNIRS technique was preferred because of its high performances in discriminating cortical activation and superficial physiological changes. However, the proposed method is also applicable to continuous wave or frequency domain fNIRS data sets.
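A much simplified version of the two-stage idea in the abstract above (a regression estimate per channel, then clustering channels into activated and not activated) might look like the following. It is not the authors' algorithm: it uses an ordinary least-squares slope against a hypothetical boxcar task regressor and a one-dimensional 2-means initialised at the extremes, and every channel name and value is invented.

```python
from statistics import fmean

def ols_slope(x, y):
    """Least-squares slope of y on x (simple linear regression)."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def two_means(values, iterations=20):
    """1-D k-means with k=2; returns True for points in the higher cluster."""
    lo, hi = min(values), max(values)
    labels = [False] * len(values)
    for _ in range(iterations):
        labels = [abs(v - hi) < abs(v - lo) for v in values]
        high = [v for v, flag in zip(values, labels) if flag]
        low = [v for v, flag in zip(values, labels) if not flag]
        if not high or not low:
            break  # all values fell into one cluster
        lo, hi = fmean(low), fmean(high)
    return labels

# Hypothetical task regressor (rest = 0, task = 1) and three channels.
task = [0, 0, 1, 1, 0, 0, 1, 1]
channels = {
    "ch1": [0.0, 0.1, 1.0, 1.1, 0.0, 0.1, 1.0, 1.1],
    "ch2": [0.1, 0.0, 0.1, 0.0, 0.1, 0.0, 0.1, 0.0],
    "ch3": [0.0, 0.0, 0.8, 0.9, 0.1, 0.0, 0.9, 0.8],
}
betas = {name: ols_slope(task, y) for name, y in channels.items()}
activated = dict(zip(betas, two_means(list(betas.values()))))
```

Clustering the regression coefficients rather than thresholding them avoids picking a cut-off by hand, which appears to be in the spirit of the abstract's aim of minimizing assumptions.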
ROOT: A C++ framework for petabyte data storage, statistical analysis and visualization

International Nuclear Information System (INIS)

Antcheva, I.; Ballintijn, M.; Bellenot, B.; Biskup, M.; Brun, R.; Buncic, N.; Couet, O.; Franco, L.; Canal, Ph.; Casadei, D.; Fine, V.

2009-01-01

ROOT is an object-oriented C++ framework conceived in the high-energy physics (HEP) community, designed for storing and analyzing petabytes of data in an efficient way. Any instance of a C++ class can be stored into a ROOT file in a machine-independent compressed binary format. In ROOT the TTree object container is optimized for statistical data analysis over very large data sets by using vertical data storage techniques. These containers can span a large number of files on local disks, the web or a number of different shared file systems. In order to analyze this data, the user can choose from a wide set of mathematical and statistical functions, including linear algebra classes, numerical algorithms such as integration and minimization, and various methods for performing regression analysis (fitting). In particular, the RooFit package allows the user to perform complex data modeling and fitting while the RooStats library provides abstractions and implementations for advanced statistical tools. Multivariate classification methods based on machine learning techniques are available via the TMVA package. Central to these analysis tools are the histogram classes, which provide binning of one- and multi-dimensional data. Results can be saved in high-quality graphical formats like Postscript and PDF or in bitmap formats like JPG or GIF. The result can also be stored into ROOT macros that allow a full recreation and rework of the graphics. Users typically create their analysis macros step by step, making use of the interactive C++ interpreter CINT, while running over small data samples. Once the development is finished, they can run these macros at full compiled speed over large data sets, using on-the-fly compilation, or by creating a stand-alone batch program. Finally, if processing farms are available, the user can reduce the execution time of intrinsically parallel tasks - e.g. data mining in HEP - by using PROOF, which will take care of optimally
Data and statistical methods for analysis of trends and patterns

International Nuclear Information System (INIS)

Atwood, C.L.; Gentillon, C.D.; Wilson, G.E.

1992-11-01

This report summarizes topics considered at a working meeting on data and statistical methods for analysis of trends and patterns in US commercial nuclear power plants. This meeting was sponsored by the Office of Analysis and Evaluation of Operational Data (AEOD) of the Nuclear Regulatory Commission (NRC). Three data sets are briefly described: Nuclear Plant Reliability Data System (NPRDS), Licensee Event Report (LER) data, and Performance Indicator data. Two types of study are emphasized: screening studies, to see if any trends or patterns appear to be present; and detailed studies, which are more concerned with checking the analysis assumptions, modeling any patterns that are present, and searching for causes. A prescription is given for a screening study, and ideas are suggested for a detailed study, when the data take any of three forms: counts of events per time, counts of events per demand, and non-event data.
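For the "counts of events per time" case, one common screening check is a Pearson chi-square test of whether a single constant event rate is plausible across periods. The sketch below is an assumed illustration of that generic check, not the prescription given in the report; the yearly counts and exposures are invented, and the critical value 9.488 is the standard 95th percentile of the chi-square distribution with 4 degrees of freedom.

```python
def poisson_homogeneity_statistic(counts, exposures):
    """Pearson chi-square statistic for a constant event rate across periods."""
    rate = sum(counts) / sum(exposures)        # pooled rate estimate
    expected = [rate * t for t in exposures]   # expected count per period
    return sum((o - e) ** 2 / e for o, e in zip(counts, expected))

# Hypothetical data: events per year over five years of equal exposure.
counts = [3, 5, 6, 12, 14]
exposures = [1.0] * 5

stat = poisson_homogeneity_statistic(counts, exposures)
CRITICAL_95_DF4 = 9.488  # chi-square 0.95 quantile, df = number of periods - 1
possible_trend = stat > CRITICAL_95_DF4
```

A flagged result only says that a constant rate looks doubtful; in the report's terms it would motivate a detailed study that checks assumptions and models the pattern.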
Descriptive statistics.

Science.gov (United States)

Nick, Todd G

2007-01-01

Statistics is defined by the Medical Subject Headings (MeSH) thesaurus as the science and art of collecting, summarizing, and analyzing data that are subject to random variation. The two broad categories of summarizing and analyzing data are referred to as descriptive and inferential statistics. This chapter considers the science and art of summarizing data where descriptive statistics and graphics are used to display data. In this chapter, we discuss the fundamentals of descriptive statistics, including describing qualitative and quantitative variables. For describing quantitative variables, measures of location and spread, for example the standard deviation, are presented along with graphical presentations. We also discuss distributions of statistics, for example the variance, as well as the use of transformations. The concepts in this chapter are useful for uncovering patterns within the data and for effectively presenting the results of a project.
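The measures of location and spread discussed in the abstract above are available in Python's standard `statistics` module. The following minimal sketch summarizes a small invented sample; the variable names and values are illustrative only.

```python
import statistics

# Hypothetical quantitative variable.
values = [2, 4, 4, 5, 7, 9, 11]

summary = {
    "mean": statistics.mean(values),              # location
    "median": statistics.median(values),          # robust location
    "stdev": statistics.stdev(values),            # sample standard deviation
    "quartiles": statistics.quantiles(values, n=4),  # three cut points
}
```

Reporting the median and quartiles alongside the mean and standard deviation is useful when the data are skewed, since the latter pair is sensitive to extreme values.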
Metaviz: interactive statistical and visual analysis of metagenomic data.

Science.gov (United States)

Wagner, Justin; Chelaru, Florin; Kancherla, Jayaram; Paulson, Joseph N; Zhang, Alexander; Felix, Victor; Mahurkar, Anup; Elmqvist, Niklas; Corrada Bravo, Héctor

2018-04-06

Large studies profiling microbial communities and their association with healthy or disease phenotypes are now commonplace. Processed data from many of these studies are publicly available but significant effort is required for users to effectively organize, explore and integrate it, limiting the utility of these rich data resources. Effective integrative and interactive visual and statistical tools to analyze many metagenomic samples can greatly increase the value of these data for researchers. We present Metaviz, a tool for interactive exploratory data analysis of annotated microbiome taxonomic community profiles derived from marker gene or whole metagenome shotgun sequencing. Metaviz is uniquely designed to address the challenge of browsing the hierarchical structure of metagenomic data features while rendering visualizations of data values that are dynamically updated in response to user navigation. We use Metaviz to provide the UMD Metagenome Browser web service, allowing users to browse and explore data for more than 7000 microbiomes from published studies. Users can also deploy Metaviz as a web service, or use it to analyze data through the metavizr package to interoperate with state-of-the-art analysis tools available through Bioconductor. Metaviz is free and open source with the code, documentation and tutorials publicly accessible.
Data Analysis & Statistical Methods for Command File Errors

Science.gov (United States)

Meshkat, Leila; Waggoner, Bruce; Bryant, Larry

2014-01-01

This paper explains current work on modeling for managing the risk of command file errors. It is focused on analyzing actual data from a JPL spaceflight mission to build models for evaluating and predicting error rates as a function of several key variables. We constructed a rich dataset by considering the number of errors, the number of files radiated, including the number of commands and blocks in each file, as well as subjective estimates of workload and operational novelty. We have assessed these data using different curve fitting and distribution fitting techniques, such as multiple regression analysis and maximum likelihood estimation, to see how much of the variability in the error rates can be explained with these. We have also used goodness of fit testing strategies and principal component analysis to further assess our data. Finally, we constructed a model of expected error rates based on what these statistics bore out as critical drivers of the error rate. This model allows project management to evaluate the error rate against a theoretically expected rate as well as anticipate future error rates.
Investigating spousal concordance of diabetes through statistical analysis and data mining.

Directory of Open Access Journals (Sweden)

Jong-Yi Wang

Spousal clustering of diabetes merits attention. Whether old-age vulnerability or a shared family environment determines the concordance of diabetes is also uncertain. This study investigated the spousal concordance of diabetes and compared the risk of diabetes concordance between couples and noncouples by using nationally representative data. A total of 22,572 individuals identified from the 2002-2013 National Health Insurance Research Database of Taiwan constituted 5,643 couples and 5,643 noncouples through 1:1 dual propensity score matching (PSM). Factors associated with concordance in both spouses with diabetes were analyzed at the individual level. The risk of diabetes concordance between couples and noncouples was compared at the couple level. Logistic regression was the main statistical method. Statistical data were analyzed using SAS 9.4. The C&RT and Apriori data mining algorithms, run in IBM SPSS Modeler 13, supplemented the statistical analysis. High odds of the spousal concordance of diabetes were associated with old age, middle levels of urbanization, and high comorbidities (all P < 0.05). The dual PSM analysis revealed that the risk of diabetes concordance was significantly higher in couples (5.19%) than in noncouples (0.09%; OR = 61.743, P < 0.0001). A high concordance rate of diabetes in couples may indicate the influences of assortative mating and shared environment. Diabetes in a spouse implicates its risk in the partner. Family-based diabetes care that emphasizes the screening of couples at risk of diabetes by using the identified risk factors is suggested in prospective clinical practice interventions.
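As a plausibility check on the reported odds ratio, a crude (unadjusted) odds ratio can be recomputed from the figures in the abstract above. The concordant counts below are back-calculated from the rounded percentages and are therefore assumptions, not data from the study, so the result should land near the reported 61.743 without matching it exactly.

```python
def odds_ratio(events_a, n_a, events_b, n_b):
    """Crude odds ratio of group A versus group B."""
    odds_a = events_a / (n_a - events_a)
    odds_b = events_b / (n_b - events_b)
    return odds_a / odds_b

n_pairs = 5643  # couples and noncouples, as stated in the abstract

# Assumed counts, reconstructed from the rounded percentages 5.19% and 0.09%.
concordant_couples = round(0.0519 * n_pairs)
concordant_noncouples = round(0.0009 * n_pairs)

crude_or = odds_ratio(concordant_couples, n_pairs,
                      concordant_noncouples, n_pairs)
```

With so few concordant noncouple pairs, the estimate is very sensitive to rounding: one pair more or fewer in the comparison group shifts the odds ratio by roughly ten or more.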
The Statistic Test on Influence of Surface Treatment to Fatigue Lifetime with Limited Data

OpenAIRE

Suhartono, Agus

2009-01-01

Justifying the influence of two or more parameters on fatigue strength is sometimes problematic due to the scattered nature of fatigue data. Statistical tests can facilitate the evaluation of whether the changes in material characteristics resulting from specific parameters of interest are significant. The statistical tests were applied to fatigue data of AISI 1045 steel specimens. The specimens consisted of as-received specimens, shot-peened specimens with 15 and 16 Almen intensity as ...
Scripts for TRUMP data analyses. Part II (HLA-related data): statistical analyses specific for hematopoietic stem cell transplantation.

Science.gov (United States)

Kanda, Junya

2016-01-01

The Transplant Registry Unified Management Program (TRUMP) made it possible for members of the Japan Society for Hematopoietic Cell Transplantation (JSHCT) to analyze large sets of national registry data on autologous and allogeneic hematopoietic stem cell transplantation. However, as the processes used to collect transplantation information are complex and differed over time, the background of these processes should be understood when using TRUMP data. Previously, information on the HLA locus of patients and donors had been collected using a questionnaire-based free-description method, resulting in some input errors. To correct minor but significant errors and provide accurate HLA matching data, the use of a Stata or EZR/R script offered by the JSHCT is strongly recommended when analyzing HLA data in the TRUMP dataset. The HLA mismatch direction, mismatch counting method, and different impacts of HLA mismatches by stem cell source are other important factors in the analysis of HLA data. Additionally, researchers should understand the statistical analyses specific for hematopoietic stem cell transplantation, such as competing risk, landmark analysis, and time-dependent analysis, to correctly analyze transplant data. The data center of the JSHCT can be contacted if statistical assistance is required.
Assessing Research Data Deposits and Usage Statistics within IDEALS

Directory of Open Access Journals (Sweden)

Christie A. Wiley

2017-12-01

Objectives: This study follows up on previous work that began examining data deposited in an institutional repository. The work here extends the earlier study by answering the following lines of research questions: (1) What is the file composition of datasets ingested into the University of Illinois at Urbana-Champaign (UIUC) campus repository? Are datasets more likely to be single-file or multiple-file items? (2) What is the usage data associated with these datasets? Which items are most popular? Methods: The dataset records collected in this study were identified by filtering item types categorized as “data” or “dataset” using the advanced search function in IDEALS. Returned search results were collected in an Excel spreadsheet to include data such as the Handle identifier, date ingested, file formats, composition code, and the download count from the item’s statistics report. The Handle identifier represents the dataset record’s persistent identifier. Composition represents codes that categorize items as single or multiple file deposits. Date available represents the date the dataset record was published in the campus repository. Download statistics were collected via a website link for each dataset record and indicate the number of times the dataset record has been downloaded. Once the data was collected, it was used to evaluate datasets deposited into IDEALS. Results: A total of 522 datasets were identified for analysis covering the period between January 2007 and August 2016. This study revealed two influxes occurring during the period of 2008-2009 and in 2014. During the first timeframe a large number of PDFs were deposited by the Illinois Department of Agriculture, whereas Microsoft Excel files were deposited in 2014 by the Rare Books and Manuscript Library. Single-file datasets clearly dominate the deposits in the campus repository. The total download count for all datasets was 139,663 and the average downloads per month per
Statistical analysis of environmental dose data for Trombay environment

International Nuclear Information System (INIS)

Kale, M.S.; Padmanabhan, N.; Rekha Kutty, R.; Sharma, D.N.; Iyengar, T.S.; Iyer, M.R.

1993-01-01

The microprocessor-based environmental dose logging system has been functioning at six stations at Trombay for the past couple of years. The site emergency control centre (SECC) at the modular laboratory receives telemetered data every five minutes from the main guard house (South Site), Bhabha point (top of the hill), Cirus reactor, Mod Lab terrace, Hall No. 7 and Training School Hostel. The data collected are being stored in dBASE III+ format for easy processing in a PC. Various statistical parameters and distributions of environmental gamma dose are determined from the hourly dose data. On the basis of the reactor operation status, an attempt has been made to separate the natural background and the gamma dose contribution due to the operating research reactors at each of these monitoring stations. Similar investigations are being carried out for the Tarapur environment. (author). 2 refs., 3 tabs., 2 figs
Understanding spatial organizations of chromosomes via statistical analysis of Hi-C data

Science.gov (United States)

Hu, Ming; Deng, Ke; Qin, Zhaohui; Liu, Jun S.

2015-01-01

Understanding how chromosomes fold provides insights into the transcription regulation, hence, the functional state of the cell. Using the next generation sequencing technology, the recently developed Hi-C approach enables a global view of spatial chromatin organization in the nucleus, which substantially expands our knowledge about genome organization and function. However, due to multiple layers of biases, noises and uncertainties buried in the protocol of Hi-C experiments, analyzing and interpreting Hi-C data poses great challenges, and requires novel statistical methods to be developed. This article provides an overview of recent Hi-C studies and their impacts on biomedical research, describes major challenges in statistical analysis of Hi-C data, and discusses some perspectives for future research. PMID:26124977
ODM Data Analysis-A tool for the automatic validation, monitoring and generation of generic descriptive statistics of patient data.

Science.gov (United States)

Brix, Tobias Johannes; Bruland, Philipp; Sarfraz, Saad; Ernsting, Jan; Neuhaus, Philipp; Storck, Michael; Doods, Justin; Ständer, Sonja; Dugas, Martin

2018-01-01

A required step in presenting the results of clinical studies is the declaration of participants' demographic and baseline characteristics, as required by FDAAA 801. The common workflow to accomplish this task is to export the clinical data from the electronic data capture system in use and import it into statistical software such as SAS or IBM SPSS. This software requires trained users, who have to implement the analysis individually for each item. These expenditures may become an obstacle for small studies. The objective of this work is to design, implement and evaluate an open source application, called ODM Data Analysis, for the semi-automatic analysis of clinical study data. The system requires clinical data in the CDISC Operational Data Model format. After the file is uploaded, its syntax and the data type conformity of the collected data are validated. The completeness of the study data is determined and basic statistics, including illustrative charts for each item, are generated. Datasets from four clinical studies have been used to evaluate the application's performance and functionality. The system is implemented as an open source web application (available at https://odmanalysis.uni-muenster.de) and is also provided as a Docker image, which enables easy distribution and installation on local systems. Study data are stored in the application only for as long as the calculations are being performed, which is compliant with data protection efforts. Analysis times are below half an hour, even for larger studies with over 6000 subjects. Medical experts have confirmed the usefulness of this application for obtaining an overview of their collected study data for monitoring purposes and for generating descriptive statistics without further user interaction. The semi-automatic analysis has its limitations and cannot replace the complex analysis of statisticians, but it can be used as a starting point for their examination and reporting.
Vector-field statistics for the analysis of time varying clinical gait data.

Science.gov (United States)

Donnelly, C J; Alexander, C; Pataky, T C; Stannage, K; Reid, S; Robinson, M A

2017-01-01

In clinical settings, the time varying analysis of gait data relies heavily on the experience of the individual(s) assessing these biological signals. Though three dimensional kinematics are recognised as time varying waveforms (1D), exploratory statistical analysis of these data is commonly carried out with multiple discrete or 0D dependent variables. In the absence of an a priori 0D hypothesis, clinicians are at risk of making type I and II errors in their analysis of time varying gait signatures in the event statistics are used in concert with preferred subjective clinical assessment methods. The aim of this communication was to determine if vector field waveform statistics were capable of providing quantitative corroboration to practically significant differences in time varying gait signatures as determined by two clinically trained gait experts. The case study was a left hemiplegic Cerebral Palsy (GMFCS I) gait patient following a botulinum toxin (BoNT-A) injection to their left gastrocnemius muscle. When subjective clinical gait assessments were compared between two testers, they were in agreement with each other for 61% of the joint degrees of freedom and phases of motion analysed. Tester 1 and tester 2 were in agreement with the vector-field analysis for 78% and 53% of the kinematic variables analysed, respectively. When the subjective analyses of tester 1 and tester 2 were pooled together and then compared to the vector-field analysis, they were in agreement for 83% of the time varying kinematic variables analysed. These outcomes demonstrate that, in principle, vector-field statistics corroborate what a team of clinical gait experts would classify as practically meaningful pre- versus post time varying kinematic differences. The potential for vector-field statistics to be used as a useful clinical tool for the objective analysis of time varying clinical gait data is established. Future research is recommended to assess the usefulness of vector-field analyses
Statistical evaluation of Pacific Northwest Residential Energy Consumption Survey weather data

Energy Technology Data Exchange (ETDEWEB)

Tawil, J.J.

1986-02-01

This report addresses an issue relating to energy consumption and conservation in the residential sector. BPA has obtained two meteorological data bases for use with its 1983 Pacific Northwest Residential Energy Survey (PNWRES). One data base consists of temperature data from weather stations; these have been aggregated to form a second data base that covers the National Oceanic and Atmospheric Administration (NOAA) climatic divisions. At BPA's request, Pacific Northwest Laboratory has produced a household energy use model for both electricity and natural gas in order to determine whether the statistically estimated parameters of the model differ significantly when the two different meteorological data bases are used.
Box-Cox transformation of firm size data in statistical analysis

Science.gov (United States)

Chen, Ting Ting; Takaishi, Tetsuya

2014-03-01

Firm size data usually do not show the normality that is often assumed in statistical analysis such as regression analysis. In this study we focus on two measures of firm size: the number of employees and sales. Both deviate considerably from a normal distribution. To improve the normality of the data we apply the Box-Cox transformation with appropriate parameters. The Box-Cox transformation parameters are determined so that the kurtosis of the transformed data is as close as possible to that of a normal distribution. It is found that the two firm size measures transformed by the Box-Cox transformation show strong linearity. This indicates that the number of employees and sales have similar properties as firm size indicators. The Box-Cox parameters obtained for the firm size data are found to be very close to zero. In this case the Box-Cox transformations are approximately log-transformations. This suggests that the firm size data we used are approximately log-normally distributed.
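The kurtosis criterion described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' procedure: the data are synthetic log-normal quantiles standing in for firm sizes, and the function names, grid range and grid step are assumptions made for the example.

```python
import numpy as np
from statistics import NormalDist


def box_cox(x, lam):
    """Box-Cox transform; reduces to the natural log as lam -> 0."""
    x = np.asarray(x, dtype=float)
    if abs(lam) < 1e-12:
        return np.log(x)
    return (x**lam - 1.0) / lam


def kurtosis(y):
    """Non-excess sample kurtosis; equals 3 for a normal distribution."""
    d = y - y.mean()
    return float(np.mean(d**4) / np.mean(d**2) ** 2)


# Hypothetical "firm size" data: evenly spaced quantiles of a log-normal law.
n = 2000
z = np.array([NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)])
sizes = np.exp(z)

# Grid search for the lambda whose transformed data has kurtosis closest to 3.
grid = np.linspace(-1.0, 1.0, 41)
best_lambda = min(grid, key=lambda lam: abs(kurtosis(box_cox(sizes, lam)) - 3.0))
print("lambda with kurtosis closest to 3:", best_lambda)
```

For data that are log-normal by construction, the selected parameter lands near zero, which is the regime in which the Box-Cox transformation behaves like a log-transformation.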
Box-Cox transformation of firm size data in statistical analysis

International Nuclear Information System (INIS)

Chen, Ting Ting; Takaishi, Tetsuya

2014-01-01

Firm size data usually do not show the normality that is often assumed in statistical analysis such as regression analysis. In this study we focus on two measures of firm size: the number of employees and sales. Both deviate considerably from a normal distribution. To improve the normality of the data we apply the Box-Cox transformation with appropriate parameters. The Box-Cox transformation parameters are determined so that the kurtosis of the transformed data is as close as possible to that of a normal distribution. It is found that the two firm size measures transformed by the Box-Cox transformation show strong linearity. This indicates that the number of employees and sales have similar properties as firm size indicators. The Box-Cox parameters obtained for the firm size data are found to be very close to zero. In this case the Box-Cox transformations are approximately log-transformations. This suggests that the firm size data we used are approximately log-normally distributed.
MONITORING INTERNATIONAL MIGRATION FLOWS IN EUROPE - TOWARDS A STATISTICAL-DATA BASE COMBINING DATA FROM DIFFERENT SOURCES

NARCIS (Netherlands)

WILLEKENS, F

1994-01-01

The paper reviews techniques developed in demography, geography and statistics that are useful for bridging the gap between available data on international migration flows and the information required for policy making and research. The basic idea of the paper is as follows: to establish a coherent
Sharing Privacy Protected and Statistically Sound Clinical Research Data Using Outsourced Data Storage

Directory of Open Access Journals (Sweden)

Geontae Noh

2014-01-01

It is critical to scientific progress to share clinical research data stored in outsourced generally available cloud computing services. Researchers are able to obtain valuable information that they would not otherwise be able to access; however, privacy concerns arise when sharing clinical data in these outsourced publicly available data storage services. HIPAA requires researchers to deidentify private information when disclosing clinical data for research purposes and describes two available methods for doing so. Unfortunately, both techniques degrade statistical accuracy. Therefore, the need to protect privacy presents a significant problem for data sharing between hospitals and researchers. In this paper, we propose a controlled secure aggregation protocol to secure both privacy and accuracy when researchers outsource their clinical research data for sharing. Since clinical data must remain private beyond a patient’s lifetime, we take advantage of lattice-based homomorphic encryption to guarantee long-term security against quantum computing attacks. Using lattice-based homomorphic encryption, we design an aggregation protocol that aggregates outsourced ciphertexts under distinct public keys. It enables researchers to get aggregated results from outsourced ciphertexts of distinct researchers. To the best of our knowledge, our protocol is the first aggregation protocol which can aggregate ciphertexts which are encrypted with distinct public keys.
Information gathering for the Transportation Statistics Data Bank

International Nuclear Information System (INIS)

Shappert, L.B.; Mason, P.J.

1981-10-01

The Transportation Statistics Data Bank (TSDB) was developed in 1974 to collect information on the transport of Department of Energy (DOE) materials. This computer program may be used to provide the framework for collecting more detailed information on DOE shipments of radioactive materials. This report describes the type of information that is needed in this area and concludes that the existing system could be readily modified to collect and process it. The additional needed information, available from bills of lading and similar documents, could be gathered from DOE field offices and transferred in a standard format to the TSDB system. Costs of the system are also discussed briefly

JAWS data collection, analysis highlights, and microburst statistics

Science.gov (United States)

Mccarthy, J.; Roberts, R.; Schreiber, W.

1983-01-01

Organization, equipment, and the current status of the Joint Airport Weather Studies project initiated in relation to the microburst phenomenon are summarized. Some data collection techniques and preliminary statistics on microburst events recorded by Doppler radar are discussed as well. Radar studies show that microbursts occur much more often than expected, with the majority of the events being potentially dangerous to landing or departing aircraft. Seventy events were registered, with the differential velocities ranging from 10 to 48 m/s; headwind/tailwind velocity differentials over 20 m/s are considered seriously hazardous. It is noted that a correlation is yet to be established between the velocity differential and incoherent radar reflectivity.
Isocount scintillation scanner with preset statistical data reliability

International Nuclear Information System (INIS)

Ikebe, J.; Yamaguchi, H.; Nawa, O.A.

1975-01-01

A scintillation detector scans an object such as a live body along horizontal straight scanning lines in such a manner that the detector is stopped at each scanning point for the time interval T required to count a predetermined number N of pulses. The rate $R_N = N/T$ is then calculated, and output signal pulses, the number of which represents the rate $R_N$, or a corresponding output signal, are used as the recording signal for forming the scintigram. In contrast to the usual scanner, the isocount scanner scans an object stepwise in order to gather data with statistically uniform reliability
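The reason a preset count gives uniform reliability can be sketched with a short calculation. This is an illustrative sketch that assumes Poisson pulse statistics, under which the relative standard error of the rate estimate is roughly 1/sqrt(N) regardless of the true rate; the preset count and dwell times below are hypothetical values, not figures from the record.

```python
import math


def isocount_rate(n_pulses, elapsed_s):
    """Rate estimate R_N = N / T for a preset pulse count N."""
    return n_pulses / elapsed_s


def relative_error(n_pulses):
    """Approximate relative standard error of R_N under Poisson counting."""
    return 1.0 / math.sqrt(n_pulses)


N = 400  # hypothetical preset count
for elapsed in (0.5, 2.0, 8.0):  # hypothetical dwell times in seconds
    # The rate changes from point to point, the relative error does not.
    print(elapsed, isocount_rate(N, elapsed), relative_error(N))
```

A fixed-dwell-time scanner would instead collect fewer pulses at low-activity points, so its relative error would vary across the image.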
Multiple point statistical simulation using uncertain (soft) conditional data

Science.gov (United States)

Hansen, Thomas Mejer; Vu, Le Thanh; Mosegaard, Klaus; Cordua, Knud Skou

2018-05-01

Geostatistical simulation methods have been used to quantify spatial variability of reservoir models since the 80s. In the last two decades, state of the art simulation methods have changed from being based on covariance-based 2-point statistics to multiple-point statistics (MPS), which allow simulation of more realistic Earth-structures. In addition, increasing amounts of geo-information (geophysical, geological, etc.) from multiple sources are being collected. This poses the problem of integrating these different sources of information, such that decisions related to reservoir models can be taken on as informed a basis as possible. In principle, though difficult in practice, this can be achieved using computationally expensive Monte Carlo methods. Here we investigate the use of sequential simulation based MPS simulation methods conditional to uncertain (soft) data, as a computationally efficient alternative. First, it is demonstrated that current implementations of sequential simulation based on MPS (e.g. SNESIM, ENESIM and Direct Sampling) do not account properly for uncertain conditional information, due to a combination of using only co-located information and a random simulation path. Then, we suggest two approaches that better account for the available uncertain information. The first makes use of a preferential simulation path, where more informed model parameters are visited preferentially to less informed ones. The second approach involves using non co-located uncertain information. For different types of available data, these approaches are demonstrated to produce simulation results similar to those obtained by the general Monte Carlo based approach. These methods allow MPS simulation to condition properly to uncertain (soft) data, and hence provide a computationally attractive approach for integration of information about a reservoir model.
A random-sum Wilcoxon statistic and its application to analysis of ROC and LROC data.

Science.gov (United States)

Tang, Liansheng Larry; Balakrishnan, N

2011-01-01

The Wilcoxon-Mann-Whitney statistic is commonly used for a distribution-free comparison of two groups. One requirement for its use is that the sample sizes of the two groups are fixed. This is violated in some of the applications such as medical imaging studies and diagnostic marker studies; in the former, the violation occurs since the number of correctly localized abnormal images is random, while in the latter the violation is due to some subjects not having observable measurements. For this reason, we propose here a random-sum Wilcoxon statistic for comparing two groups in the presence of ties, and derive its variance as well as its asymptotic distribution for large sample sizes. The proposed statistic includes the regular Wilcoxon rank-sum statistic. Finally, we apply the proposed statistic for summarizing location response operating characteristic data from a liver computed tomography study, and also for summarizing diagnostic accuracy of biomarker data.
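For orientation, the regular fixed-sample-size Wilcoxon-Mann-Whitney statistic with ties, which the abstract says the proposed statistic includes, can be sketched as below. This is the textbook statistic only, not the random-sum version proposed by the authors, and the rating scores are hypothetical.

```python
def mann_whitney_u(x, y):
    """Count pairs with x_i < y_j, giving each tie a weight of 1/2."""
    u = 0.0
    for xi in x:
        for yj in y:
            if xi < yj:
                u += 1.0
            elif xi == yj:
                u += 0.5
    return u


normal_scores = [1, 2, 2, 3]    # hypothetical ratings, non-diseased group
abnormal_scores = [2, 3, 4, 5]  # hypothetical ratings, diseased group

u = mann_whitney_u(normal_scores, abnormal_scores)
# Dividing by the number of pairs gives the empirical area under the ROC curve.
auc = u / (len(normal_scores) * len(abnormal_scores))
print(u, auc)
```

The division by a fixed number of pairs is where the fixed-sample-size requirement enters; the abstract's concern is the case in which those group sizes are themselves random.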
Statistical analysis on experimental calibration data for flowmeters in pressure pipes

Science.gov (United States)

Lazzarin, Alessandro; Orsi, Enrico; Sanfilippo, Umberto

2017-08-01

This paper presents a statistical analysis of experimental calibration data for flowmeters (electromagnetic, ultrasonic and turbine flowmeters) in pressure pipes. The experimental calibration data set consists of the whole archive of the calibration tests carried out on 246 flowmeters from January 2001 to October 2015 at Settore Portate of Laboratorio di Idraulica “G. Fantoli” of Politecnico di Milano, which is accredited as LAT 104 for a flow range between 3 l/s and 80 l/s, with a certified Calibration and Measurement Capability (CMC) - formerly known as Best Measurement Capability (BMC) - equal to 0.2%. The data set is split into three subsets, consisting respectively of 94 electromagnetic, 83 ultrasonic and 69 turbine flowmeters; each subset is analysed separately from the others, and a final comparison is then carried out. In particular, the main focus of the statistical analysis is the correction C, that is, the difference between the flow rate Q measured by the calibration facility (through the accredited procedures and the certified reference specimen) and the flow rate Q_M simultaneously recorded by the flowmeter under calibration, expressed as a percentage of Q_M.
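The correction defined in this abstract amounts to a one-line formula, sketched here under the assumption that both flow rates are expressed in the same units. The function name and the sample readings are hypothetical.

```python
def correction_percent(q_reference, q_meter):
    """Correction C: reference flow minus meter reading, as a percentage of the meter reading."""
    return 100.0 * (q_reference - q_meter) / q_meter


# Hypothetical calibration point in l/s: the facility measures 50.5 while the meter reads 50.0.
c = correction_percent(50.5, 50.0)
print(c)  # a positive C means the flowmeter under-reads relative to the reference
```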
Maximum-likelihood fitting of data dominated by Poisson statistical uncertainties

International Nuclear Information System (INIS)

Stoneking, M.R.; Den Hartog, D.J.

1996-06-01

The fitting of data by χ²-minimization is valid only when the uncertainties in the data are normally distributed. When analyzing spectroscopic or particle counting data at very low signal level (e.g., a Thomson scattering diagnostic), the uncertainties are distributed with a Poisson distribution. The authors have developed a maximum-likelihood method for fitting data that correctly treats the Poisson statistical character of the uncertainties. This method maximizes the total probability that the observed data are drawn from the assumed fit function using the Poisson probability function to determine the probability for each data point. The algorithm also returns uncertainty estimates for the fit parameters. They compare this method with a χ²-minimization routine applied to both simulated and real data. Differences in the returned fits are greater at low signal level (less than ∼20 counts per measurement). The maximum-likelihood method is found to be more accurate and robust, returning a narrower distribution of values for the fit parameters with fewer outliers
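The difference between the two fitting criteria can be seen in the simplest possible case, a constant fit function. This is a textbook illustration rather than the authors' algorithm: it assumes the χ² weights are taken from the data themselves (variance equal to the observed count), and the counts are hypothetical.

```python
import math

counts = [3, 7, 2, 5, 4, 9, 1, 6]  # hypothetical low-signal counts, all > 0


def poisson_log_likelihood(mu, data):
    """Log of the total Poisson probability of the data for a constant rate mu."""
    return sum(n * math.log(mu) - mu - math.lgamma(n + 1) for n in data)


def chi2_data_weighted(mu, data):
    """Chi-square with each variance estimated by the observed count."""
    return sum((n - mu) ** 2 / n for n in data)


# Closed-form optima for a constant model:
mu_ml = sum(counts) / len(counts)                      # arithmetic mean maximizes the likelihood
mu_chi2 = len(counts) / sum(1.0 / n for n in counts)   # harmonic mean minimizes this chi-square
print(mu_ml, mu_chi2)
```

Because the harmonic mean never exceeds the arithmetic mean, the data-weighted χ² estimate is pulled toward low counts in this simple setting, and the pull is strongest when the counts are small.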
A scan statistic to extract causal gene clusters from case-control genome-wide rare CNV data

Directory of Open Access Journals (Sweden)

Scherer Stephen W

2011-05-01

Background: Several statistical tests have been developed for analyzing genome-wide association data by incorporating gene pathway information in terms of gene sets. Using these methods, hundreds of gene sets are typically tested, and the tested gene sets often overlap. This overlapping greatly increases the probability of generating false positives, and the results obtained are difficult to interpret, particularly when many gene sets show statistical significance. Results: We propose a flexible statistical framework to circumvent these problems. Inspired by spatial scan statistics for detecting clustering of disease occurrence in the field of epidemiology, we developed a scan statistic to extract disease-associated gene clusters from a whole gene pathway. Extracting one or a few significant gene clusters from a global pathway limits the overall false positive probability, which results in increased statistical power, and facilitates the interpretation of test results. In the present study, we applied our method to genome-wide association data for rare copy-number variations, which have been strongly implicated in common diseases. Application of our method to a simulated dataset demonstrated the high accuracy of this method in detecting disease-associated gene clusters in a whole gene pathway. Conclusions: The scan statistic approach proposed here shows a high level of accuracy in detecting gene clusters in a whole gene pathway. This study has provided a sound statistical framework for analyzing genome-wide rare CNV data by incorporating topological information on the gene pathway.
A system for classifying wood-using industries and recording statistics for automatic data processing.

Science.gov (United States)

E.W. Fobes; R.W. Rowe

1968-01-01

A system for classifying wood-using industries and recording pertinent statistics for automatic data processing is described. Forms and coding instructions for recording data of primary processing plants are included.
A functional U-statistic method for association analysis of sequencing data.

Science.gov (United States)

Jadhav, Sneha; Tong, Xiaoran; Lu, Qing

2017-11-01

Although sequencing studies hold great promise for uncovering novel variants predisposing to human diseases, the high dimensionality of the sequencing data brings tremendous challenges to data analysis. Moreover, for many complex diseases (e.g., psychiatric disorders) multiple related phenotypes are collected. These phenotypes can be different measurements of an underlying disease, or measurements characterizing multiple related diseases for studying a common genetic mechanism. Although jointly analyzing these phenotypes could potentially increase the power of identifying disease-associated genes, the different types of phenotypes pose challenges for association analysis. To address these challenges, we propose a nonparametric method, the functional U-statistic method (FU), for multivariate analysis of sequencing data. It first constructs smooth functions from individuals' sequencing data, and then tests the association of these functions with multiple phenotypes by using a U-statistic. The method provides a general framework for analyzing various types of phenotypes (e.g., binary and continuous phenotypes) with unknown distributions. Fitting the genetic variants within a gene using a smoothing function also allows us to capture complexities of gene structure (e.g., linkage disequilibrium, LD), which could potentially increase the power of association analysis. Through simulations, we compared our method to the multivariate outcome score test (MOST), and found that our test attained better performance than MOST. In a real data application, we applied our method to the sequencing data from the Minnesota Twin Study (MTS) and found potential associations of several nicotine receptor subunit (CHRN) genes, including CHRNB3, with nicotine dependence and/or alcohol dependence.
Secure and scalable deduplication of horizontally partitioned health data for privacy-preserving distributed statistical computation.

Science.gov (United States)

Yigzaw, Kassaye Yitbarek; Michalas, Antonis; Bellika, Johan Gustav

2017-01-03

Techniques have been developed to compute statistics on distributed datasets without revealing private information except the statistical results. However, duplicate records in a distributed dataset may lead to incorrect statistical results. Therefore, to increase the accuracy of the statistical analysis of a distributed dataset, secure deduplication is an important preprocessing step. We designed a secure protocol for the deduplication of horizontally partitioned datasets with deterministic record linkage algorithms. We provided a formal security analysis of the protocol in the presence of semi-honest adversaries. The protocol was implemented and deployed across three microbiology laboratories located in Norway, and we ran experiments on the datasets in which the number of records for each laboratory varied. Experiments were also performed on simulated microbiology datasets and data custodians connected through a local area network. The security analysis demonstrated that the protocol protects the privacy of individuals and data custodians under a semi-honest adversarial model. More precisely, the protocol remains secure with the collusion of up to N - 2 corrupt data custodians. The total runtime for the protocol scales linearly with the addition of data custodians and records. One million simulated records distributed across 20 data custodians were deduplicated within 45 s. The experimental results showed that the protocol is more efficient and scalable than previous protocols for the same problem. The proposed deduplication protocol is efficient and scalable for practical uses while protecting the privacy of patients and data custodians.
Review of Naked Statistics: Stripping the Dread from Data by Charles Wheelan

Directory of Open Access Journals (Sweden)

Michael T. Catalano

2015-01-01

Wheelan, Charles. Naked Statistics: Stripping the Dread from Data (New York, NY, W. W. Norton & Company, 2014). 282 pp. ISBN 978-0-393-07195-5. In his review of What Numbers Say and The Numbers Game, Rob Root (Numeracy 3(1): 9) writes “Popular books on quantitative literacy need to be easy to read, reasonably comprehensive in scope, and include examples that are thought-provoking and memorable.” Wheelan’s book certainly meets this description, and should be of interest to both the general public and those with a professional interest in numeracy. A moderately diligent learner can get a decent understanding of basic statistics from the book. Teachers of statistics and quantitative literacy will find a wealth of well-related examples and stories to use in their classes.
A knowledge-based T2-statistic to perform pathway analysis for quantitative proteomic data.

Science.gov (United States)

Lai, En-Yu; Chen, Yi-Hau; Wu, Kun-Pin

2017-06-01

Approaches to identify significant pathways from high-throughput quantitative data have been developed in recent years. Still, the analysis of proteomic data remains difficult because of limited sample size. This limitation also leads to the practice of using a competitive null as a common approach, which fundamentally treats genes or proteins as independent units. The independence assumption ignores the associations among biomolecules with similar functions or cellular localization, as well as the interactions among them manifested as changes in expression ratios. Consequently, these methods often underestimate the associations among biomolecules and cause false positives in practice. Some studies incorporate the sample covariance matrix into the calculation to address this issue. However, the sample covariance may not be a precise estimate if the sample size is very limited, which is usually the case for data produced by mass spectrometry. In this study, we introduce a multivariate test under a self-contained null to perform pathway analysis for quantitative proteomic data. The covariance matrix used in the test statistic is constructed from the confidence scores retrieved from the STRING database or the HitPredict database. We also design an integrating procedure to retain pathways of sufficient evidence as a pathway group. The performance of the proposed T2-statistic is demonstrated using five published experimental datasets: the T-cell activation, the cAMP/PKA signaling, the myoblast differentiation, and the effect of dasatinib on the BCR-ABL pathway are proteomic datasets produced by mass spectrometry; and the protective effect of myocilin via the MAPK signaling pathway is a gene expression dataset of limited sample size. Compared with other popular statistics, the proposed T2-statistic yields more accurate descriptions in agreement with the discussion of the original publication. We implemented the T2-statistic into an R package T2GA, which is available at https
Manipulating measurement scales in medical statistical analysis and data mining: A review of methodologies

Directory of Open Access Journals (Sweden)

Hamid Reza Marateb

2014-01-01

Background: selecting the correct statistical test and data mining method depends highly on the measurement scale of data, type of variables, and purpose of the analysis. Different measurement scales are studied in detail, and statistical comparison, modeling, and data mining methods are studied using several medical examples. We have presented two ordinal-variable clustering examples, ordinal variables being the more challenging type in analysis, using Wisconsin Breast Cancer Data (WBCD). Ordinal-to-Interval scale conversion example: a breast cancer database of nine 10-level ordinal variables for 683 patients was analyzed by two ordinal-scale clustering methods. The performance of the clustering methods was assessed by comparison with the gold standard groups of malignant and benign cases that had been identified by clinical tests. Results: the sensitivity and accuracy of the two clustering methods were 98% and 96%, respectively. Their specificity was comparable. Conclusion: by using an appropriate clustering algorithm based on the measurement scale of the variables in the study, high performance can be achieved. Moreover, descriptive and inferential statistics, in addition to the modeling approach, must be selected based on the scale of the variables.
Manipulating measurement scales in medical statistical analysis and data mining: A review of methodologies

Science.gov (United States)

Marateb, Hamid Reza; Mansourian, Marjan; Adibi, Peyman; Farina, Dario

2014-01-01

Background: selecting the correct statistical test and data mining method depends highly on the measurement scale of data, type of variables, and purpose of the analysis. Different measurement scales are studied in detail, and statistical comparison, modeling, and data mining methods are studied using several medical examples. We have presented two ordinal-variable clustering examples, ordinal variables being the more challenging type in analysis, using Wisconsin Breast Cancer Data (WBCD). Ordinal-to-Interval scale conversion example: a breast cancer database of nine 10-level ordinal variables for 683 patients was analyzed by two ordinal-scale clustering methods. The performance of the clustering methods was assessed by comparison with the gold standard groups of malignant and benign cases that had been identified by clinical tests. Results: the sensitivity and accuracy of the two clustering methods were 98% and 96%, respectively. Their specificity was comparable. Conclusion: by using an appropriate clustering algorithm based on the measurement scale of the variables in the study, high performance can be achieved. Moreover, descriptive and inferential statistics, in addition to the modeling approach, must be selected based on the scale of the variables. PMID:24672565
Statistical comparisons of Savannah River anemometer data applied to quality control of instrument networks

International Nuclear Information System (INIS)

Porch, W.M.; Dickerson, M.H.

1976-08-01

Continuous monitoring of extensive meteorological instrument arrays is a requirement in the study of important mesoscale atmospheric phenomena. The phenomena include pollution transport prediction from continuous area sources, or one time releases of toxic materials and wind energy prospecting in areas of topographic enhancement of the wind. Quality control techniques that can be applied to these data to determine if the instruments are operating within their prescribed tolerances were investigated. Savannah River Plant data were analyzed with both independent and comparative statistical techniques. The independent techniques calculate the mean, standard deviation, moments about the mean, kurtosis, skewness, probability density distribution, cumulative probability and power spectra. The comparative techniques include covariance, cross-spectral analysis and two dimensional probability density. At present the calculating and plotting routines for these statistical techniques do not reside in a single code so it is difficult to ascribe independent memory size and computation time accurately. However, given the flexibility of a data system which includes simple and fast running statistics at the instrument end of the data network (ASF) and more sophisticated techniques at the computational end (ACF) a proper balance will be attained. These techniques are described in detail and preliminary results are presented
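A few of the independent and comparative statistics listed in this abstract can be sketched as follows. This is a minimal illustration using population (1/n) moments; the report's own routines are not described in enough detail to reproduce, so the function names and the two short wind-speed series are assumptions.

```python
import numpy as np


def independent_stats(x):
    """Mean, standard deviation, skewness and kurtosis of one instrument's record."""
    x = np.asarray(x, dtype=float)
    d = x - x.mean()
    var = np.mean(d**2)
    std = np.sqrt(var)
    return {
        "mean": float(x.mean()),
        "std": float(std),
        "skewness": float(np.mean(d**3) / std**3),
        "kurtosis": float(np.mean(d**4) / var**2),
    }


def covariance(x, y):
    """Comparative statistic: covariance between two instruments' records."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.mean((x - x.mean()) * (y - y.mean())))


# Hypothetical wind speeds in m/s; the second anemometer reads 0.5 m/s high.
wind_a = [2.0, 4.0, 6.0, 8.0]
wind_b = [2.5, 4.5, 6.5, 8.5]

stats_a = independent_stats(wind_a)
cov_ab = covariance(wind_a, wind_b)
print(stats_a, cov_ab)
```

A constant offset between instruments shows up in the means but leaves the covariance unchanged, which is one reason to look at independent and comparative statistics together when checking an instrument network.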
Statistical Literacy for Active Citizenship: A Call for Data Science Education

Science.gov (United States)

Engel, Joachim

2017-01-01

Data are abundant: quantitative information about the state of society and the wider world is around us more than ever. Paradoxically, recent trends in the public discourse point towards a post-factual world that seems content to ignore or misrepresent empirical evidence. As statistics educators we are challenged to promote understanding of…
Test Statistics and Confidence Intervals to Establish Noninferiority between Treatments with Ordinal Categorical Data.

Science.gov (United States)

Zhang, Fanghong; Miyaoka, Etsuo; Huang, Fuping; Tanaka, Yutaka

2015-01-01

The problem of establishing noninferiority between a new treatment and a standard (control) treatment with ordinal categorical data is discussed. A measure of treatment effect is used and a method of specifying the noninferiority margin for the measure is provided. Two Z-type test statistics are proposed where the estimation of variance is constructed under the shifted null hypothesis using U-statistics. Furthermore, the confidence interval and the sample size formula are given based on the proposed test statistics. The proposed procedure is applied to a dataset from a clinical trial. A simulation study is conducted to compare the performance of the proposed test statistics with that of the existing ones, and the results show that the proposed test statistics are better in terms of the deviation from nominal level and the power.
Statistical Diversions

Science.gov (United States)

Petocz, Peter; Sowey, Eric

2012-01-01

The term "data snooping" refers to the practice of choosing which statistical analyses to apply to a set of data after having first looked at those data. Data snooping contradicts a fundamental precept of applied statistics, that the scheme of analysis is to be planned in advance. In this column, the authors shall elucidate the…
Reasoning with data an introduction to traditional and Bayesian statistics using R

CERN Document Server

Stanton, Jeffrey M

2017-01-01

Engaging and accessible, this book teaches readers how to use inferential statistical thinking to check their assumptions, assess evidence about their beliefs, and avoid overinterpreting results that may look more promising than they really are. It provides step-by-step guidance for using both classical (frequentist) and Bayesian approaches to inference. Statistical techniques covered side by side from both frequentist and Bayesian approaches include hypothesis testing, replication, analysis of variance, calculation of effect sizes, regression, time series analysis, and more. Students also get a complete introduction to the open-source R programming language and its key packages. Throughout the text, simple commands in R demonstrate essential data analysis skills using real-data examples. The companion website provides annotated R code for the book's examples, in-class exercises, supplemental reading lists, and links to online videos, interactive materials, and other resources.
A global approach to estimate irrigated areas - a comparison between different data and statistics

Science.gov (United States)

Meier, Jonas; Zabel, Florian; Mauser, Wolfram

2018-02-01

Agriculture is the largest global consumer of water. Irrigated areas constitute 40 % of the total area used for agricultural production (FAO, 2014a). Information on their spatial distribution is highly relevant for regional water management and food security. Spatial information on irrigation is highly important for policy and decision makers, who are facing the transition towards more efficient sustainable agriculture. However, the mapping of irrigated areas still represents a challenge for land use classifications, and existing global data sets differ strongly in their results. The following study tests an existing irrigation map based on statistics and extends the irrigated area using ancillary data. The approach processes and analyzes multi-temporal normalized difference vegetation index (NDVI) SPOT-VGT data and agricultural suitability data, both at a spatial resolution of 30 arcsec, incrementally in a multiple decision tree. It covers the period from 1999 to 2012. The results globally show an 18 % larger irrigated area than existing approaches based on statistical data. The largest differences compared to the official national statistics are found in Asia and particularly in China and India. The additional areas are mainly identified within already known irrigated regions where irrigation is more dense than previously estimated. The validation with global and regional products shows the large divergence of existing data sets with respect to size and distribution of irrigated areas caused by spatial resolution, the considered time period and the input data and assumptions made.

DATA MINING AND STATISTICS METHODS USAGE FOR ADVANCED TRAINING COURSES QUALITY MEASUREMENT: CASE STUDY

Directory of Open Access Journals (Sweden)

Maxim I. Galchenko

2014-01-01

In the article we consider a case study in the analysis of educational statistics data, namely the results of a survey of students on professional development courses, analysed with specialized software. The need for extended processing of the statistical results is shown, together with the scheme for carrying out the analysis. Conclusions on the case studied are presented.
Data analysis and graphing in an introductory physics laboratory: spreadsheet versus statistics suite

International Nuclear Information System (INIS)

Peterlin, Primoz

2010-01-01

Two methods of data analysis are compared: spreadsheet software and a statistics software suite. Their use is compared analysing data collected in three selected experiments taken from an introductory physics laboratory, which include a linear dependence, a nonlinear dependence and a histogram. The merits of each method are compared.
Neutron stars in the light of SKA: Data, statistics, and science

Indian Academy of Sciences (India)

8

2016-09-10

Sep 10, 2016 ... neutron star astrophysics: Through the case studies presented here, we hope to convey the challenges involved in devising or adopting statistical methods in the light of the .... The specific tests we applied to a recent version of a glitch dataset ... model the pulse energy data, a robust multivariate method to ...
Exploratory study on a statistical method to analyse time resolved data obtained during nanomaterial exposure measurements

International Nuclear Information System (INIS)

Clerc, F; Njiki-Menga, G-H; Witschger, O

2013-01-01

Most of the measurement strategies that are suggested at the international level to assess workplace exposure to nanomaterials rely on devices measuring, in real time, airborne particle concentrations (according to different metrics). Since none of the instruments used to measure aerosols can distinguish a particle of interest from the background aerosol, the statistical analysis of time resolved data requires special attention. So far, very few approaches have been used for statistical analysis in the literature. These range from simple qualitative analysis of graphs to the implementation of more complex statistical models. To date, there is still no consensus on a particular approach, and the search for an appropriate and robust method continues. In this context, this exploratory study investigates a statistical method to analyse time resolved data based on a Bayesian probabilistic approach. To investigate and illustrate the use of this statistical method, particle number concentration data have been used from a workplace study that investigated the potential for exposure via inhalation from cleanout operations by sandpapering of a reactor producing nanocomposite thin films. In this workplace study, the background issue has been addressed through the near-field and far-field approaches and several size integrated and time resolved devices have been used. The analysis of the results presented here focuses only on data obtained with two handheld condensation particle counters. While one was measuring at the source of the released particles, the other one was measuring in parallel far-field. The Bayesian probabilistic approach allows a probabilistic modelling of data series, and the observed task is modelled in the form of probability distributions. The probability distributions issuing from time resolved data obtained at the source can be compared with the probability distributions issuing from the time resolved data obtained far-field, leading in a
A simple method to downscale daily wind statistics to hourly wind data

OpenAIRE

Guo, Zhongling

2013-01-01

Wind is the principal driver in wind erosion models. Hourly wind speed data are generally required for precise wind erosion modeling. In this study, a simple method to generate hourly wind speed data from daily wind statistics (daily average and maximum wind speeds together or daily average wind speed only) was established. Hourly wind speed data measured over 3285 days (9 years) at a typical windy location were used to validate the downscaling method. The results showed that the over...
Maximum Likelihood, Consistency and Data Envelopment Analysis: A Statistical Foundation

OpenAIRE

Rajiv D. Banker

1993-01-01

This paper provides a formal statistical basis for the efficiency evaluation techniques of data envelopment analysis (DEA). DEA estimators of the best practice monotone increasing and concave production function are shown to be also maximum likelihood estimators if the deviation of actual output from the efficient output is regarded as a stochastic variable with a monotone decreasing probability density function. While the best practice frontier estimator is biased below the theoretical front...
Fundamentals and Catalytic Innovation: The Statistical and Data Management Center of the Antibacterial Resistance Leadership Group.

Science.gov (United States)

Huvane, Jacqueline; Komarow, Lauren; Hill, Carol; Tran, Thuy Tien T; Pereira, Carol; Rosenkranz, Susan L; Finnemeyer, Matt; Earley, Michelle; Jiang, Hongyu Jeanne; Wang, Rui; Lok, Judith; Evans, Scott R

2017-03-15

The Statistical and Data Management Center (SDMC) provides the Antibacterial Resistance Leadership Group (ARLG) with statistical and data management expertise to advance the ARLG research agenda. The SDMC is active at all stages of a study, including design; data collection and monitoring; data analyses and archival; and publication of study results. The SDMC enhances the scientific integrity of ARLG studies through the development and implementation of innovative and practical statistical methodologies and by educating research colleagues regarding the application of clinical trial fundamentals. This article summarizes the challenges and roles, as well as the innovative contributions in the design, monitoring, and analyses of clinical trials and diagnostic studies, of the ARLG SDMC. © The Author 2017. Published by Oxford University Press for the Infectious Diseases Society of America. All rights reserved. For permissions, e-mail: journals.permissions@oup.com.
Inference on network statistics by restricting to the network space: applications to sexual history data.

Science.gov (United States)

Goyal, Ravi; De Gruttola, Victor

2018-01-30

Analysis of sexual history data intended to describe sexual networks presents many challenges arising from the fact that most surveys collect information on only a very small fraction of the population of interest. In addition, partners are rarely identified and responses are subject to reporting biases. Typically, each network statistic of interest, such as mean number of sexual partners for men or women, is estimated independently of other network statistics. There is, however, a complex relationship among network statistics; and knowledge of these relationships can aid in addressing concerns mentioned earlier. We develop a novel method that constrains a posterior predictive distribution of a collection of network statistics in order to leverage the relationships among network statistics in making inference about network properties of interest. The method ensures that inference on network properties is compatible with an actual network. Through extensive simulation studies, we also demonstrate that use of this method can improve estimates in settings where there is uncertainty that arises both from sampling and from systematic reporting bias compared with currently available approaches to estimation. To illustrate the method, we apply it to estimate network statistics using data from the Chicago Health and Social Life Survey. Copyright © 2017 John Wiley & Sons, Ltd.
Statistical methods for the analysis of high-throughput metabolomics data

Directory of Open Access Journals (Sweden)

Fabian J. Theis

2013-01-01

Metabolomics is a relatively new high-throughput technology that aims at measuring all endogenous metabolites within a biological sample in an unbiased fashion. The resulting metabolic profiles may be regarded as functional signatures of the physiological state, and have been shown to comprise effects of genetic regulation as well as environmental factors. This potential to connect genotypic to phenotypic information promises new insights and biomarkers for different research fields, including biomedical and pharmaceutical research. In the statistical analysis of metabolomics data, many techniques from other omics fields can be reused. Recently, however, a number of tools specific to metabolomics data have been developed as well. The focus of this mini review will be on recent advancements in the analysis of metabolomics data, especially by utilizing Gaussian graphical models and independent component analysis.
Infodemiological data of high-school drop-out related web searches in Canada correlating with real-world statistical data in the period 2004–2012

Directory of Open Access Journals (Sweden)

Anna Siri

2016-12-01

When the data were broken down by gender, the correlations were higher, and statistically significant, in males than in females. GT-based data for drop-out were best modeled by an ARMA(1,0) model. Considering the cross-correlation of Canadian regions, all were statistically significant at lag 0, apart from New Brunswick, Newfoundland and Labrador, and Prince Edward Island. A number of cross-correlations were also statistically significant at lag −1 (namely, Alberta, Manitoba, New Brunswick and Saskatchewan).
Sources of Safety Data and Statistical Strategies for Design and Analysis: Postmarket Surveillance.

Science.gov (United States)

Izem, Rima; Sanchez-Kam, Matilde; Ma, Haijun; Zink, Richard; Zhao, Yueqin

2018-03-01

Safety data are continuously evaluated throughout the life cycle of a medical product to accurately assess and characterize the risks associated with the product. The knowledge about a medical product's safety profile continually evolves as safety data accumulate. This paper discusses data sources and analysis considerations for safety signal detection after a medical product is approved for marketing. This manuscript is the second in a series of papers from the American Statistical Association Biopharmaceutical Section Safety Working Group. We share our recommendations for the statistical and graphical methodologies necessary to appropriately analyze, report, and interpret safety outcomes, and we discuss the advantages and disadvantages of safety data obtained from passive postmarketing surveillance systems compared to other sources. Signal detection has traditionally relied on spontaneous reporting databases that have been available worldwide for decades. However, current regulatory guidelines and ease of reporting have increased the size of these databases exponentially over the last few years. With such large databases, data-mining tools using disproportionality analysis and helpful graphics are often used to detect potential signals. Although the data sources have many limitations, analyses of these data have been successful at identifying safety signals postmarketing. Experience analyzing these dynamic data is useful in understanding the potential and limitations of analyses with new data sources such as social media, claims, or electronic medical records data.
IMPROVING QUALITY OF STATISTICAL PROCESS CONTROL BY DEALING WITH NON‐NORMAL DATA IN AUTOMOTIVE INDUSTRY

Directory of Open Access Journals (Sweden)

Zuzana ANDRÁSSYOVÁ

2012-07-01

The study analyses data with the aim of improving the quality of statistical tools used in the assembly of automobile seats. Normally distributed variables are a necessary condition for the analysis, examination, and improvement of manufacturing processes (e.g. manufacturing process capability), although approaches to handling non-normal data are becoming more numerous. The probability distribution of the measured data is first tested for goodness of fit between the empirical distribution and the theoretical normal distribution by hypothesis testing in the programme StatGraphics Centurion XV.II. Data are collected from the assembly process of first-row automobile seats for each quality characteristic (Safety Regulation, S/R) individually. The study examines the measured data from airbag assembly in detail, aiming to obtain normally distributed data and to apply statistical process control to them. The null hypothesis is rejected (the measured variables do not follow the normal distribution), so data transformation, supported by Minitab 15, becomes necessary. Even this approach does not yield normally distributed data, so a procedure should be proposed that leads to a quality output of the whole statistical control of manufacturing processes.
Indexing Combined with Statistical Deflation as a Tool for Analysis of Longitudinal Data.

Science.gov (United States)

Babcock, Judith A.

Indexing is a tool that can be used with longitudinal, quantitative data for analysis of relative changes and for comparisons of changes among items. For greater accuracy, raw financial data should be deflated into constant dollars prior to indexing. This paper demonstrates the procedures for indexing, statistical deflation, and the use of…
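The two steps named in this abstract, deflation into constant dollars followed by indexing, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the function names, the years, the nominal amounts and the price index values are all hypothetical and are not taken from the paper.

```python
def deflate(nominal, price_index, base_year):
    """Convert nominal amounts to constant dollars of base_year."""
    base_price = price_index[base_year]
    return {year: amount * base_price / price_index[year]
            for year, amount in nominal.items()}

def to_index(series, base_year):
    """Express each value relative to the base year (base year = 100)."""
    base = series[base_year]
    return {year: 100.0 * value / base for year, value in series.items()}

# Hypothetical budget figures (current dollars) and price index.
nominal = {2000: 1000.0, 2001: 1100.0, 2002: 1320.0}
prices = {2000: 100.0, 2001: 110.0, 2002: 120.0}

real = deflate(nominal, prices, 2000)   # constant 2000 dollars
indexed = to_index(real, 2000)          # relative change in real terms
```

With these made-up figures the nominal budget grows by 32 % over two years, while the deflated index rises only from 100 to 110, which is the kind of distortion that deflating before indexing is meant to remove.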
Data Analysis and Graphing in an Introductory Physics Laboratory: Spreadsheet versus Statistics Suite

Science.gov (United States)

Peterlin, Primoz

2010-01-01

Two methods of data analysis are compared: spreadsheet software and a statistics software suite. Their use is compared analysing data collected in three selected experiments taken from an introductory physics laboratory, which include a linear dependence, a nonlinear dependence and a histogram. The merits of each method are compared. (Contains 7…
Visualization of time series statistical data by shape analysis (GDP ratio changes among Asia countries)

Science.gov (United States)

Shirota, Yukari; Hashimoto, Takako; Fitri Sari, Riri

2018-03-01

Visualizing time series big data has become very important. In the paper we discuss a new analysis method called “statistical shape analysis” or “geometry driven statistics” applied to time series statistical data in economics. We analyse the changes in agriculture, value added and industry, value added (percentage of GDP) from 2000 to 2010 in Asia. We handle the data as a set of landmarks on a two-dimensional image to see the deformation using the principal components. The key point of the analysis method is the principal components of the given formation, which are eigenvectors of its bending energy matrix. The local deformation can be expressed as a set of non-affine transformations. The transformations give us information about the local differences between 2000 and 2010. Because the non-affine transformation can be decomposed into a set of partial warps, we present the partial warps visually. Statistical shape analysis is widely used in biology but, in economics, no application can be found. In the paper, we investigate its potential for analysing economic data.
Statistical analysis of solid waste composition data: Arithmetic mean, standard deviation and correlation coefficients

DEFF Research Database (Denmark)

Edjabou, Maklawe Essonanawe; Martín-Fernández, Josep Antoni; Scheutz, Charlotte

2017-01-01

-derived food waste amounted to 2.21 ± 3.12% with a confidence interval of (−4.03; 8.45), which highlights the problem of the biased negative proportions. A Pearson’s correlation test, applied to waste fraction generation (kg mass), indicated a positive correlation between avoidable vegetable food waste and plastic packaging. However, correlation tests applied to waste fraction compositions (percentage values) showed a negative association in this regard, thus demonstrating that statistical analyses applied to compositional waste fraction data, without addressing the closed characteristics of these data, have the potential to generate spurious or misleading results. Therefore, compositional data should be transformed adequately prior to any statistical analysis, such as computing mean, standard deviation and correlation coefficients.
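The abstract does not say which transformation the authors apply. One common choice for closed (percentage) data is a log-ratio transform, and the Python sketch below shows the centered log-ratio (clr) as an assumed example; the function names and the three-part sample are hypothetical.

```python
import math

def closure(parts):
    """Rescale strictly positive parts so that they sum to 1."""
    total = sum(parts)
    return [p / total for p in parts]

def clr(composition):
    """Centered log-ratio: log of each part over the geometric mean."""
    logs = [math.log(p) for p in composition]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]

# Hypothetical waste fractions in kg; zeros would need separate treatment
# because the logarithm is undefined for them.
sample = closure([20.0, 30.0, 50.0])
transformed = clr(sample)
```

The clr values sum to zero and do not depend on the scale of the input, so means, standard deviations and correlations computed on them are not constrained by the fixed 100 % total in the way raw percentages are.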
Statistical Analysis of Reactor Pressure Vessel Fluence Calculation Benchmark Data Using Multiple Regression Techniques

International Nuclear Information System (INIS)

Carew, John F.; Finch, Stephen J.; Lois, Lambros

2003-01-01

The calculated >1-MeV pressure vessel fluence is used to determine the fracture toughness and integrity of the reactor pressure vessel. It is therefore of the utmost importance to ensure that the fluence prediction is accurate and unbiased. In practice, this assurance is provided by comparing the predictions of the calculational methodology with an extensive set of accurate benchmarks. A benchmarking database is used to provide an estimate of the overall average measurement-to-calculation (M/C) bias in the calculations. This average is used as an ad-hoc multiplicative adjustment to the calculations to correct for the observed calculational bias. However, this average only provides a well-defined and valid adjustment of the fluence if the M/C data are homogeneous; i.e., the data are statistically independent and there is no correlation between subsets of M/C data. Typically, the identification of correlations between the errors in the database M/C values is difficult because the correlation is of the same magnitude as the random errors in the M/C data and varies substantially over the database. In this paper, an evaluation of a reactor dosimetry benchmark database is performed to determine the statistical validity of the adjustment to the calculated pressure vessel fluence. Physical mechanisms that could potentially introduce a correlation between the subsets of M/C ratios are identified and included in a multiple regression analysis of the M/C data. Rigorous statistical criteria are used to evaluate the homogeneity of the M/C data and determine the validity of the adjustment. For the database evaluated, the M/C data are found to be strongly correlated with dosimeter response threshold energy and dosimeter location (e.g., cavity versus in-vessel). It is shown that because of the inhomogeneity in the M/C data, for this database, the benchmark data do not provide a valid basis for adjusting the pressure vessel fluence. The statistical criteria and methods employed in
Statistical processing of natality data for the Czech Republic before and after the Chernobyl accident

International Nuclear Information System (INIS)

Klepetkova, Hana; Thinova, Lenka

2010-01-01

All available data regarding natality in Czechoslovakia (or the Czech Republic) before and after the Chernobyl accident are summarized. Data from the databases of the Czech Statistical Office and of the State Office for Nuclear Safety were used to analyze natality and mortality of children in the Czech Republic and to evaluate the relationship between the level of contamination and the change in the sex ratio at time of birth that was observed in some areas in November of 1986. Although the change in the ratio of newborn boys to girls was statistically significant, no direct relationship between that ratio and the level of contamination was found. Statistically significant changes in the sex ratio also occurred in Czechoslovakia (or in the Czech Republic) in the past, both before and after the accident. Furthermore, no statistically significant changes in the rate of stillbirths and multiple pregnancies were observed after the Chernobyl accident.
Statistical analysis of fatigue strain-life data for carbon and low-alloy steels

International Nuclear Information System (INIS)

Keisler, J.; Chopra, O.K.

1995-03-01

The existing fatigue strain vs life (S-N) data, foreign and domestic, for carbon and low-alloy steels used in the construction of nuclear power plant components have been compiled and categorized according to material, loading, and environmental conditions. A statistical model has been developed for estimating the effects of the various test conditions on fatigue life. The results of a rigorous statistical analysis have been used to estimate the probability of initiating a fatigue crack. Data in the literature were reviewed to evaluate the effects of size, geometry, and surface finish of a component on its fatigue life. The fatigue S-N curves for components have been determined by applying design margins for size, geometry, and surface finish to crack initiation curves estimated from the model
Propagation of statistical and nuclear data uncertainties in Monte Carlo burn-up calculations

International Nuclear Information System (INIS)

Garcia-Herranz, Nuria; Cabellos, Oscar; Sanz, Javier; Juan, Jesus; Kuijper, Jim C.

2008-01-01

Two methodologies to propagate the uncertainties on the nuclide inventory in combined Monte Carlo-spectrum and burn-up calculations are presented, based on sensitivity/uncertainty and random sampling techniques (uncertainty Monte Carlo method). Both enable the assessment of the impact of uncertainties in the nuclear data as well as uncertainties due to the statistical nature of the Monte Carlo neutron transport calculation. The methodologies are implemented in our MCNP-ACAB system, which combines the neutron transport code MCNP-4C and the inventory code ACAB. A high burn-up benchmark problem is used to test the MCNP-ACAB performance in inventory predictions, with no uncertainties. A good agreement is found with the results of other participants. This benchmark problem is also used to assess the impact of nuclear data uncertainties and statistical flux errors in high burn-up applications. A detailed calculation is performed to evaluate the effect of cross-section uncertainties in the inventory prediction, taking into account the temporal evolution of the neutron flux level and spectrum. Very large uncertainties are found at the unusually high burn-up of this exercise (800 MWd/kgHM). To compare the impact of the statistical errors in the calculated flux with respect to the cross-section uncertainties, a simplified problem is considered, taking a constant neutron flux level and spectrum. It is shown that, provided that the flux statistical deviations in the Monte Carlo transport calculation do not exceed a given value, the effect of the flux errors in the calculated isotopic inventory is negligible (even at very high burn-up) compared to the effect of the large cross-section uncertainties available at present in the data files.

Propagation of statistical and nuclear data uncertainties in Monte Carlo burn-up calculations

Energy Technology Data Exchange (ETDEWEB)

Garcia-Herranz, Nuria [Departamento de Ingenieria Nuclear, Universidad Politecnica de Madrid, UPM (Spain)], E-mail: nuria@din.upm.es; Cabellos, Oscar [Departamento de Ingenieria Nuclear, Universidad Politecnica de Madrid, UPM (Spain); Sanz, Javier [Departamento de Ingenieria Energetica, Universidad Nacional de Educacion a Distancia, UNED (Spain); Juan, Jesus [Laboratorio de Estadistica, Universidad Politecnica de Madrid, UPM (Spain); Kuijper, Jim C. [NRG - Fuels, Actinides and Isotopes Group, Petten (Netherlands)

2008-04-15

Two methodologies to propagate the uncertainties on the nuclide inventory in combined Monte Carlo-spectrum and burn-up calculations are presented, based on sensitivity/uncertainty and random sampling techniques (uncertainty Monte Carlo method). Both enable the assessment of the impact of uncertainties in the nuclear data as well as uncertainties due to the statistical nature of the Monte Carlo neutron transport calculation. The methodologies are implemented in our MCNP-ACAB system, which combines the neutron transport code MCNP-4C and the inventory code ACAB. A high burn-up benchmark problem is used to test the MCNP-ACAB performance in inventory predictions, with no uncertainties. A good agreement is found with the results of other participants. This benchmark problem is also used to assess the impact of nuclear data uncertainties and statistical flux errors in high burn-up applications. A detailed calculation is performed to evaluate the effect of cross-section uncertainties in the inventory prediction, taking into account the temporal evolution of the neutron flux level and spectrum. Very large uncertainties are found at the unusually high burn-up of this exercise (800 MWd/kgHM). To compare the impact of the statistical errors in the calculated flux with respect to the cross-section uncertainties, a simplified problem is considered, taking a constant neutron flux level and spectrum. It is shown that, provided that the flux statistical deviations in the Monte Carlo transport calculation do not exceed a given value, the effect of the flux errors in the calculated isotopic inventory is negligible (even at very high burn-up) compared to the effect of the large cross-section uncertainties available at present in the data files.
Integrating functional data to prioritize causal variants in statistical fine-mapping studies.

Directory of Open Access Journals (Sweden)

Gleb Kichaev

2014-10-01

Standard statistical approaches for prioritization of variants for functional testing in fine-mapping studies either use marginal association statistics or estimate posterior probabilities for variants to be causal under simplifying assumptions. Here, we present a probabilistic framework that integrates association strength with functional genomic annotation data to improve accuracy in selecting plausible causal variants for functional validation. A key feature of our approach is that it empirically estimates the contribution of each functional annotation to the trait of interest directly from summary association statistics while allowing for multiple causal variants at any risk locus. We devise efficient algorithms that estimate the parameters of our model across all risk loci to further increase performance. Using simulations starting from the 1000 Genomes data, we find that our framework consistently outperforms the current state-of-the-art fine-mapping methods, reducing the number of variants that need to be selected to capture 90% of the causal variants from an average of 13.3 to 10.4 SNPs per locus (as compared to the next-best performing strategy). Furthermore, we introduce a cost-to-benefit optimization framework for determining the number of variants to be followed up in functional assays and assess its performance using real and simulation data. We validate our findings using a large scale meta-analysis of four blood lipids traits and find that the relative probability for causality is increased for variants in exons and transcription start sites and decreased in repressed genomic regions at the risk loci of these traits. Using these highly predictive, trait-specific functional annotations, we estimate causality probabilities across all traits and variants, reducing the size of the 90% confidence set from an average of 17.5 to 13.5 variants per locus in this data.
Data-driven inference for the spatial scan statistic.

Science.gov (United States)

Almeida, Alexandre C L; Duarte, Anderson R; Duczmal, Luiz H; Oliveira, Fernando L P; Takahashi, Ricardo H C

2011-08-02

Kulldorff's spatial scan statistic for aggregated area maps searches for clusters of cases without specifying their size (number of areas) or geographic location in advance. Their statistical significance is tested while adjusting for the multiple testing inherent in such a procedure. However, as is shown in this work, this adjustment is not done in an even manner for all possible cluster sizes. A modification is proposed to the usual inference test of the spatial scan statistic, incorporating additional information about the size of the most likely cluster found. A new interpretation of the results of the spatial scan statistic is given, posing a modified inference question: what is the probability that the null hypothesis is rejected for the original observed cases map with a most likely cluster of size k, taking into account only those most likely clusters of size k found under null hypothesis for comparison? This question is especially important when the p-value computed by the usual inference process is near the alpha significance level, regarding the correctness of the decision based on this inference. A practical procedure is provided to make more accurate inferences about the most likely cluster found by the spatial scan statistic.
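The modified inference question can be sketched as a Monte Carlo comparison restricted to null replicates whose most likely cluster has the same size k as the observed one. The function name, the (statistic, size) input layout, and the (r + 1) / (n + 1) rank formula are assumptions made for illustration; the paper's exact procedure may differ.

```python
def scan_p_values(observed_stat, observed_size, null_replicates):
    """Return (usual_p, size_conditional_p).

    null_replicates holds one (statistic, size) pair per map simulated
    under the null hypothesis, describing its most likely cluster.
    """
    # Usual inference: compare against every null replicate.
    exceed_all = sum(1 for stat, _ in null_replicates if stat >= observed_stat)
    usual_p = (exceed_all + 1) / (len(null_replicates) + 1)

    # Modified inference: compare only against replicates of the same size.
    same_size = [stat for stat, size in null_replicates if size == observed_size]
    if not same_size:
        return usual_p, None  # no replicate of size k to compare with
    exceed_k = sum(1 for stat in same_size if stat >= observed_stat)
    conditional_p = (exceed_k + 1) / (len(same_size) + 1)
    return usual_p, conditional_p


# Hypothetical null replicates: (test statistic, cluster size in areas).
replicates = [(2.0, 1), (3.0, 1), (5.0, 4), (6.0, 4), (7.0, 4),
              (1.0, 2), (8.0, 4), (2.5, 1), (9.0, 4)]
usual_p, conditional_p = scan_p_values(4.0, 1, replicates)
```

In this toy input the large clusters dominate the upper tail of the null distribution, so a small observed cluster looks unremarkable under the usual test but stands out among clusters of its own size.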
Equipment Maintenance management support system based on statistical analysis of maintenance history data

International Nuclear Information System (INIS)

Shimizu, S.; Ando, Y.; Morioka, T.

1990-01-01

Plant maintenance is recently becoming important with the increase in the number of nuclear power stations and in plant operating time. Various kinds of requirements for plant maintenance, such as countermeasures for equipment degradation and saving maintenance costs while keeping up plant reliability and productivity, are proposed. For this purpose, plant maintenance programs should be improved based on equipment reliability estimated by field data. In order to meet these requirements, it is planned to develop an equipment maintenance management support system for nuclear power plants based on statistical analysis of equipment maintenance history data. The large difference between this proposed new method and current similar methods is to evaluate not only failure data but maintenance data, which includes normal termination data and some degree of degradation or functional disorder data for equipment and parts. So, it is possible to utilize these field data for improving maintenance schedules and to evaluate actual equipment and parts reliability under the current maintenance schedule. In the present paper, the authors show the objectives of this system, an outline of this system and its functions, and the basic technique for collecting and managing of maintenance history data on statistical analysis. It is shown, from the results of feasibility tests using simulation data of maintenance history, that this system has the ability to provide useful information for maintenance and the design enhancement
[Continuity of hospital identifiers in hospital discharge data - Analysis of the nationwide German DRG Statistics from 2005 to 2013].

Science.gov (United States)

Nimptsch, Ulrike; Wengler, Annelene; Mansky, Thomas

2016-11-01

In Germany, nationwide hospital discharge data (DRG statistics provided by the research data centers of the Federal Statistical Office and the Statistical Offices of the 'Länder') are increasingly used as data source for health services research. Within this data hospitals can be separated via their hospital identifier ([Institutionskennzeichen] IK). However, this hospital identifier primarily designates the invoicing unit and is not necessarily equivalent to one hospital location. Aiming to investigate direction and extent of possible bias in hospital-level analyses this study examines the continuity of the hospital identifier within a cross-sectional and longitudinal approach and compares the results to official hospital census statistics. Within the DRG statistics from 2005 to 2013 the annual number of hospitals as classified by hospital identifiers was counted for each year of observation. The annual number of hospitals derived from DRG statistics was compared to the number of hospitals in the official census statistics 'Grunddaten der Krankenhäuser'. Subsequently, the temporal continuity of hospital identifiers in the DRG statistics was analyzed within cohorts of hospitals. Until 2013, the annual number of hospital identifiers in the DRG statistics fell by 175 (from 1,725 to 1,550). This decline affected only providers with small or medium case volume. The number of hospitals identified in the DRG statistics was lower than the number given in the census statistics (e.g., in 2013 1,550 IK vs. 1,668 hospitals in the census statistics). The longitudinal analyses revealed that the majority of hospital identifiers persisted in the years of observation, while one fifth of hospital identifiers changed. In cross-sectional studies of German hospital discharge data the separation of hospitals via the hospital identifier might lead to underestimating the number of hospitals and consequential overestimation of caseload per hospital. Discontinuities of hospital
Explorations in statistics: the analysis of ratios and normalized data.

Science.gov (United States)

Curran-Everett, Douglas

2013-09-01

Learning about statistics is a lot like learning about science: the learning is more meaningful if you can actively explore. This ninth installment of Explorations in Statistics explores the analysis of ratios and normalized (or standardized) data. As researchers, we compute a ratio (a numerator divided by a denominator) to compute a proportion for some biological response or to derive some standardized variable. In each situation, we want to control for differences in the denominator when the thing we really care about is the numerator. But there is peril lurking in a ratio: only if the relationship between numerator and denominator is a straight line through the origin will the ratio be meaningful. If not, the ratio will misrepresent the true relationship between numerator and denominator. In contrast, regression techniques (these include analysis of covariance) are versatile: they can accommodate an analysis of the relationship between numerator and denominator when a ratio is useless.
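The peril can be seen in a small numerical sketch. The data are hypothetical and built so that the numerator is a straight line in the denominator with a nonzero intercept; the ratio then drifts with the denominator, while an ordinary least-squares fit recovers the line.

```python
def least_squares_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data: numerator = 5 + 0.5 * denominator (line misses the origin).
denominator = [10.0, 20.0, 40.0]
numerator = [5.0 + 0.5 * d for d in denominator]

# The ratio changes with the denominator even though the relationship is fixed.
ratios = [n / d for n, d in zip(numerator, denominator)]

# Regression describes the same data with one intercept and one slope.
intercept, slope = least_squares_line(denominator, numerator)
```

Here the ratios fall from 1.0 to 0.625 as the denominator grows, which could be mistaken for a biological effect; the fitted intercept and slope show that nothing about the relationship has changed.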
Accidents in Malaysian construction industry: statistical data and court cases.

Science.gov (United States)

Chong, Heap Yih; Low, Thuan Siang

2014-01-01

Safety and health issues remain critical to the construction industry due to its working environment and the complexity of working practises. This research attempts to adopt 2 research approaches using statistical data and court cases to address and identify the causes and behavior underlying construction safety and health issues in Malaysia. Factual data on the period of 2000-2009 were retrieved to identify the causes and agents that contributed to health issues. Moreover, court cases were tabulated and analyzed to identify legal patterns of parties involved in construction site accidents. Approaches of this research produced consistent results and highlighted a significant reduction in the rate of accidents per construction project in Malaysia.
Segmentation of human skull in MRI using statistical shape information from CT data.

Science.gov (United States)

Wang, Defeng; Shi, Lin; Chu, Winnie C W; Cheng, Jack C Y; Heng, Pheng Ann

2009-09-01

To automatically segment the skull from the MRI data using a model-based three-dimensional segmentation scheme. This study exploited the statistical anatomy extracted from the CT data of a group of subjects by means of constructing an active shape model of the skull surfaces. To construct a reliable shape model, a novel approach was proposed to optimize the automatic landmarking on the coupled surfaces (i.e., the skull vault) by minimizing the description length that incorporated local thickness information. This model was then used to locate the skull shape in MRI of a different group of patients. Compared with performing landmarking separately on the coupled surfaces, the proposed landmarking method constructed models that had better generalization ability and specificity. The segmentation accuracies were measured by the Dice coefficient and the set difference, and compared with the method based on mathematical morphology operations. The proposed approach using the active shape model based on the statistical skull anatomy presented in the head CT data contributes to more reliable segmentation of the skull from MRI data.
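For reference, the Dice coefficient used to score a segmentation is twice the overlap between two voxel sets divided by the sum of their sizes. The sketch below represents each segmentation as a set of voxel coordinates; that representation and the convention for two empty sets are assumptions for illustration.

```python
def dice_coefficient(segmentation, reference):
    """Dice = 2 * |A intersect B| / (|A| + |B|) for two voxel sets."""
    a, b = set(segmentation), set(reference)
    if not a and not b:
        return 1.0  # convention chosen here: two empty masks agree perfectly
    return 2 * len(a & b) / (len(a) + len(b))


# Hypothetical voxel coordinates (x, y, z) labelled as skull.
automatic = {(0, 0, 0), (0, 0, 1), (0, 1, 0)}
manual = {(0, 0, 1), (0, 1, 0), (1, 1, 1)}
score = dice_coefficient(automatic, manual)
```

A score of 1 means identical masks and 0 means no overlap.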
Demonstration of a software design and statistical analysis methodology with application to patient outcomes data sets.

Science.gov (United States)

Mayo, Charles; Conners, Steve; Warren, Christopher; Miller, Robert; Court, Laurence; Popple, Richard

2013-11-01

With the emergence of clinical outcomes databases as tools used routinely within institutions comes the need for software tools to support automated statistical analysis of these large data sets and intrainstitutional exchange from independent federated databases to support data pooling. In this paper, the authors present a design approach and analysis methodology that addresses both issues. A software application was constructed to automate analysis of patient outcomes data using a wide range of statistical metrics, by combining use of C#.Net and R code. The accuracy and speed of the code were evaluated using benchmark data sets. The approach provides data needed to evaluate combinations of statistical measurements for ability to identify patterns of interest in the data. Through application of the tools to a benchmark data set for dose-response threshold and to SBRT lung data sets, an algorithm was developed that uses receiver operating characteristic curves to identify a threshold value and combines use of contingency tables, Fisher exact tests, Welch t-tests, and Kolmogorov-Smirnov tests to filter the large data set to identify values demonstrating dose-response. Kullback-Leibler divergences were used to provide additional confirmation. The work demonstrates the viability of the design approach and the software tool for analysis of large data sets.
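As one small piece of such a pipeline, a Kullback-Leibler divergence between two discrete distributions can be computed as below. The authors' tool combines C#.Net and R; Python is used here only for a self-contained sketch, and the histograms are hypothetical.

```python
from math import log


def kl_divergence(p, q):
    """D(P || Q) = sum of p_i * ln(p_i / q_i) over bins with p_i > 0.

    Both inputs are probability vectors over the same bins, and q_i must
    be positive wherever p_i is positive.
    """
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical normalized histograms of a dose metric in two patient groups.
with_toxicity = [0.5, 0.5]
without_toxicity = [0.25, 0.75]
divergence = kl_divergence(with_toxicity, without_toxicity)
```

The divergence is zero when the two histograms coincide and grows as they separate; note that it is not symmetric in its two arguments.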
Sources of Safety Data and Statistical Strategies for Design and Analysis: Clinical Trials.

Science.gov (United States)

Zink, Richard C; Marchenko, Olga; Sanchez-Kam, Matilde; Ma, Haijun; Jiang, Qi

2018-03-01

There has been an increased emphasis on the proactive and comprehensive evaluation of safety endpoints to ensure patient well-being throughout the medical product life cycle. In fact, depending on the severity of the underlying disease, it is important to plan for a comprehensive safety evaluation at the start of any development program. Statisticians should be intimately involved in this process and contribute their expertise to study design, safety data collection, analysis, reporting (including data visualization), and interpretation. In this manuscript, we review the challenges associated with the analysis of safety endpoints and describe the safety data that are available to influence the design and analysis of premarket clinical trials. We share our recommendations for the statistical and graphical methodologies necessary to appropriately analyze, report, and interpret safety outcomes, and we discuss the advantages and disadvantages of safety data obtained from clinical trials compared to other sources. Clinical trials are an important source of safety data that contribute to the totality of safety information available to generate evidence for regulators, sponsors, payers, physicians, and patients. This work is a result of the efforts of the American Statistical Association Biopharmaceutical Section Safety Working Group.
Statistical significance of epidemiological data. Seminar: Evaluation of epidemiological studies

International Nuclear Information System (INIS)

Weber, K.H.

1993-01-01

In stochastic damages, the numbers of events, e.g. the persons who are affected by or have died of cancer, and thus the relative frequencies (incidence or mortality) are binomially distributed random variables. Their statistical fluctuations can be characterized by confidence intervals. For epidemiologic questions, especially for the analysis of stochastic damages in the low dose range, the following issues are interesting: - Is a sample (a group of persons) with a definite observed damage frequency part of the whole population? - Is an observed frequency difference between two groups of persons random or statistically significant? - Is an observed increase or decrease of the frequencies with increasing dose random or statistically significant and how large is the regression coefficient (= risk coefficient) in this case? These problems can be solved by statistical tests. So-called distribution-free tests and tests which are not bound to the supposition of normal distribution are of particular interest, such as: - χ² independence test (test in contingency tables); - Fisher-Yates-test; - trend test according to Cochran; - rank correlation test given by Spearman. These tests are explained in terms of selected epidemiologic data, e.g. of leukaemia clusters, of the cancer mortality of the Japanese A-bomb survivors especially in the low dose range as well as on the sample of the cancer mortality in the high background area in Yangjiang (China). (orig.) [de
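The second question (is a frequency difference between two groups random?) can be sketched with the Fisher-Yates exact test for a 2x2 table. The two-sided rule used here, summing all tables no more probable than the observed one, is one common convention and is an assumption; the counts are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided exact p-value for the 2x2 table [[a, b], [c, d]].

    Rows are the two groups, columns are cases and non-cases. With all
    margins fixed, the top-left cell follows a hypergeometric distribution.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    total = row1 + row2

    def probability(x):
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    observed = probability(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum every table that is no more likely than the observed one.
    return sum(probability(x) for x in range(lowest, highest + 1)
               if probability(x) <= observed * (1 + 1e-9))


# Hypothetical counts: 3 cases among 4 exposed, 1 case among 4 unexposed.
p_value = fisher_exact_two_sided(3, 1, 1, 3)
```

With groups this small the difference is far from significant, which illustrates why low-dose studies with few events rarely yield firm conclusions.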
A Guideline to Univariate Statistical Analysis for LC/MS-Based Untargeted Metabolomics-Derived Data

Directory of Open Access Journals (Sweden)

Maria Vinaixa

2012-10-01

Several metabolomic software programs provide methods for peak picking, retention time alignment and quantification of metabolite features in LC/MS-based metabolomics. Statistical analysis, however, is needed in order to discover those features significantly altered between samples. By comparing the retention time and MS/MS data of a model compound to that from the altered feature of interest in the research sample, metabolites can then be unequivocally identified. This paper reports on a comprehensive overview of a workflow for statistical analysis to rank relevant metabolite features that will be selected for further MS/MS experiments. We focus on univariate data analysis applied in parallel on all detected features. Characteristics and challenges of this analysis are discussed and illustrated using four different real LC/MS untargeted metabolomic datasets. We demonstrate the influence of considering or violating mathematical assumptions on which univariate statistical tests rely, using high-dimensional LC/MS datasets. Issues in data analysis such as determination of sample size, analytical variation, assumption of normality and homoscedasticity, or correction for multiple testing are discussed and illustrated in the context of our four untargeted LC/MS working examples.
Mourning dove hunting regulation strategy based on annual harvest statistics and banding data

Science.gov (United States)

Otis, D.L.

2006-01-01

Although managers should strive to base game bird harvest management strategies on mechanistic population models, monitoring programs required to build and continuously update these models may not be in place. Alternatively, if estimates of total harvest and harvest rates are available, then population estimates derived from these harvest data can serve as the basis for making hunting regulation decisions based on population growth rates derived from these estimates. I present a statistically rigorous approach for regulation decision-making using a hypothesis-testing framework and an assumed framework of 3 hunting regulation alternatives. I illustrate and evaluate the technique with historical data on the mid-continent mallard (Anas platyrhynchos) population. I evaluate the statistical properties of the hypothesis-testing framework using the best available data on mourning doves (Zenaida macroura). I use these results to discuss practical implementation of the technique as an interim harvest strategy for mourning doves until reliable mechanistic population models and associated monitoring programs are developed.
42 CFR 417.568 - Adequate financial records, statistical data, and cost finding.

Science.gov (United States)

2010-10-01

... this section, on the accrual method of accounting. (3) For governmental institutions that use a cash basis of accounting, cost data developed on this basis is acceptable. However, only depreciation on... definitions and accounting, statistics, and reporting practices that are widely accepted in the health care...
A rank-based algorithm of differential expression analysis for small cell line data with statistical control.

Science.gov (United States)

Li, Xiangyu; Cai, Hao; Wang, Xianlong; Ao, Lu; Guo, You; He, Jun; Gu, Yunyan; Qi, Lishuang; Guan, Qingzhou; Lin, Xu; Guo, Zheng

2017-10-13

To detect differentially expressed genes (DEGs) in small-scale cell line experiments, usually with only two or three technical replicates for each state, the commonly used statistical methods such as significance analysis of microarrays (SAM), limma and RankProd (RP) lack statistical power, while the fold change method lacks any statistical control. In this study, we demonstrated that the within-sample relative expression orderings (REOs) of gene pairs were highly stable among technical replicates of a cell line but often widely disrupted after certain treatments such as gene knockdown, gene transfection and drug treatment. Based on this finding, we customized the RankComp algorithm, previously designed for individualized differential expression analysis through REO comparison, to identify DEGs with certain statistical control for small-scale cell line data. In both simulated and real data, the new algorithm, named CellComp, exhibited high precision with much higher sensitivity than the original RankComp, SAM, limma and RP methods. Therefore, CellComp provides an efficient tool for analyzing small-scale cell line data. © The Author 2017. Published by Oxford University Press.
The effect of project-based learning on students' statistical literacy levels for data representation

Science.gov (United States)

Koparan, Timur; Güven, Bülent

2015-07-01

The aim of this study is to determine the effect of the project-based learning approach on 8th grade secondary-school students' statistical literacy levels for data representation. To achieve this goal, a test consisting of 12 open-ended questions was developed in accordance with the views of experts. Seventy 8th grade secondary-school students, 35 in the experimental group and 35 in the control group, took this test twice, once before the application and once after the application. All the raw scores were converted into linear points using the Winsteps 3.72 modelling program, which performs the Rasch analysis and t-tests, and an ANCOVA was carried out with the linear points. Based on the findings, it was concluded that the project-based learning approach increases students' level of statistical literacy for data representation. Students' levels of statistical literacy before and after the application were shown through the obtained person-item maps.
Data-driven inference for the spatial scan statistic

Directory of Open Access Journals (Sweden)

Duczmal Luiz H

2011-08-01

Background: Kulldorff's spatial scan statistic for aggregated area maps searches for clusters of cases without specifying their size (number of areas) or geographic location in advance. Their statistical significance is tested while adjusting for the multiple testing inherent in such a procedure. However, as is shown in this work, this adjustment is not done in an even manner for all possible cluster sizes. Results: A modification is proposed to the usual inference test of the spatial scan statistic, incorporating additional information about the size of the most likely cluster found. A new interpretation of the results of the spatial scan statistic is given, posing a modified inference question: what is the probability that the null hypothesis is rejected for the original observed cases map with a most likely cluster of size k, taking into account only those most likely clusters of size k found under null hypothesis for comparison? This question is especially important when the p-value computed by the usual inference process is near the alpha significance level, regarding the correctness of the decision based on this inference. Conclusions: A practical procedure is provided to make more accurate inferences about the most likely cluster found by the spatial scan statistic.
IEEE Std 101-1987: IEEE guide for the statistical analysis of thermal life test data

International Nuclear Information System (INIS)

Anon.

1992-01-01

This revision of IEEE Std 101-1972 describes statistical analyses for data from thermally accelerated aging tests. It explains the basis and use of statistical calculations for an engineer or scientist. Accelerated test procedures usually call for a number of specimens to be aged at each of several temperatures appreciably above normal operating temperatures. High temperatures are chosen to produce specimen failures (according to specified failure criteria) in typically one week to one year. The test objective is to determine the dependence of median life on temperature from the data, and to estimate, by extrapolation, the median life to be expected at service temperature. This guide presents methods for analyzing such data and for comparing test data on different materials
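The extrapolation step can be sketched as a straight-line fit of log life against reciprocal absolute temperature, followed by evaluation at the service temperature. The Arrhenius-type form log10(life) = a + b / T is assumed here because it is the usual model for thermal aging; the abstract does not state the model, and the temperatures and lives below are hypothetical.

```python
from math import log10

KELVIN_OFFSET = 273.15


def fit_log_life_line(temps_c, median_lives_h):
    """Least-squares fit of log10(life) = a + b / T, with T in kelvin."""
    x = [1.0 / (t + KELVIN_OFFSET) for t in temps_c]
    y = [log10(life) for life in median_lives_h]
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def extrapolate_median_life(a, b, temp_c):
    """Median life in hours predicted by the fitted line at temp_c."""
    return 10 ** (a + b / (temp_c + KELVIN_OFFSET))


# Hypothetical accelerated-aging results: test temperature (deg C), median life (h).
test_temps_c = [180.0, 200.0, 220.0]
test_lives_h = [5000.0, 1500.0, 500.0]
a, b = fit_log_life_line(test_temps_c, test_lives_h)
service_life_h = extrapolate_median_life(a, b, 130.0)
```

Because the service temperature lies well outside the tested range, the estimate is only as good as the assumed linear relationship, which is why a full analysis also reports confidence limits on the extrapolated life.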
Diffusion-Based Density-Equalizing Maps: an Interdisciplinary Approach to Visualizing Homicide Rates and Other Georeferenced Statistical Data

Science.gov (United States)

Mazzitello, Karina I.; Candia, Julián

2012-12-01

In every country, public and private agencies allocate extensive funding to collect large-scale statistical data, which in turn are studied and analyzed in order to determine local, regional, national, and international policies regarding all aspects relevant to the welfare of society. One important aspect of that process is the visualization of statistical data with embedded geographical information, which most often relies on archaic methods such as maps colored according to graded scales. In this work, we apply nonstandard visualization techniques based on physical principles. We illustrate the method with recent statistics on homicide rates in Brazil and their correlation to other publicly available data. This physics-based approach provides a novel tool that can be used by interdisciplinary teams investigating statistics and model projections in a variety of fields such as economics and gross domestic product research, public health and epidemiology, sociodemographics, political science, business and marketing, and many others.
Summary Statistics for Homemade "Play Dough" -- Data Acquired at LLNL

Energy Technology Data Exchange (ETDEWEB)

Kallman, J S; Morales, K E; Whipple, R E; Huber, R D; Martz, A; Brown, W D; Smith, J A; Schneberk, D J; Martz, Jr., H E; White, III, W T

2010-03-11

Using x-ray computerized tomography (CT), we have characterized the x-ray linear attenuation coefficients (LAC) of a homemade Play Dough™-like material, designated as PDA. Table 1 gives the first-order statistics for each of four CT measurements, estimated with a Gaussian kernel density estimator (KDE) analysis. The mean values of the LAC range from a high of about 2700 LMHU_D at 100 kVp to a low of about 1200 LMHU_D at 300 kVp. The standard deviation of each measurement is around 10% to 15% of the mean. The entropy covers the range from 6.0 to 7.4. Ordinarily, we would model the LAC of the material and compare the modeled values to the measured values. In this case, however, we did not have the detailed chemical composition of the material and therefore did not model the LAC. Using a method recently proposed by Lawrence Livermore National Laboratory (LLNL), we estimate the value of the effective atomic number, Z_eff, to be near 10. LLNL prepared about 50 mL of the homemade 'Play Dough' in a polypropylene vial and firmly compressed it immediately prior to the x-ray measurements. We used the computer program IMGREC to reconstruct the CT images. The values of the key parameters used in the data capture and image reconstruction are given in this report. Additional details may be found in the experimental SOP and a separate document. To characterize the statistical distribution of LAC values in each CT image, we first isolated an 80% central-core segment of volume elements ('voxels') lying completely within the specimen, away from the walls of the polypropylene vial. All of the voxels within this central core, including those comprised of voids and inclusions, are included in the statistics. We then calculated the mean value, standard deviation and entropy for (a) the four image segments and for (b) their digital gradient images. (A digital gradient image of a given image was obtained by taking the absolute value of the difference

Energy statistics yearbook 2002

International Nuclear Information System (INIS)

2005-01-01

The Energy Statistics Yearbook 2002 is a comprehensive collection of international energy statistics prepared by the United Nations Statistics Division. It is the forty-sixth in a series of annual compilations which commenced under the title World Energy Supplies in Selected Years, 1929-1950. It updates the statistical series shown in the previous issue. Supplementary series of monthly and quarterly data on production of energy may be found in the Monthly Bulletin of Statistics. The principal objective of the Yearbook is to provide a global framework of comparable data on long-term trends in the supply of mainly commercial primary and secondary forms of energy. Data for each type of fuel and aggregate data for the total mix of commercial fuels are shown for individual countries and areas and are summarized into regional and world totals. The data are compiled primarily from the annual energy questionnaire distributed by the United Nations Statistics Division and supplemented by official national statistical publications. Where official data are not available or are inconsistent, estimates are made by the Statistics Division based on governmental, professional or commercial materials. Estimates include, but are not limited to, extrapolated data based on partial year information, use of annual trends, trade data based on partner country reports, breakdowns of aggregated data as well as analysis of current energy events and activities
Energy statistics yearbook 2001

International Nuclear Information System (INIS)

2004-01-01

The Energy Statistics Yearbook 2001 is a comprehensive collection of international energy statistics prepared by the United Nations Statistics Division. It is the forty-fifth in a series of annual compilations which commenced under the title World Energy Supplies in Selected Years, 1929-1950. It updates the statistical series shown in the previous issue. Supplementary series of monthly and quarterly data on production of energy may be found in the Monthly Bulletin of Statistics. The principal objective of the Yearbook is to provide a global framework of comparable data on long-term trends in the supply of mainly commercial primary and secondary forms of energy. Data for each type of fuel and aggregate data for the total mix of commercial fuels are shown for individual countries and areas and are summarized into regional and world totals. The data are compiled primarily from the annual energy questionnaire distributed by the United Nations Statistics Division and supplemented by official national statistical publications. Where official data are not available or are inconsistent, estimates are made by the Statistics Division based on governmental, professional or commercial materials. Estimates include, but are not limited to, extrapolated data based on partial year information, use of annual trends, trade data based on partner country reports, breakdowns of aggregated data as well as analysis of current energy events and activities
Energy statistics yearbook 2000

International Nuclear Information System (INIS)

2002-01-01

The Energy Statistics Yearbook 2000 is a comprehensive collection of international energy statistics prepared by the United Nations Statistics Division. It is the forty-third in a series of annual compilations which commenced under the title World Energy Supplies in Selected Years, 1929-1950. It updates the statistical series shown in the previous issue. Supplementary series of monthly and quarterly data on production of energy may be found in the Monthly Bulletin of Statistics. The principal objective of the Yearbook is to provide a global framework of comparable data on long-term trends in the supply of mainly commercial primary and secondary forms of energy. Data for each type of fuel and aggregate data for the total mix of commercial fuels are shown for individual countries and areas and are summarized into regional and world totals. The data are compiled primarily from the annual energy questionnaire distributed by the United Nations Statistics Division and supplemented by official national statistical publications. Where official data are not available or are inconsistent, estimates are made by the Statistics Division based on governmental, professional or commercial materials. Estimates include, but are not limited to, extrapolated data based on partial year information, use of annual trends, trade data based on partner country reports, breakdowns of aggregated data as well as analysis of current energy events and activities
Statistical Techniques For Real-time Anomaly Detection Using Spark Over Multi-source VMware Performance Data

Energy Technology Data Exchange (ETDEWEB)

Solaimani, Mohiuddin [Univ. of Texas-Dallas, Richardson, TX (United States); Iftekhar, Mohammed [Univ. of Texas-Dallas, Richardson, TX (United States); Khan, Latifur [Univ. of Texas-Dallas, Richardson, TX (United States); Thuraisingham, Bhavani [Univ. of Texas-Dallas, Richardson, TX (United States); Ingram, Joey Burton [Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)

2015-09-01

Anomaly detection refers to the identification of an irregular or unusual pattern which deviates from what is standard, normal, or expected. Such deviated patterns typically correspond to samples of interest and are assigned different labels in different domains, such as outliers, anomalies, exceptions, or malware. Detecting anomalies in fast, voluminous streams of data is a formidable challenge. This paper presents a novel, generic, real-time distributed anomaly detection framework for heterogeneous streaming data where anomalies appear as a group. We have developed a distributed statistical approach to build a model and later use it to detect anomalies. As a case study, we investigate group anomaly detection for a VMware-based cloud data center, which maintains a large number of virtual machines (VMs). We have built our framework using Apache Spark to get higher throughput and lower data processing time on streaming data. We have developed a window-based statistical anomaly detection technique to detect anomalies that appear sporadically. We then relaxed this constraint with higher accuracy by implementing a cluster-based technique to detect sporadic and continuous anomalies. We conclude that our cluster-based technique outperforms other statistical techniques with higher accuracy and lower processing time.
Statistical transformation and the interpretation of inpatient glucose control data from the intensive care unit.

Science.gov (United States)

Saulnier, George E; Castro, Janna C; Cook, Curtiss B

2014-05-01

Glucose control can be problematic in critically ill patients. We evaluated the impact of statistical transformation on interpretation of intensive care unit inpatient glucose control data. Point-of-care blood glucose (POC-BG) data derived from patients in the intensive care unit for 2011 was obtained. Box-Cox transformation of POC-BG measurements was performed, and distribution of data was determined before and after transformation. Different data subsets were used to establish statistical upper and lower control limits. Exponentially weighted moving average (EWMA) control charts constructed from April, October, and November data determined whether out-of-control events could be identified differently in transformed versus nontransformed data. A total of 8679 POC-BG values were analyzed. POC-BG distributions in nontransformed data were skewed but approached normality after transformation. EWMA control charts revealed differences in projected detection of out-of-control events. In April, an out-of-control process resulting in the lower control limit being exceeded was identified at sample 116 in nontransformed data but not in transformed data. October transformed data detected an out-of-control process exceeding the upper control limit at sample 27 that was not detected in nontransformed data. Nontransformed November results remained in control, but transformation identified an out-of-control event less than 10 samples into the observation period. Using statistical methods to assess population-based glucose control in the intensive care unit could alter conclusions about the effectiveness of care processes for managing hyperglycemia. Further study is required to determine whether transformed versus nontransformed data change clinical decisions about the interpretation of care or intervention results. © 2014 Diabetes Technology Society.
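The two ingredients of this analysis, a Box-Cox transformation and an EWMA control chart, can be sketched as follows. The smoothing constant 0.2, the limit width of 3 standard deviations, and the fixed Box-Cox power are illustrative assumptions (in practice the power is estimated from the data), and the readings are hypothetical rather than taken from the study.

```python
from math import log, sqrt


def box_cox(value, power):
    """Box-Cox transform of a positive value; the log is used when power is 0."""
    if power == 0:
        return log(value)
    return (value ** power - 1) / power


def ewma_chart(values, target, sigma, smoothing=0.2, width=3.0):
    """Return one (ewma, lower_limit, upper_limit) tuple per sample.

    ewma_i = smoothing * x_i + (1 - smoothing) * ewma_(i-1), starting at target.
    The limits widen with the sample index i toward their steady-state value.
    """
    ewma = target
    chart = []
    for i, x in enumerate(values, start=1):
        ewma = smoothing * x + (1 - smoothing) * ewma
        spread = sqrt(smoothing / (2 - smoothing)
                      * (1 - (1 - smoothing) ** (2 * i)))
        half_width = width * sigma * spread
        chart.append((ewma, target - half_width, target + half_width))
    return chart


def first_out_of_control(chart):
    """1-based index of the first sample outside its limits, or None."""
    for i, (ewma, lower, upper) in enumerate(chart, start=1):
        if ewma < lower or ewma > upper:
            return i
    return None


# Hypothetical glucose readings (mg/dL) charted on the raw scale.
readings = [100.0, 100.0, 200.0]
signal_at = first_out_of_control(ewma_chart(readings, target=100.0, sigma=10.0))
```

To chart transformed data, each reading would first be passed through `box_cox`, with the target and sigma estimated on the transformed scale; because skewness inflates sigma on the raw scale, the two charts can signal at different samples, which is the effect the abstract reports.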
Production-distribution of electric power in France: 1997-98 statistical data

International Nuclear Information System (INIS)

1999-01-01

This document was compiled from the annual survey carried out by the French Directorate of Gas, Electricity and Coal (Digec). It brings together the main statistical data about the production, transport and consumption of electric power in France: 1997 and 1998 balance sheets, foreign exchanges, long-term evolutions, production with respect to the different energy sources, consumption in the different departments and regions. (J.S.)
Statistical data analysis

International Nuclear Information System (INIS)

Hahn, A.A.

1994-11-01

The complexity of instrumentation sometimes requires data analysis to be done before the result is presented to the control room. This tutorial reviews some of the theoretical assumptions underlying the more popular forms of data analysis and presents simple examples to illuminate the advantages and hazards of different techniques.
Statistics

CERN Document Server

Hayslett, H T

1991-01-01

Statistics covers the basic principles of Statistics. The book starts by tackling the importance and the two kinds of statistics; the presentation of sample data; the definition, illustration and explanation of several measures of location; and the measures of variation. The text then discusses elementary probability, the normal distribution and the normal approximation to the binomial. Testing of statistical hypotheses and tests of hypotheses about the theoretical proportion of successes in a binomial population and about the theoretical mean of a normal population are explained. The text the
A multivariate statistical study on a diversified data gathering system for nuclear power plants

International Nuclear Information System (INIS)

Samanta, P.K.; Teichmann, T.; Levine, M.M.; Kato, W.Y.

1989-02-01

In this report, multivariate statistical methods are presented and applied to demonstrate their use in analyzing nuclear power plant operational data. For analyses of nuclear power plant events, approaches are presented for detecting malfunctions and degradations within the course of the event. At the system level, approaches are investigated as a means of diagnosis of system level performance. This involves the detection of deviations from normal performance of the system. The input data analyzed are the measurable physical parameters, such as steam generator level, pressurizer water level, auxiliary feedwater flow, etc. The study provides the methodology and illustrative examples based on data gathered from simulation of nuclear power plant transients and computer simulation of a plant system performance (due to lack of easily accessible operational data). Such an approach, once fully developed, can be used to explore statistically the detection of failure trends and patterns and prevention of conditions with serious safety implications. 33 refs., 18 figs., 9 tabs
Statistical Clustering and Compositional Modeling of Iapetus VIMS Spectral Data

Science.gov (United States)

Pinilla-Alonso, N.; Roush, T. L.; Marzo, G.; Dalle Ore, C. M.; Cruikshank, D. P.

2009-12-01

It has long been known that the surfaces of Saturn's major satellites are predominantly icy objects [e.g. 1 and references therein]. Since 2004, these bodies have been the subject of observations by the Cassini-VIMS (Visual and Infrared Mapping Spectrometer) experiment [2]. Iapetus has the unique property that the hemisphere centered on the apex of its locked synchronous orbital motion around Saturn has a very low geometrical albedo of 2-6%, while the opposite hemisphere is about 10 times more reflective. The nature and origin of the dark material of Iapetus has remained a question since its discovery [3 and references therein]. The nature of this material and how it is distributed on the surface of this body can shed new light on our knowledge of the Saturnian system. We apply statistical clustering [4] and theoretical modeling [5,6] to address the surface composition of Iapetus. The VIMS data evaluated were obtained during the second flyby of Iapetus, in September 2007. This close approach allowed VIMS to obtain spectra at relatively high spatial resolution, ~1-22 km/pixel. The data we study sampled the trailing hemisphere and part of the dark leading one. The statistical clustering [4] is used to identify statistically distinct spectra on Iapetus. The composition of these distinct spectra is evaluated using theoretical models [5,6]. We thank Allan Meyer for his help. This research was supported by an appointment to the NASA Postdoctoral Program at the Ames Research Center, administered by Oak Ridge Associated Universities through a contract with NASA. [1] Coradini, A. et al., 2009, Earth, Moon & Planets, 105, 289-310. [2] Brown et al., 2004, Space Science Reviews, 115, 111-168. [3] Cruikshank, D. et al., 2008, Icarus, 193, 334-343. [4] Marzo, G. et al., 2008, Journal of Geophysical Research, 113, E12, CiteID E12009. [5] Hapke, B., 1993, Theory of reflectance and emittance spectroscopy, Cambridge University Press. [6] Shkuratov, Y. et al., 1999, Icarus, 137, 235-246.
Evaluation of the Wishart test statistics for polarimetric SAR data

DEFF Research Database (Denmark)

Skriver, Henning; Nielsen, Allan Aasbjerg; Conradsen, Knut

2003-01-01

A test statistic for equality of two covariance matrices following the complex Wishart distribution has previously been used in new algorithms for change detection, edge detection and segmentation in polarimetric SAR images. Previously, the results for change detection and edge detection have been...... quantitatively evaluated. This paper deals with the evaluation of segmentation. A segmentation performance measure originally developed for single-channel SAR images has been extended to polarimetric SAR images, and used to evaluate segmentation for a merge-using-moment algorithm for polarimetric SAR data....
Statistics Clinic

Science.gov (United States)

Feiveson, Alan H.; Foy, Millennia; Ploutz-Snyder, Robert; Fiedler, James

2014-01-01

Do you have elevated p-values? Is the data analysis process getting you down? Do you experience anxiety when you need to respond to criticism of statistical methods in your manuscript? You may be suffering from Insufficient Statistical Support Syndrome (ISSS). For symptomatic relief of ISSS, come for a free consultation with JSC biostatisticians at our help desk during the poster sessions at the HRP Investigators Workshop. Get answers to common questions about sample size, missing data, multiple testing, when to trust the results of your analyses and more. Side effects may include sudden loss of statistics anxiety, improved interpretation of your data, and increased confidence in your results.
A statistical model for interpreting computerized dynamic posturography data

Science.gov (United States)

Feiveson, Alan H.; Metter, E. Jeffrey; Paloski, William H.

2002-01-01

Computerized dynamic posturography (CDP) is widely used for assessment of altered balance control. CDP trials are quantified using the equilibrium score (ES), which ranges from zero to 100, as a decreasing function of peak sway angle. The problem of how best to model and analyze ESs from a controlled study is considered. The ES often exhibits a skewed distribution in repeated trials, which can lead to incorrect inference when applying standard regression or analysis of variance models. Furthermore, CDP trials are terminated when a patient loses balance. In these situations, the ES is not observable, but is assigned the lowest possible score--zero. As a result, the response variable has a mixed discrete-continuous distribution, further compromising inference obtained by standard statistical methods. Here, we develop alternative methodology for analyzing ESs under a stochastic model extending the ES to a continuous latent random variable that always exists, but is unobserved in the event of a fall. Loss of balance occurs conditionally, with probability depending on the realized latent ES. After fitting the model by a form of quasi-maximum-likelihood, one may perform statistical inference to assess the effects of explanatory variables. An example is provided, using data from the NIH/NIA Baltimore Longitudinal Study on Aging.
Multivariate mixed linear model analysis of longitudinal data: an information-rich statistical technique for analyzing disease resistance data

Science.gov (United States)

The mixed linear model (MLM) is currently among the most advanced and flexible statistical modeling techniques and its use in tackling problems in plant pathology has begun surfacing in the literature. The longitudinal MLM is a multivariate extension that handles repeatedly measured data, such as r...
Pilot points method for conditioning multiple-point statistical facies simulation on flow data

Science.gov (United States)

Ma, Wei; Jafarpour, Behnam

2018-05-01

We propose a new pilot points method for conditioning discrete multiple-point statistical (MPS) facies simulation on dynamic flow data. While conditioning MPS simulation on static hard data is straightforward, their calibration against nonlinear flow data is nontrivial. The proposed method generates conditional models from a conceptual model of geologic connectivity, known as a training image (TI), by strategically placing and estimating pilot points. To place pilot points, a score map is generated based on three sources of information: (i) the uncertainty in facies distribution, (ii) the model response sensitivity information, and (iii) the observed flow data. Once the pilot points are placed, the facies values at these points are inferred from production data and then are used, along with available hard data at well locations, to simulate a new set of conditional facies realizations. While facies estimation at the pilot points can be performed using different inversion algorithms, in this study the ensemble smoother (ES) is adopted to update permeability maps from production data, which are then used to statistically infer facies types at the pilot point locations. The developed method combines the information in the flow data and the TI by using the former to infer facies values at selected locations away from the wells and the latter to ensure consistent facies structure and connectivity away from measurement locations. Several numerical experiments are used to evaluate the performance of the developed method and to discuss its important properties.
PANDA-view: An easy-to-use tool for statistical analysis and visualization of quantitative proteomics data.

Science.gov (United States)

Chang, Cheng; Xu, Kaikun; Guo, Chaoping; Wang, Jinxia; Yan, Qi; Zhang, Jian; He, Fuchu; Zhu, Yunping

2018-05-22

Compared with the numerous software tools developed for identification and quantification of -omics data, there remains a lack of suitable tools for both downstream analysis and data visualization. To help researchers better understand the biological meanings in their -omics data, we present an easy-to-use tool, named PANDA-view, for both statistical analysis and visualization of quantitative proteomics data and other -omics data. PANDA-view contains various kinds of analysis methods such as normalization, missing value imputation, statistical tests, clustering and principal component analysis, as well as the most commonly-used data visualization methods including an interactive volcano plot. Additionally, it provides user-friendly interfaces for protein-peptide-spectrum representation of the quantitative proteomics data. PANDA-view is freely available at https://sourceforge.net/projects/panda-view/. Contact: 1987ccpacer@163.com and zhuyunping@gmail.com. Supplementary data are available at Bioinformatics online.
Statistical evaluation of internal contamination data in the man following the Chernobyl accident

International Nuclear Information System (INIS)

Tarroni, G.; Battisti, P.; Melandri, C.; Castellani, C.M.; Formignani, M.

1989-01-01

The main implications of general interest derived from the statistical analysis of the internal human contamination data obtained by ENEA-PAS with Whole Body Counter measurements performed in Bologna as a consequence of the Chernobyl accident are presented. In particular, the trend with time of the individual body activity of members of a homogeneous group, the variability of individual contamination in relation to the mean contamination, the statistical distribution of the data, the significance of mean values concerning small, homogeneous groups of subjects, and the difference between subjects of different sex and its trend with time are examined. Finally, the substantial independence, on the whole, of the individual committed dose equivalent evaluation due to the Chernobyl contamination from the hypothesized values of the metabolic parameters is pointed out when the evaluation is performed on the basis of direct measurements with a Whole Body Counter.
Statistics with JMP graphs, descriptive statistics and probability

CERN Document Server

Goos, Peter

2015-01-01

Peter Goos, Department of Statistics, University of Leuven, Faculty of Bio-Science Engineering and University of Antwerp, Faculty of Applied Economics, Belgium. David Meintrup, Department of Mathematics and Statistics, University of Applied Sciences Ingolstadt, Faculty of Mechanical Engineering, Germany. Thorough presentation of introductory statistics and probability theory, with numerous examples and applications using JMP. Descriptive Statistics and Probability provides an accessible and thorough overview of the most important descriptive statistics for nominal, ordinal and quantitative data with partic
High-dimensional data: p >> n in mathematical statistics and bio-medical applications

OpenAIRE

Van De Geer, Sara A.; Van Houwelingen, Hans C.

2004-01-01

The workshop 'High-dimensional data: p >> n in mathematical statistics and bio-medical applications' was held at the Lorentz Center in Leiden from 9 to 20 September 2002. This special issue of Bernoulli contains a selection of papers presented at that workshop. The introduction of high-throughput micro-array technology to measure gene-expression levels and the publication of the pioneering paper by Golub et al. (1999) has brought to life a whole new branch of data analysis under the name of...
Parity Specific Birth Rates for West Germany: An Attempt to Combine Survey Data and Vital Statistics

OpenAIRE

Kreyenfeld, Michaela

2014-01-01

In this paper, we combine vital statistics and survey data to obtain parity specific birth rates for West Germany. Since vital statistics do not provide birth parity information, one is confined to using estimates. The robustness of these estimates is an issue, which is unfortunately only rarely addressed when fertility indicators for (West) Germany are reported. In order to check how reliable our results are, we estimate confidence intervals and compare them to results from survey data and e...

A spatial scan statistic for compound Poisson data.

Science.gov (United States)

Rosychuk, Rhonda J; Chang, Hsing-Ming

2013-12-20

The topic of spatial cluster detection gained attention in statistics during the late 1980s and early 1990s. Effort has been devoted to the development of methods for detecting spatial clustering of cases and events in the biological sciences, astronomy and epidemiology. More recently, research has examined detecting clusters of correlated count data associated with health conditions of individuals. Such a method allows researchers to examine spatial relationships of disease-related events rather than just incident or prevalent cases. We introduce a spatial scan test that identifies clusters of events in a study region. Because an individual case may have multiple (repeated) events, we base the test on a compound Poisson model. We illustrate our method for cluster detection on emergency department visits, where individuals may make multiple disease-related visits. Copyright © 2013 John Wiley & Sons, Ltd.
Integrated Data Collection Analysis (IDCA) Program - Statistical Analysis of RDX Standard Data Sets

Energy Technology Data Exchange (ETDEWEB)

Sandstrom, Mary M. [Los Alamos National Lab. (LANL), Los Alamos, NM (United States); Brown, Geoffrey W. [Los Alamos National Lab. (LANL), Los Alamos, NM (United States); Preston, Daniel N. [Los Alamos National Lab. (LANL), Los Alamos, NM (United States); Pollard, Colin J. [Los Alamos National Lab. (LANL), Los Alamos, NM (United States); Warner, Kirstin F. [Naval Surface Warfare Center (NSWC), Indian Head, MD (United States). Indian Head Division; Sorensen, Daniel N. [Naval Surface Warfare Center (NSWC), Indian Head, MD (United States). Indian Head Division; Remmers, Daniel L. [Naval Surface Warfare Center (NSWC), Indian Head, MD (United States). Indian Head Division; Phillips, Jason J. [Sandia National Lab. (SNL-NM), Albuquerque, NM (United States); Shelley, Timothy J. [Air Force Research Lab. (AFRL), Tyndall AFB, FL (United States); Reyes, Jose A. [Applied Research Associates, Tyndall AFB, FL (United States); Hsu, Peter C. [Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States); Reynolds, John G. [Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)

2015-10-30

The Integrated Data Collection Analysis (IDCA) program is conducting a Proficiency Test for Small-Scale Safety and Thermal (SSST) testing of homemade explosives (HMEs). Described here are statistical analyses of the results for impact, friction, electrostatic discharge, and differential scanning calorimetry analysis of the RDX Type II Class 5 standard. The material was tested as a well-characterized standard several times during the proficiency study to assess differences among participants and the range of results that may arise for well-behaved explosive materials. The analyses show that there are detectable differences among the results from IDCA participants. While these differences are statistically significant, most of them can be disregarded for comparison purposes to assess potential variability when laboratories attempt to measure identical samples using methods assumed to be nominally the same. The results presented in this report include the average sensitivity results for the IDCA participants and the ranges of values obtained. The ranges represent variation about the mean values of the tests of between 26% and 42%. The magnitude of this variation is attributed to differences in operator, method, and environment as well as the use of different instruments that are also of varying age. The results appear to be a good representation of the broader safety testing community based on the range of methods, instruments, and environments included in the IDCA Proficiency Test.
Implementation of statistical analysis methods for medical physics data

International Nuclear Information System (INIS)

Teixeira, Marilia S.; Pinto, Nivia G.P.; Barroso, Regina C.; Oliveira, Luis F.

2009-01-01

The objective of biomedical research with different types of radiation is to contribute to the understanding of the basic physics and biochemistry of biological systems, to disease diagnosis, and to the development of therapeutic techniques. The main benefits are the cure of tumors through therapy, the early detection of diseases through diagnosis, use as a prophylactic measure in blood transfusion, etc. A better understanding of the biological interactions occurring after exposure to radiation is therefore necessary for the optimization of therapeutic procedures and of strategies for reducing radiation-induced effects. The applied physics group of the Physics Institute of UERJ has been working on the characterization of biological samples (human tissues, teeth, saliva, soil, plants, sediments, air, water, organic matrixes, ceramics, fossil material, among others) using X-ray diffraction and X-ray fluorescence. The application of these techniques to the measurement, analysis and interpretation of the characteristics of biological tissues is attracting considerable interest in medical and environmental physics. All quantitative data analysis must begin with the calculation of descriptive statistics (means and standard deviations) in order to obtain a preliminary notion of what the analysis will reveal. It is well known that the high values of standard deviation found in experimental measurements of biological samples can be attributed to biological factors, due to the specific characteristics of each individual (age, gender, environment, dietary habits, etc.). The main objective of this work is the development of a program that applies specific statistical methods to optimize the analysis of experimental data. Because the specialized programs for this analysis are proprietary, another objective of this work is the implementation of a code which is free and can be shared with other research groups. As the program developed since the
Statistical models for the analysis of water distribution system pipe break data

International Nuclear Information System (INIS)

Yamijala, Shridhar; Guikema, Seth D.; Brumbelow, Kelly

2009-01-01

The deterioration of pipes leading to pipe breaks and leaks in urban water distribution systems is of concern to water utilities throughout the world. Pipe breaks and leaks may result in reduction in the water-carrying capacity of the pipes and contamination of water in the distribution systems. Water utilities incur large expenses in the replacement and rehabilitation of water mains, making it critical to evaluate the current and future condition of the system for maintenance decision-making. This paper compares different statistical regression models proposed in the literature for estimating the reliability of pipes in a water distribution system on the basis of short time histories. The goals of these models are to estimate the likelihood of pipe breaks in the future and determine the parameters that most affect the likelihood of pipe breaks. The data set used for the analysis comes from a major US city, and these data include approximately 85,000 pipe segments with nearly 2500 breaks from 2000 through 2005. The results show that the set of statistical models previously proposed for this problem do not provide good estimates with the test data set. However, logistic generalized linear models do provide good estimates of pipe reliability and can be useful for water utilities in planning pipe inspection and maintenance
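A logistic generalized linear model of the kind this abstract favours can be sketched with a single covariate. The example below is only an illustration under stated assumptions: the covariate (a hypothetical 0/1 indicator such as "segment older than 50 years"), the eight made-up records, and the Newton-Raphson fitting routine are not taken from the paper.

```python
import math


def fit_logistic(xs, ys, iterations=25):
    """Fit P(break) = 1 / (1 + exp(-(b0 + b1*x))) by Newton-Raphson."""
    b0 = b1 = 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            w = p * (1.0 - p)
            # Gradient of the log-likelihood.
            g0 += y - p
            g1 += (y - p) * x
            # Entries of the (negated) Hessian.
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system H * delta = g explicitly.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


def break_probability(b0, b1, x):
    """Predicted probability of a break for covariate value x."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))


# Hypothetical records: x = 1 for an old segment, y = 1 if it broke.
old = [0, 0, 0, 0, 1, 1, 1, 1]
broke = [1, 0, 0, 0, 1, 1, 1, 0]
b0, b1 = fit_logistic(old, broke)
```

With a single binary covariate the fitted probabilities simply reproduce the observed break fractions in each group (here 1 in 4 and 3 in 4), which makes the result easy to check by hand; a realistic model would add covariates such as pipe material, diameter, and length.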
Statistical Disclosure Control for Micro-Data Using the R Package sdcMicro

Directory of Open Access Journals (Sweden)

Matthias Templ

2015-10-01

The R package sdcMicro serves as an easy-to-handle, object-oriented S4 class implementation of SDC methods to evaluate and anonymize confidential micro-data sets. It includes all popular disclosure risk and perturbation methods. The package performs automated recalculation of frequency counts, individual and global risk measures, information loss and data utility statistics after each anonymization step. All methods are highly optimized in terms of computational costs to be able to work with large data sets. Reporting facilities that summarize the anonymization process can also be easily used by practitioners. We describe the package and demonstrate its functionality with a complex household survey test data set that has been distributed by the International Household Survey Network.
Research on cloud background infrared radiation simulation based on fractal and statistical data

Science.gov (United States)

Liu, Xingrun; Xu, Qingshan; Li, Xia; Wu, Kaifeng; Dong, Yanbing

2018-02-01

Clouds are an important natural phenomenon, and their radiation causes serious interference for infrared detectors. Based on fractal and statistical data, a method is proposed to realize cloud background simulation, and the cloud infrared radiation data field is assigned using satellite radiation data of clouds. A cloud infrared radiation simulation model is established in MATLAB, and it can generate cloud background infrared images for different cloud types (low cloud, middle cloud, and high cloud) in different months, bands and sensor zenith angles.
Data on electrical energy conservation using high efficiency motors for the confidence bounds using statistical techniques.

Science.gov (United States)

Shaikh, Muhammad Mujtaba; Memon, Abdul Jabbar; Hussain, Manzoor

2016-09-01

In this article, we describe details of the data used in the research paper "Confidence bounds for energy conservation in electric motors: An economical solution using statistical techniques" [1]. The data presented in this paper is intended to show benefits of high efficiency electric motors over the standard efficiency motors of similar rating in the industrial sector of Pakistan. We explain how the data was collected and then processed by means of formulas to show cost effectiveness of energy efficient motors in terms of three important parameters: annual energy saving, cost saving and payback periods. This data can be further used to construct confidence bounds for the parameters using statistical techniques as described in [1].
Statistical data of the uranium industry

International Nuclear Information System (INIS)

1976-01-01

Historical facts and figures of the uranium industry through 1975 are compiled. Areas covered are ore and concentrate purchases; uranium resources; distribution of $10, $15, and $30 reserves; drilling statistics; uranium exploration expenditures; land holdings for uranium mining and exploration; employment; commercial U3O8 sales and requirements; and processing mills.
Statistical analysis and data mining of digital reconstructions of dendritic morphologies

Directory of Open Access Journals (Sweden)

Sridevi Polavaram

2014-12-01

Neuronal morphology is diverse among animal species, developmental stages, brain regions, and cell types. The geometry of individual neurons also varies substantially even within the same cell class. Moreover, specific histological, imaging, and reconstruction methodologies can differentially affect morphometric measures. The quantitative characterization of neuronal arbors is necessary for in-depth understanding of the structure-function relationship in nervous systems. The large collection of community-contributed digitally reconstructed neurons available at NeuroMorpho.Org constitutes a big data research opportunity for neuroscience discovery beyond the approaches typically pursued in single laboratories. To illustrate these potential and related challenges, we present a database-wide statistical analysis of dendritic arbors enabling the quantification of major morphological similarities and differences across broadly adopted metadata categories. Furthermore, we adopt a complementary unsupervised approach based on clustering and dimensionality reduction to identify the main morphological parameters leading to the most statistically informative structural classification. We find that specific combinations of measures related to branching density, overall size, tortuosity, bifurcation angles, arbor flatness, and topological asymmetry can capture anatomically and functionally relevant features of dendritic trees. The reported results only represent a small fraction of the relationships available for data exploration and hypothesis testing enabled by digital sharing of morphological reconstructions.
Statistical analysis and data mining of digital reconstructions of dendritic morphologies.

Science.gov (United States)

Polavaram, Sridevi; Gillette, Todd A; Parekh, Ruchi; Ascoli, Giorgio A

2014-01-01

Neuronal morphology is diverse among animal species, developmental stages, brain regions, and cell types. The geometry of individual neurons also varies substantially even within the same cell class. Moreover, specific histological, imaging, and reconstruction methodologies can differentially affect morphometric measures. The quantitative characterization of neuronal arbors is necessary for in-depth understanding of the structure-function relationship in nervous systems. The large collection of community-contributed digitally reconstructed neurons available at NeuroMorpho.Org constitutes a "big data" research opportunity for neuroscience discovery beyond the approaches typically pursued in single laboratories. To illustrate these potential and related challenges, we present a database-wide statistical analysis of dendritic arbors enabling the quantification of major morphological similarities and differences across broadly adopted metadata categories. Furthermore, we adopt a complementary unsupervised approach based on clustering and dimensionality reduction to identify the main morphological parameters leading to the most statistically informative structural classification. We find that specific combinations of measures related to branching density, overall size, tortuosity, bifurcation angles, arbor flatness, and topological asymmetry can capture anatomically and functionally relevant features of dendritic trees. The reported results only represent a small fraction of the relationships available for data exploration and hypothesis testing enabled by sharing of digital morphological reconstructions.
Statistical analysis of CCSN/SS7 traffic data from working CCS subnetworks

Science.gov (United States)

Duffy, Diane E.; McIntosh, Allen A.; Rosenstein, Mark; Willinger, Walter

1994-04-01

In this paper, we report on an ongoing statistical analysis of actual CCSN traffic data. The data consist of approximately 170 million signaling messages collected from a variety of different working CCS subnetworks. The key findings from our analysis concern: (1) the characteristics of both the telephone call arrival process and the signaling message arrival process; (2) the tail behavior of the call holding time distribution; and (3) the observed performance of the CCSN with respect to a variety of performance and reliability measurements.
Study design and statistical analysis of data in human population studies with the micronucleus assay.

Science.gov (United States)

Ceppi, Marcello; Gallo, Fabio; Bonassi, Stefano

2011-01-01

The most common study design in population studies based on the micronucleus (MN) assay is the cross-sectional study, which is largely performed to evaluate the DNA-damaging effects of exposure to genotoxic agents in the workplace, in the environment, as well as from diet or lifestyle factors. Sample size is still a critical issue in the design of MN studies, since most recent studies considering gene-environment interaction often require a sample size of several hundred subjects, which is in many cases difficult to achieve. The control of confounding is another major threat to the validity of causal inference. The most popular confounders considered in population studies using MN are age, gender and smoking habit. Extensive attention is given to the assessment of effect modification, given the increasing inclusion of biomarkers of genetic susceptibility in the study design. Selected issues concerning the statistical treatment of data have been addressed in this mini-review, starting from data description, which is a critical step of statistical analysis, since it allows the detection of possible errors in the dataset to be analysed and a check of the validity of assumptions required for more complex analyses. Basic issues dealing with statistical analysis of biomarkers are extensively evaluated, including methods to explore the dose-response relationship between two continuous variables and inferential analysis. A critical approach to the use of parametric and non-parametric methods is presented, before addressing the issue of the most suitable multivariate models to fit MN data. In the last decade, the quality of statistical analysis of MN data has certainly evolved, although even nowadays only a small number of studies apply the Poisson model, which is the most suitable method for the analysis of MN data.
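The simplest form of the Poisson model recommended here is a two-group rate comparison with the number of cells scored as the exposure. The sketch below assumes no overdispersion and no confounders, and the counts are invented for illustration; a real analysis would typically use Poisson regression with covariates such as age, gender and smoking habit.

```python
import math


def poisson_rate_ratio(exposed, control, z=1.96):
    """Rate ratio of MN frequency with a Wald confidence interval.

    Each group is a list of (mn_count, cells_scored) pairs, one per subject.
    Under a Poisson model the pooled rate is total MN / total cells.
    """
    mn_e = sum(count for count, _ in exposed)
    cells_e = sum(cells for _, cells in exposed)
    mn_c = sum(count for count, _ in control)
    cells_c = sum(cells for _, cells in control)

    ratio = (mn_e / cells_e) / (mn_c / cells_c)
    # Standard error of log(ratio) for two independent Poisson totals.
    se_log = math.sqrt(1 / mn_e + 1 / mn_c)
    lower = math.exp(math.log(ratio) - z * se_log)
    upper = math.exp(math.log(ratio) + z * se_log)
    return ratio, lower, upper


# Hypothetical data: micronuclei counted per 1000 cells scored in each subject.
exposed_group = [(12, 1000), (15, 1000), (9, 1000)]
control_group = [(5, 1000), (7, 1000), (6, 1000)]
ratio, lower, upper = poisson_rate_ratio(exposed_group, control_group)
```

A confidence interval that excludes 1 suggests a difference in MN frequency between the groups, subject to the assumptions stated above.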
Missing data imputation using statistical and machine learning methods in a real breast cancer problem.

Science.gov (United States)

Jerez, José M; Molina, Ignacio; García-Laencina, Pedro J; Alba, Emilio; Ribelles, Nuria; Martín, Miguel; Franco, Leonardo

2010-10-01

Missing data imputation is an important task in cases where it is crucial to use all available data and not discard records with missing values. This work evaluates the performance of several statistical and machine learning imputation methods that were used to predict recurrence in patients in an extensive real breast cancer data set. Imputation methods based on statistical techniques, e.g., mean, hot-deck and multiple imputation, and machine learning techniques, e.g., multi-layer perceptron (MLP), self-organisation maps (SOM) and k-nearest neighbour (KNN), were applied to data collected through the "El Álamo-I" project, and the results were then compared to those obtained from the listwise deletion (LD) imputation method. The database includes demographic, therapeutic and recurrence-survival information from 3679 women with operable invasive breast cancer diagnosed in 32 different hospitals belonging to the Spanish Breast Cancer Research Group (GEICAM). The accuracies of predictions on early cancer relapse were measured using artificial neural networks (ANNs), in which different ANNs were estimated using the data sets with imputed missing values. The imputation methods based on machine learning algorithms outperformed imputation statistical methods in the prediction of patient outcome. Friedman's test revealed a significant difference (p=0.0091) in the observed area under the ROC curve (AUC) values, and the pairwise comparison test showed that the AUCs for MLP, KNN and SOM were significantly higher (p=0.0053, p=0.0048 and p=0.0071, respectively) than the AUC from the LD-based prognosis model. The methods based on machine learning techniques were the most suited for the imputation of missing values and led to a significant enhancement of prognosis accuracy compared to imputation methods based on statistical procedures. Copyright © 2010 Elsevier B.V. All rights reserved.
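The contrast between a simple statistical imputation (column mean) and a machine learning one (k-nearest neighbour) can be sketched in a few lines of Python. This is an illustrative toy under stated assumptions, not the procedure used in the study: the function name, the distance definition and the records are hypothetical.

```python
import math

def knn_impute(rows, k=2):
    """Fill None entries with the mean of that column over the k nearest rows.

    Distance is the root mean squared difference over the columns that are
    observed in both rows; only rows with the target column observed can
    act as donors. Assumes every missing cell has at least one donor.
    """
    def distance(a, b):
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    completed = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            donors = [(distance(row, other), other[j])
                      for m, other in enumerate(rows)
                      if m != i and other[j] is not None]
            donors.sort(key=lambda pair: pair[0])
            nearest = [v for _, v in donors[:k]]
            completed[i][j] = sum(nearest) / len(nearest)
    return completed

# Hypothetical records; the last one is missing its third feature.
records = [
    [1.0, 2.0, 10.0],
    [1.1, 2.1, 12.0],
    [5.0, 6.0, 50.0],
    [5.2, 6.1, 54.0],
    [1.05, 2.05, None],
]
filled = knn_impute(records, k=2)
observed = [r[2] for r in records if r[2] is not None]
mean_fill = sum(observed) / len(observed)
print("kNN imputation:", filled[4][2], "| mean imputation:", mean_fill)
```

In this toy data the neighbour-based value stays close to the two most similar records, while the column mean is pulled toward the unrelated cluster.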
Sources of Safety Data and Statistical Strategies for Design and Analysis: Transforming Data Into Evidence.

Science.gov (United States)

Ma, Haijun; Russek-Cohen, Estelle; Izem, Rima; Marchenko, Olga V; Jiang, Qi

2018-03-01

Safety evaluation is a key aspect of medical product development. It is a continual and iterative process requiring thorough thinking, and dedicated time and resources. In this article, we discuss how safety data are transformed into evidence to establish and refine the safety profile of a medical product, and how the focus of safety evaluation, data sources, and statistical methods change throughout a medical product's life cycle. Some challenges and statistical strategies for medical product safety evaluation are discussed. Examples of safety issues identified in different periods, that is, premarketing and postmarketing, are discussed to illustrate how different sources are used in the safety signal identification and the iterative process of safety assessment. The examples highlighted range from commonly used pediatric vaccine given to healthy children to medical products primarily used to treat a medical condition in adults. These case studies illustrate that different products may require different approaches, and once a signal is discovered, it could impact future safety assessments. Many challenges still remain in this area despite advances in methodologies, infrastructure, public awareness, international harmonization, and regulatory enforcement. Innovations in safety assessment methodologies are pressing in order to make the medical product development process more efficient and effective, and the assessment of medical product marketing approval more streamlined and structured. Health care payers, providers, and patients may have different perspectives when weighing in on clinical, financial and personal needs when therapies are being evaluated.
Aortic Aneurysm Statistics

Science.gov (United States)

... Summary Coverdell Program 2012-2015 State Summaries Data & Statistics Fact Sheets Heart Disease and Stroke Fact Sheets ... Roadmap for State Planning Other Data Resources Other Statistic Resources Grantee Information Cross-Program Information Online Tools ...
A statistical power analysis of woody carbon flux from forest inventory data

Science.gov (United States)

James A. Westfall; Christopher W. Woodall; Mark A. Hatfield

2013-01-01

At a national scale, the carbon (C) balance of numerous forest ecosystem C pools can be monitored using a stock change approach based on national forest inventory data. Given the potential influence of disturbance events and/or climate change processes, the statistical detection of changes in forest C stocks is paramount to maintaining the net sequestration status of...
The bench scientist's guide to statistical analysis of RNA-Seq data

OpenAIRE

Yendrek, Craig R.; Ainsworth, Elizabeth A.; Thimmapuram, Jyothi

2012-01-01

Abstract Background RNA sequencing (RNA-Seq) is emerging as a highly accurate method to quantify transcript abundance. However, analyses of the large data sets obtained by sequencing the entire transcriptome of organisms have generally been performed by bioinformatics specialists. Here we provide a step-by-step guide and outline a strategy using currently available statistical tools that results in a conservative list of differentially expressed genes. We also discuss potential sources of err...
Automatic Derivation of Statistical Data Analysis Algorithms: Planetary Nebulae and Beyond

OpenAIRE

Fischer, Bernd; Knuth, Kevin; Hajian, Arsen; Schumann, Johann

2004-01-01

AUTOBAYES is a fully automatic program synthesis system for the data analysis domain. Its input is a declarative problem description in form of a statistical model; its output is documented and optimized C/C++ code. The synthesis process relies on the combination of three key techniques. Bayesian networks are used as a compact internal representation mechanism which enables problem decompositions and guides the algorithm derivation. Program schemas are used as independently composable buildin...
Statistical intercomparison of global climate models: A common principal component approach with application to GCM data

International Nuclear Information System (INIS)

Sengupta, S.K.; Boyle, J.S.

1993-05-01

Variables describing atmospheric circulation and other climate parameters derived from various GCMs and obtained from observations can be represented on a spatio-temporal grid (lattice) structure. The primary objective of this paper is to explore existing as well as some new statistical methods to analyze such data structures for the purpose of model diagnostics and intercomparison from a statistical perspective. Among the several statistical methods considered here, a new method based on common principal components appears most promising for the purpose of intercomparison of spatio-temporal data structures arising in the task of model/model and model/data intercomparison. A complete strategy for such an intercomparison is outlined. The strategy includes two steps. First, the commonality of spatial structures in two (or more) fields is captured in the common principal vectors. Second, the corresponding principal components obtained as time series are then compared on the basis of similarities in their temporal evolution
RBioplot: an easy-to-use R pipeline for automated statistical analysis and data visualization in molecular biology and biochemistry

Directory of Open Access Journals (Sweden)

Jing Zhang

2016-09-01

Background. Statistical analysis and data visualization are two crucial aspects in molecular biology and biology. For analyses that compare one dependent variable between standard (e.g., control) and one or multiple independent variables, a comprehensive yet highly streamlined solution is valuable. The computer programming language R is a popular platform for researchers to develop tools that are tailored specifically for their research needs. Here we present an R package RBioplot that takes raw input data for automated statistical analysis and plotting, highly compatible with various molecular biology and biochemistry lab techniques, such as, but not limited to, western blotting, PCR, and enzyme activity assays. Method. The package is built based on workflows operating on a simple raw data layout, with minimum user input or data manipulation required. The package is distributed through GitHub and can be easily installed through a single-line R command. A detailed installation guide is available at http://kenstoreylab.com/?page_id=2448. Users can also download demo datasets from the same website. Results and Discussion. By integrating selected functions from existing statistical and data visualization packages with extensive customization, RBioplot features both statistical analysis and data visualization functionalities. Key properties of RBioplot include: fully automated and comprehensive statistical analysis, including normality test, equal variance test, Student’s t-test and ANOVA (with post-hoc tests); fully automated histogram, heatmap and joint-point curve plotting modules; detailed output files for statistical analysis, data manipulation and high quality graphs; axis range finding and user customizable tick settings; high user-customizability.

Inferential statistics of electron backscatter diffraction data from within individual crystalline grains

DEFF Research Database (Denmark)

Bachmann, Florian; Hielscher, Ralf; Jupp, Peter E.

2010-01-01

Highly concentrated distributed crystallographic orientation measurements within individual crystalline grains are analysed by means of ordinary statistics neglecting their spatial reference. Since crystallographic orientations are modelled as left cosets of a given subgroup of SO(3), the non-spatial statistical analysis adapts ideas borrowed from the Bingham quaternion distribution on S³. Special emphasis is put on the mathematical definition and the numerical determination of a 'mean orientation' characterizing the crystallographic grain as well as on distinguishing several types of symmetry of the orientation distribution with respect to the mean orientation, like spherical, prolate or oblate symmetry. Applications to simulated as well as to experimental data are presented. All computations have been done with the free and open-source texture toolbox MTEX.
Methods in pharmacoepidemiology: a review of statistical analyses and data reporting in pediatric drug utilization studies.

Science.gov (United States)

Sequi, Marco; Campi, Rita; Clavenna, Antonio; Bonati, Maurizio

2013-03-01

To evaluate the quality of data reporting and statistical methods performed in drug utilization studies in the pediatric population. Drug utilization studies evaluating all drug prescriptions to children and adolescents published between January 1994 and December 2011 were retrieved and analyzed. For each study, information on measures of exposure/consumption, the covariates considered, descriptive and inferential analyses, statistical tests, and methods of data reporting was extracted. An overall quality score was created for each study using a 12-item checklist that took into account the presence of outcome measures, covariates of measures, descriptive measures, statistical tests, and graphical representation. A total of 22 studies were reviewed and analyzed. Of these, 20 studies reported at least one descriptive measure. The mean was the most commonly used measure (18 studies), but only five of these also reported the standard deviation. Statistical analyses were performed in 12 studies, with the chi-square test being the most commonly performed test. Graphs were presented in 14 papers. Sixteen papers reported the number of drug prescriptions and/or packages, and ten reported the prevalence of the drug prescription. The mean quality score was 8 (median 9). Only seven of the 22 studies received a score of ≥10, while four studies received a score of statistical methods and reported data in a satisfactory manner. We therefore conclude that the methodology of drug utilization studies needs to be improved.
75 FR 39265 - National Center for Health Statistics (NCHS), Classifications and Public Health Data Standards...

Science.gov (United States)

2010-07-08

... DEPARTMENT OF HEALTH AND HUMAN SERVICES Centers for Disease Control and Prevention National Center for Health Statistics (NCHS), Classifications and Public Health Data Standards Staff, Announces the... Prevention, Classifications and Public Health Data Standards, 3311 Toledo Road, Room 2337, Hyattsville, MD...
78 FR 53148 - National Center for Health Statistics (NCHS), Classifications and Public Health Data Standards...

Science.gov (United States)

2013-08-28

... DEPARTMENT OF HEALTH AND HUMAN SERVICES Centers for Disease Control and Prevention National Center for Health Statistics (NCHS), Classifications and Public Health Data Standards Staff, Announces the... Administrator, Classifications and Public Health Data Standards Staff, NCHS, 3311 Toledo Road, Room 2337...
78 FR 9055 - National Center for Health Statistics (NCHS), Classifications and Public Health Data Standards...

Science.gov (United States)

2013-02-07

... DEPARTMENT OF HEALTH AND HUMAN SERVICES Centers for Disease Control and Prevention National Center for Health Statistics (NCHS), Classifications and Public Health Data Standards Staff, Announces the..., Medical Systems Administrator, Classifications and Public Health Data Standards Staff, NCHS, 3311 Toledo...
Statistical Literacy: High School Students in Reading, Interpreting and Presenting Data

Science.gov (United States)

Hafiyusholeh, M.; Budayasa, K.; Siswono, T. Y. E.

2018-01-01

One of the foundations of statistics for high school students is the ability to read data, to present data in the form of tables and diagrams, and to interpret them. The purpose of this study is to describe high school students’ competencies in reading, interpreting and presenting data. The subjects were male and female students with high levels of mathematical ability. Data were collected through task formulation and analyzed by reducing, presenting and verifying the data. Results showed that the students read the data based on explicit explanations on the diagram, such as explaining the points in the diagram as the relation between the x and y axes and determining the simple trend of a graph, including the maximum and minimum points. In interpreting and summarizing the data, both subjects paid attention to general data trends and used them to predict increases or decreases in the data. The male subject estimated the (n+1)th value of the weight data using the mode of the data, while the female subject estimated the weight using the average. The male subject tended not to consider the characteristics of the data, while the female subject considered them more carefully.
Inference of missing data and chemical model parameters using experimental statistics

Science.gov (United States)

Casey, Tiernan; Najm, Habib

2017-11-01

A method for determining the joint parameter density of Arrhenius rate expressions through the inference of missing experimental data is presented. This approach proposes noisy hypothetical data sets from target experiments and accepts those which agree with the reported statistics, in the form of nominal parameter values and their associated uncertainties. The data exploration procedure is formalized using Bayesian inference, employing maximum entropy and approximate Bayesian computation methods to arrive at a joint density on data and parameters. The method is demonstrated in the context of reactions in the H2-O2 system for predictive modeling of combustion systems of interest. Work supported by the US DOE BES CSGB. Sandia National Labs is a multimission lab managed and operated by Nat. Technology and Eng'g Solutions of Sandia, LLC., a wholly owned subsidiary of Honeywell Intl, for the US DOE NNSA under contract DE-NA-0003525.
How to interpret the results of medical time series data analysis: Classical statistical approaches versus dynamic Bayesian network modeling.

Science.gov (United States)

Onisko, Agnieszka; Druzdzel, Marek J; Austin, R Marshall

2016-01-01

Classical statistics is a well-established approach in the analysis of medical data. While the medical community seems to be familiar with the concept of a statistical analysis and its interpretation, the Bayesian approach, argued by many of its proponents to be superior to the classical frequentist approach, is still not well-recognized in the analysis of medical data. The goal of this study is to encourage data analysts to use the Bayesian approach, such as modeling with graphical probabilistic networks, as an insightful alternative to classical statistical analysis of medical data. This paper offers a comparison of two approaches to analysis of medical time series data: (1) the classical statistical approach, such as the Kaplan-Meier estimator and the Cox proportional hazards regression model, and (2) dynamic Bayesian network modeling. Our comparison is based on time series cervical cancer screening data collected at Magee-Womens Hospital, University of Pittsburgh Medical Center over 10 years. The main outcomes of our comparison are cervical cancer risk assessments produced by the three approaches. However, our analysis also discusses several aspects of the comparison, such as modeling assumptions, model building, dealing with incomplete data, individualized risk assessment, results interpretation, and model validation. Our study shows that the Bayesian approach (1) is much more flexible in terms of modeling effort and (2) offers an individualized risk assessment, which is more cumbersome for classical statistical approaches.
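For readers unfamiliar with the classical side of this comparison, the Kaplan-Meier estimator can be written compactly. The Python sketch below is a generic textbook version with hypothetical follow-up times; it is not the authors' analysis and uses none of the screening data described above.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time with at least one event.

    times  : follow-up time per subject
    events : 1 if the event was observed at that time, 0 if censored
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        observed = 0   # events at time t
        removed = 0    # events plus censorings at time t
        while i < len(data) and data[i][0] == t:
            observed += data[i][1]
            removed += 1
            i += 1
        if observed:
            # Product-limit step: multiply by the conditional survival at t.
            survival *= 1.0 - observed / at_risk
            curve.append((t, survival))
        at_risk -= removed
    return curve

# Hypothetical follow-up in years; 0 marks a censored subject.
times = [2, 3, 3, 5, 8, 9]
events = [1, 1, 0, 1, 0, 1]
curve = kaplan_meier(times, events)
for t, s in curve:
    print(f"t = {t}: S(t) = {s:.3f}")
```

The estimate is a single population-level curve, which is one reason individualized risk assessment takes extra modeling effort in the classical setting.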
Statistical Pattern Recognition

CERN Document Server

Webb, Andrew R

2011-01-01

Statistical pattern recognition relates to the use of statistical techniques for analysing data measurements in order to extract information and make justified decisions. It is a very active area of study and research, which has seen many advances in recent years. Applications such as data mining, web searching, multimedia data retrieval, face recognition, and cursive handwriting recognition, all require robust and efficient pattern recognition techniques. This third edition provides an introduction to statistical pattern theory and techniques, with material drawn from a wide range of fields,
Statistics in Schools

Science.gov (United States)

Information Statistics in Schools Educate your students about the value and everyday use of statistics. The Statistics in Schools program provides resources for teaching and learning with real life data. Explore the site for standards-aligned, classroom-ready activities. Statistics in Schools Math Activities History
Uterine Cancer Statistics

Science.gov (United States)

... Doing AMIGAS Stay Informed Cancer Home Uterine Cancer Statistics Language: English (US) Español (Spanish) Recommend on Facebook ... the most commonly diagnosed gynecologic cancer. U.S. Cancer Statistics Data Visualizations Tool The Data Visualizations tool makes ...
A comparison of Probability Of Detection (POD) data determined using different statistical methods

Science.gov (United States)

Fahr, A.; Forsyth, D.; Bullock, M.

1993-12-01

Different statistical methods have been suggested for determining probability of detection (POD) data for nondestructive inspection (NDI) techniques. A comparative assessment of various methods of determining POD was conducted using results of three NDI methods obtained by inspecting actual aircraft engine compressor disks which contained service induced cracks. The study found that the POD and 95 percent confidence curves as a function of crack size as well as the 90/95 percent crack length vary depending on the statistical method used and the type of data. The distribution function as well as the parameter estimation procedure used for determining POD and the confidence bound must be included when referencing information such as the 90/95 percent crack length. The POD curves and confidence bounds determined using the range interval method are very dependent on information that is not from the inspection data. The maximum likelihood estimators (MLE) method does not require such information and the POD results are more reasonable. The log-logistic function appears to model POD of hit/miss data relatively well and is easy to implement. The log-normal distribution using MLE provides more realistic POD results and is the preferred method. Although it is more complicated and slower to calculate, it can be implemented on a common spreadsheet program.
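A minimal Python sketch of the log-logistic (log-odds) hit/miss model fitted by maximum likelihood follows, assuming POD(a) = 1 / (1 + exp(-(alpha + beta ln a))). The crack sizes and hit/miss outcomes are hypothetical, the fit uses plain Newton-Raphson, and only point estimates are produced: the 95 percent confidence bound needed for a 90/95 crack length is not computed here.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_log_odds_pod(sizes, hits, iterations=50, tol=1e-10):
    """Maximum-likelihood fit of alpha, beta by Newton-Raphson."""
    xs = [math.log(a) for a in sizes]
    alpha = beta = 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, hits):
            p = sigmoid(alpha + beta * x)
            w = p * (1.0 - p)
            g0 += y - p            # score for alpha
            g1 += (y - p) * x      # score for beta
            h00 += w               # information matrix entries
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        step_a = (h11 * g0 - h01 * g1) / det
        step_b = (h00 * g1 - h01 * g0) / det
        alpha += step_a
        beta += step_b
        if abs(step_a) + abs(step_b) < tol:
            break
    return alpha, beta

def pod(a, alpha, beta):
    return sigmoid(alpha + beta * math.log(a))

# Hypothetical crack sizes (mm) and hit (1) / miss (0) outcomes.
sizes = [0.2, 0.3, 0.4, 0.5, 0.6, 0.8, 1.0, 1.2, 1.5, 2.0, 2.5, 3.0]
hits = [0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1]
alpha, beta = fit_log_odds_pod(sizes, hits)
a50 = math.exp(-alpha / beta)                   # size detected half the time
a90 = math.exp((math.log(9.0) - alpha) / beta)  # size with POD = 0.9
print(f"a50 = {a50:.2f} mm, a90 = {a90:.2f} mm")
```

The data deliberately include a small crack that was found and larger ones that were missed; with perfectly separated hits and misses the maximum-likelihood estimate does not exist and the iteration would not settle.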
Characterizing and Addressing the Need for Statistical Adjustment of Global Climate Model Data

Science.gov (United States)

White, K. D.; Baker, B.; Mueller, C.; Villarini, G.; Foley, P.; Friedman, D.

2017-12-01

As part of its mission to research and measure the effects of the changing climate, the U.S. Army Corps of Engineers (USACE) regularly uses the World Climate Research Programme's Coupled Model Intercomparison Project Phase 5 (CMIP5) multi-model dataset. However, these data are generated at a global level and are not fine-tuned for specific watersheds. This often causes CMIP5 output to vary from locally observed patterns in the climate. Several downscaling methods have been developed to increase the resolution of the CMIP5 data and decrease systemic differences to support decision-makers as they evaluate results at the watershed scale. Evaluating preliminary comparisons of observed and projected flow frequency curves over the US revealed a simple framework for water resources decision makers to plan and design water resources management measures under changing conditions using standard tools. Using this framework as a basis, USACE has begun to explore the use of statistical adjustment to alter global climate model data to better match the locally observed patterns while preserving the general structure and behavior of the model data. When paired with careful measurement and hypothesis testing, statistical adjustment can be particularly effective at navigating the compromise between the locally observed patterns and the global climate model structures for decision makers.
Data exploration, quality control and statistical analysis of ChIP-exo/nexus experiments.

Science.gov (United States)

Welch, Rene; Chung, Dongjun; Grass, Jeffrey; Landick, Robert; Keles, Sündüz

2017-09-06

ChIP-exo/nexus experiments rely on innovative modifications of the commonly used ChIP-seq protocol for high resolution mapping of transcription factor binding sites. Although many aspects of the ChIP-exo data analysis are similar to those of ChIP-seq, these high throughput experiments pose a number of unique quality control and analysis challenges. We develop a novel statistical quality control pipeline and accompanying R/Bioconductor package, ChIPexoQual, to enable exploration and analysis of ChIP-exo and related experiments. ChIPexoQual evaluates a number of key issues including strand imbalance, library complexity, and signal enrichment of data. Assessment of these features is facilitated through diagnostic plots and summary statistics computed over regions of the genome with varying levels of coverage. We evaluated our QC pipeline with both large collections of public ChIP-exo/nexus data and multiple, new ChIP-exo datasets from Escherichia coli. ChIPexoQual analysis of these datasets resulted in guidelines for using these QC metrics across a wide range of sequencing depths and provided further insights for modelling ChIP-exo data. © The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research.
Automation method to identify the geological structure of seabed using spatial statistic analysis of echo sounding data

Science.gov (United States)

Kwon, O.; Kim, W.; Kim, J.

2017-12-01

Recently, the construction of subsea tunnels has increased globally. For the safe construction of a subsea tunnel, identifying the geological structure, including faults, at the design and construction stages is critically important. Unlike tunnels on land, however, it is very difficult to obtain data on geological structure because of the limits of geological surveys. This study addresses these difficulties by developing technology to identify the geological structure of the seabed automatically from echo sounding data. When investigating a potential site for a deep subsea tunnel, borehole or geophysical investigation faces technical and economic limits. By contrast, echo sounding data are easily obtainable, and their information reliability is higher than that of the above approaches. This study aims to develop an algorithm that identifies large-scale geological structure of the seabed using a geostatistical approach. It is based on the theory from structural geology that topographic features indicate geological structure. The basic concept of the algorithm is as follows: (1) convert the seabed topography to grid data using echo sounding data, (2) apply a moving window of optimal size to the grid data, (3) estimate the spatial statistics of the grid data in the window area, (4) set the percentile standard of the spatial statistics, (5) display the values satisfying the standard on the map, (6) visualize the geological structure on the map. The important elements in this study include the optimal size of the moving window, the optimal kinds of spatial statistics, and the optimal percentile standard. To determine these optimal elements, numerous simulations were run. Finally, a user program based on R was developed using the optimal analysis algorithm. The user program was designed to identify the variations of various spatial statistics, which makes it easy to analyze geological structure from the variation of spatial statistics.
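Steps (2) to (5) of the algorithm outlined above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the authors' R program: the window size, the choice of local standard deviation as the spatial statistic, the 70th-percentile standard and the depth grid are all hypothetical.

```python
import statistics

def window_std(grid, half=1):
    """Standard deviation of depth in a (2*half+1)-cell square moving window.

    Windows are clipped at the grid edges.
    """
    rows, cols = len(grid), len(grid[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            values = [grid[i][j]
                      for i in range(max(0, r - half), min(rows, r + half + 1))
                      for j in range(max(0, c - half), min(cols, c + half + 1))]
            out[r][c] = statistics.pstdev(values)
    return out

def flag_above_percentile(stat, percentile=70):
    """Flag cells whose statistic reaches the nearest-rank percentile."""
    flat = sorted(v for row in stat for v in row)
    rank = max(1, (percentile * len(flat) + 99) // 100)  # integer ceiling
    threshold = flat[rank - 1]
    return [[v >= threshold for v in row] for row in stat], threshold

# Hypothetical gridded depths (m) with a 10 m step between columns 2 and 3.
depth = [[-50.0] * 3 + [-60.0] * 3 for _ in range(5)]
roughness = window_std(depth, half=1)
flags, threshold = flag_above_percentile(roughness, percentile=70)
for row in flags:
    print("".join("#" if cell else "." for cell in row))
```

The printed map marks the two columns that straddle the step, which is the kind of linear feature the percentile standard is meant to bring out; real bathymetry would need the window size and percentile tuned, as the abstract notes.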
75 FR 56549 - National Center for Health Statistics (NCHS), Classifications and Public Health Data Standards...

Science.gov (United States)

2010-09-16

... DEPARTMENT OF HEALTH AND HUMAN SERVICES Centers for Disease Control and Prevention National Center for Health Statistics (NCHS), Classifications and Public Health Data Standards Staff, Announces the... Public Health Data Standards Staff, NCHS, 3311 Toledo Road, Room 2337, Hyattsville, Maryland 20782, e...
Paired preference data with a no-preference option – Statistical tests for comparison with placebo data

DEFF Research Database (Denmark)

Christensen, Rune Haubo Bojesen; Ennis, John M.; Ennis, Daniel M.

2014-01-01

/preference responses or ties in choice experiments. Food Quality and Preference, 23, 13–17) noted that this proportion can depend on the product category, have proposed that the expected proportion of preference responses within a given category be called an identicality norm, and have argued that knowledge...... of such norms is valuable for more complete interpretation of 2-Alternative Choice (2-AC) data. For instance, these norms can be used to indicate consumer segmentation even with non-replicated data. In this paper, we show that the statistical test suggested by Ennis and Ennis (2012a) behaves poorly and has too...... when ingredient changes are considered for cost-reduction or health initiative purposes....
A novel genome-information content-based statistic for genome-wide association analysis designed for next-generation sequencing data.

Science.gov (United States)

Luo, Li; Zhu, Yun; Xiong, Momiao

2012-06-01

The genome-wide association studies (GWAS) designed for next-generation sequencing data involve testing association of genomic variants, including common, low frequency, and rare variants. The current strategies for association studies are well developed for identifying association of common variants with the common diseases, but may be ill-suited when large amounts of allelic heterogeneity are present in sequence data. Recently, group tests that analyze their collective frequency differences between cases and controls shift the current variant-by-variant analysis paradigm for GWAS of common variants to the collective test of multiple variants in the association analysis of rare variants. However, group tests ignore differences in genetic effects among SNPs at different genomic locations. As an alternative to group tests, we developed a novel genome-information content-based statistic for testing association of the entire allele frequency spectrum of genomic variation with the diseases. To evaluate the performance of the proposed statistic, we use large-scale simulations based on whole genome low coverage pilot data in the 1000 Genomes Project to calculate the type 1 error rates and power of seven alternative statistics: a genome-information content-based statistic, the generalized T², collapsing method, multivariate and collapsing (CMC) method, individual χ² test, weighted-sum statistic, and variable threshold statistic. Finally, we apply the seven statistics to a published resequencing dataset from the ANGPTL3, ANGPTL4, ANGPTL5, and ANGPTL6 genes in the Dallas Heart Study. We report that the genome-information content-based statistic has significantly improved type 1 error rates and higher power than the other six statistics in both simulated and empirical datasets.
Statistical behavior of foreshock Langmuir waves observed by the Cluster wideband data plasma wave receiver

Directory of Open Access Journals (Sweden)

K. Sigsbee

2004-07-01

We present the statistics of Langmuir wave amplitudes in the Earth's foreshock using Cluster Wideband Data (WBD) Plasma Wave Receiver electric field waveforms from spacecraft 2, 3 and 4 on 26 March 2002. The largest amplitude Langmuir waves were observed by Cluster near the boundary between the foreshock and solar wind, in agreement with earlier studies. The characteristics of the waves were similar for all three spacecraft, suggesting that variations in foreshock structure must occur on scales greater than the 50-100 km spacecraft separations. The electric field amplitude probability distributions constructed using waveforms from the Cluster WBD Plasma Wave Receiver generally followed the log-normal statistics predicted by stochastic growth theory for the event studied. Comparison with WBD receiver data from 17 February 2002, when spacecraft 4 was set in a special manual gain mode, suggests non-optimal auto-ranging of the instrument may have had some influence on the statistics.
Use of multivariate statistics to identify unreliable data obtained using CASA.

Science.gov (United States)

Martínez, Luis Becerril; Crispín, Rubén Huerta; Mendoza, Maximino Méndez; Gallegos, Oswaldo Hernández; Martínez, Andrés Aragón

2013-06-01

In order to identify unreliable data in a dataset of motility parameters obtained from a pilot study acquired by a veterinarian with experience in boar semen handling, but without experience in the operation of a computer assisted sperm analysis (CASA) system, a multivariate graphical and statistical analysis was performed. Sixteen boar semen samples were aliquoted then incubated with varying concentrations of progesterone from 0 to 3.33 µg/ml and analyzed in a CASA system. After standardization of the data, Chernoff faces were drawn for each measurement, and a principal component analysis (PCA) was used to reduce the dimensionality and pre-process the data before hierarchical clustering. The first twelve individual measurements showed abnormal features when Chernoff faces were drawn. PCA revealed that principal components 1 and 2 explained 63.08% of the variance in the dataset. Values of principal components for each individual measurement of semen samples were mapped to identify differences among treatments or among boars. Twelve individual measurements presented low values of principal component 1. Confidence ellipses on the map of principal components showed no statistically significant effects for treatment or boar. Hierarchical clustering performed on the first two principal components produced three clusters. Cluster 1 contained evaluations of the first two samples in each treatment, each one from a different boar. With the exception of one individual measurement, all other measurements in cluster 1 were the same as those observed in abnormal Chernoff faces. Unreliable data in cluster 1 are probably related to the operator's inexperience with a CASA system. These findings could be used to objectively evaluate the skill level of an operator of a CASA system. This may be particularly useful in the quality control of semen analysis using CASA systems.
Mortality variation across Australia: descriptive data for states and territories, and statistical divisions.

Science.gov (United States)

Wilkinson, D; Hiller, J; Moss, J; Ryan, P; Worsley, T

2000-06-01

To describe variation in all cause and selected cause-specific mortality rates across Australia. Mortality and population data for 1997 were obtained from the Australian Bureau of Statistics. All cause and selected cause-specific mortality rates were calculated and directly standardised to the 1997 Australian population in 5-year age groups. Selected major causes of death included cancer, coronary artery disease, cerebrovascular disease, diabetes, accidents and suicide. Rates are reported by statistical division, and State and Territory. All cause age-standardised mortality was 6.98 per 1000 in 1997 and this varied 2-fold from a low in the statistical division of Pilbara, Western Australia (5.78, 95% confidence interval 5.06-6.56), to a high in Northern Territory--excluding Darwin (11.30, 10.67-11.98). Similar mortality variation (all p killers. Larger variation (all p suicide (0.6-3.8 per 10,000). Less marked variation was observed when analysed by State and Territory, but Northern Territory consistently has the highest age-standardised mortality rates. Analysed by statistical division, substantial mortality gradients exist across Australia, suggesting an inequitable distribution of the determinants of health. Further research is required to better understand this heterogeneity.
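The direct standardisation described in this abstract can be sketched as follows. This is a minimal illustration only: the three age bands and every number are invented (the study used 5-year age groups and 1997 Australian data), and the function name is hypothetical.

```python
def directly_standardised_rate(deaths, population, standard_population, per=1000):
    """Direct age standardisation: weight each age-specific death rate by the
    standard population's share of that age group, then sum."""
    if not (len(deaths) == len(population) == len(standard_population)):
        raise ValueError("all inputs need one entry per age group")
    total_standard = sum(standard_population)
    rate = 0.0
    for d, n, w in zip(deaths, population, standard_population):
        rate += (d / n) * (w / total_standard)
    return rate * per


# Hypothetical region with three broad age groups (illustrative numbers only).
deaths = [10, 40, 300]
population = [20000, 15000, 5000]
standard = [40000, 40000, 20000]

asr = directly_standardised_rate(deaths, population, standard)
crude = sum(deaths) / sum(population) * 1000
print(f"crude rate: {crude:.2f} per 1000, age-standardised: {asr:.2f} per 1000")
```

The crude and standardised rates differ here because the hypothetical region is younger than the standard population, which is the distortion that standardisation is meant to remove when regions are compared.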
Official Statistics and Statistics Education: Bridging the Gap

Directory of Open Access Journals (Sweden)

Gal Iddo

2017-03-01

This article aims to challenge official statistics providers and statistics educators to ponder on how to help non-specialist adult users of statistics develop those aspects of statistical literacy that pertain to official statistics. We first document the gap in the literature in terms of the conceptual basis and educational materials needed for such an undertaking. We then review skills and competencies that may help adults to make sense of statistical information in areas of importance to society. Based on this review, we identify six elements related to official statistics about which non-specialist adult users should possess knowledge in order to be considered literate in official statistics: (1) the system of official statistics and its work principles; (2) the nature of statistics about society; (3) indicators; (4) statistical techniques and big ideas; (5) research methods and data sources; and (6) awareness and skills for citizens’ access to statistical reports. Based on this ad hoc typology, we discuss directions that official statistics providers, in cooperation with statistics educators, could take in order to (1) advance the conceptualization of skills needed to understand official statistics, and (2) expand educational activities and services, specifically by developing a collaborative digital textbook and a modular online course, to improve public capacity for understanding of official statistics.
Infodemiological data concerning silicosis in the USA in the period 2004–2010 correlating with real-world statistical data

Directory of Open Access Journals (Sweden)

Nicola Luigi Bragazzi

2017-02-01

This article reports data concerning silicosis-related web-activities using Google Trends (GT) to capture Internet behavior in the USA for the period 2004–2010. GT-generated data were then compared with the most recent available epidemiological data of silicosis mortality obtained from the Centers for Disease Control and Prevention for the same study period. Statistically significant correlations with epidemiological data of silicosis (r=0.805, p-value <0.05) and other related web searches were found. The temporal trend correlated well with the epidemiological data, as did the geospatial distribution of the web-activities with the geographic epidemiology of silicosis.
Analyzing Planck and low redshift data sets with advanced statistical methods

Science.gov (United States)

Eifler, Tim

The recent ESA/NASA Planck mission has provided a key data set to constrain cosmology that is most sensitive to physics of the early Universe, such as inflation and primordial non-Gaussianity (Planck 2015 results XIII). In combination with cosmological probes of the Large-Scale Structure (LSS), the Planck data set is a powerful source of information to investigate late time phenomena (Planck 2015 results XIV), e.g. the accelerated expansion of the Universe, the impact of baryonic physics on the growth of structure, and the alignment of galaxies in their dark matter halos. It is the main objective of this proposal to re-analyze the archival Planck data, 1) with different, more recently developed statistical methods for cosmological parameter inference, and 2) to combine Planck and ground-based observations in an innovative way. We will make the corresponding analysis framework publicly available and believe that it will set a new standard for future CMB-LSS analyses. Advanced statistical methods, such as the Gibbs sampler (Jewell et al 2004, Wandelt et al 2004) have been critical in the analysis of Planck data. More recently, Approximate Bayesian Computation (ABC, see Weyant et al 2012, Akeret et al 2015, Ishida et al 2015, for cosmological applications) has matured into an interesting tool in cosmological likelihood analyses. It circumvents several assumptions that enter the standard Planck (and most LSS) likelihood analyses, most importantly, the assumption that the functional form of the likelihood of the CMB observables is a multivariate Gaussian. Beyond applying new statistical methods to Planck data in order to cross-check and validate existing constraints, we plan to combine Planck and DES data in a new and innovative way and run multi-probe likelihood analyses of CMB and LSS observables. The complexity of multiprobe likelihood analyses scales (non-linearly) with the level of correlations amongst the individual probes that are included. For the multi
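To illustrate how Approximate Bayesian Computation sidesteps an explicit likelihood, here is a minimal rejection-ABC sketch on a toy problem (inferring the mean of a Gaussian). It is not the proposal's analysis pipeline: the simulator, prior, summary statistic, tolerance and all names are assumptions chosen for illustration.

```python
import random
import statistics


def abc_rejection(observed, prior_sampler, simulator, distance, tolerance, n_draws, rng):
    """Rejection ABC: keep prior draws whose simulated summary statistic
    lands within `tolerance` of the observed summary statistic."""
    accepted = []
    for _ in range(n_draws):
        theta = prior_sampler(rng)
        simulated = simulator(theta, rng)
        if distance(simulated, observed) <= tolerance:
            accepted.append(theta)
    return accepted


rng = random.Random(0)

# Toy "observation": 50 draws from a unit-variance Gaussian with unknown mean.
true_mean = 2.0
data = [rng.gauss(true_mean, 1.0) for _ in range(50)]
obs_summary = statistics.fmean(data)  # summary statistic: the sample mean

posterior = abc_rejection(
    observed=obs_summary,
    prior_sampler=lambda r: r.uniform(-5.0, 5.0),  # flat prior (assumption)
    simulator=lambda theta, r: statistics.fmean([r.gauss(theta, 1.0) for _ in range(50)]),
    distance=lambda a, b: abs(a - b),
    tolerance=0.1,
    n_draws=5000,
    rng=rng,
)
posterior_mean = statistics.fmean(posterior)
print(f"accepted {len(posterior)} draws, posterior mean ~ {posterior_mean:.2f}")
```

The only model-specific ingredient is the forward simulator, so no functional form for the likelihood is ever written down. The price is the choice of summary statistic and tolerance, which controls the quality of the approximation.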
THE GROWTH POINTS OF STATISTICAL METHODS

OpenAIRE

Orlov A. I.

2014-01-01

On the basis of a new paradigm of applied mathematical statistics, data analysis and economic-mathematical methods are identified; we have also discussed five topical areas in which modern applied statistics is developing as well as the other statistical methods, i.e. five "growth points" – nonparametric statistics, robustness, computer-statistical methods, statistics of interval data, statistics of non-numeric data
Statistically based reevaluation of PISC-II round robin test data

International Nuclear Information System (INIS)

Heasler, P.G.; Taylor, T.T.; Doctor, S.R.

1993-05-01

This report presents a re-analysis of international PISC-II (Programme for Inspection of Steel Components, Phase 2) round-robin inspection results using formal statistical techniques to account for experimental error. The analysis examines US team performance vs. other participants' performance, flaw sizing performance and errors associated with flaw sizing, factors influencing flaw detection probability, performance of all participants with respect to recently adopted ASME Section 11 flaw detection performance demonstration requirements, and develops conclusions concerning ultrasonic inspection capability. Inspection data were gathered on four heavy section steel components which included two plates and two nozzle configurations.
NEW PARADIGM OF ANALYSIS OF STATISTICAL AND EXPERT DATA IN PROBLEMS OF ECONOMICS AND MANAGEMENT

OpenAIRE

Orlov A. I.

2014-01-01

The article is devoted to the methods of analysis of statistical and expert data in problems of economics and management that are discussed in the framework of scientific specialization "Mathematical methods of economy", including organizational-economic and economic-mathematical modeling, econometrics and statistics, as well as economic aspects of decision theory, systems analysis, cybernetics, operations research. The main provisions of the new paradigm of this scientific and practical fiel...
Statistical analysis of fatigue strain-life data for carbon and low-alloy steels

International Nuclear Information System (INIS)

Keisler, J.; Chopra, O.K.; Shack, W.J.

1994-08-01

The existing fatigue strain vs. life (S-N) data, foreign and domestic, for carbon and low-alloy steels used in the construction of nuclear power plant components have been compiled and categorized according to material, loading, and environmental conditions. A statistical model has been developed for estimating the effects of the various test conditions on fatigue life. The results of a rigorous statistical analysis have been used to estimate the probability of initiating a fatigue crack. Data in the literature were reviewed to evaluate the effects of size, geometry, and surface finish of a component on its fatigue life. The fatigue S-N curves for components have been determined by applying design margins for size, geometry, and surface finish to crack initiation curves estimated from the model. The significance of the effect of environment on the current Code design curve and on the proposed interim design curves for carbon and low-alloy steels presented in NUREG/CR-5999 is discussed
Statistics, data mining, and machine learning in astronomy a practical Python guide for the analysis of survey data

CERN Document Server

Ivezic, Željko; VanderPlas, Jacob T; Gray, Alexander

2014-01-01

As telescopes, detectors, and computers grow ever more powerful, the volume of data at the disposal of astronomers and astrophysicists will enter the petabyte domain, providing accurate measurements for billions of celestial objects. This book provides a comprehensive and accessible introduction to the cutting-edge statistical methods needed to efficiently analyze complex data sets from astronomical surveys such as the Panoramic Survey Telescope and Rapid Response System, the Dark Energy Survey, and the upcoming Large Synoptic Survey Telescope. It serves as a practical handbook for graduate s
The ‘39 steps’: an algorithm for performing statistical analysis of data on energy intake and expenditure

Directory of Open Access Journals (Sweden)

John R. Speakman

2013-03-01

The epidemics of obesity and diabetes have aroused great interest in the analysis of energy balance, with the use of organisms ranging from nematode worms to humans. Although generating energy-intake or -expenditure data is relatively straightforward, the most appropriate way to analyse the data has been an issue of contention for many decades. In the last few years, a consensus has been reached regarding the best methods for analysing such data. To facilitate using these best-practice methods, we present here an algorithm that provides a step-by-step guide for analysing energy-intake or -expenditure data. The algorithm can be used to analyse data from either humans or experimental animals, such as small mammals or invertebrates. It can be used in combination with any commercial statistics package; however, to assist with analysis, we have included detailed instructions for performing each step for three popular statistics packages (SPSS, MINITAB and R). We also provide interpretations of the results obtained at each step. We hope that this algorithm will assist in the statistically appropriate analysis of such data, a field in which there has been much confusion and some controversy.
A high-resolution open biomass burning emission inventory based on statistical data and MODIS observations in mainland China

Science.gov (United States)

Xu, Y.; Fan, M.; Huang, Z.; Zheng, J.; Chen, L.

2017-12-01

Open biomass burning, which has adverse effects on air quality and human health, is an important source of gas and particulate matter (PM) in China. Current emission estimations of open biomass burning are generally based on a single source (either statistical data or satellite-derived data) and thus contain large uncertainty due to the limitation of data. In this study, to quantify the 2015-based amount of open biomass burning, we established a new estimation method for open biomass burning activity levels by combining the bottom-up statistical data and top-down MODIS observations. Three sub-category sources that used different activity data were considered. For open crop residue burning, the "best estimate" of activity data was obtained by averaging the statistical data from China statistical yearbooks and satellite observations from MODIS burned area product MCD64A1 weighted by their uncertainties. For the forest and grassland fires, their activity levels were represented by the combination of statistical data and MODIS active fire product MCD14ML. Using the fire radiative power (FRP), which is considered a better indicator of active fire level, as the spatial allocation surrogate, coarse gridded emissions were reallocated into 3 km × 3 km grids to get a high-resolution emission inventory. Our results showed that emissions of CO, NOx, SO2, NH3, VOCs, PM2.5, PM10, BC and OC in mainland China were 6607, 427, 84, 79, 1262, 1198, 1222, 159 and 686 Gg/yr, respectively. Among all provinces of China, Henan, Shandong and Heilongjiang were the top three contributors to the total emissions. The high-resolution open biomass burning emission inventory developed in this study could support air quality modeling and policy-making for pollution control.
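The abstract does not spell out how the two activity estimates are "weighted by their uncertainties". One common choice is inverse-variance weighting, sketched below. The weighting scheme, the function name and the numbers are all assumptions for illustration, not the authors' documented method.

```python
def inverse_variance_mean(estimates, sigmas):
    """Combine independent estimates of the same quantity, weighting each
    by 1 / sigma**2. Returns the combined estimate and its standard error."""
    if len(estimates) != len(sigmas) or not estimates:
        raise ValueError("need one positive sigma per estimate")
    weights = [1.0 / s ** 2 for s in sigmas]
    total = sum(weights)
    mean = sum(w * x for w, x in zip(weights, estimates)) / total
    return mean, (1.0 / total) ** 0.5


# Hypothetical burned-residue activity (arbitrary units):
# yearbook-based estimate 100 +/- 10, satellite-based estimate 140 +/- 20.
best, sigma = inverse_variance_mean([100.0, 140.0], [10.0, 20.0])
print(f"best estimate: {best:.1f} +/- {sigma:.2f}")
```

Under this scheme the more certain source dominates, and the combined uncertainty is smaller than either input, provided the two estimates really are independent.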
Meta- and statistical analysis of single-case intervention research data: quantitative gifts and a wish list.

Science.gov (United States)

Kratochwill, Thomas R; Levin, Joel R

2014-04-01

In this commentary, we add to the spirit of the articles appearing in the special series devoted to meta- and statistical analysis of single-case intervention-design data. Following a brief discussion of historical factors leading to our initial involvement in statistical analysis of such data, we discuss: (a) the value added by including statistical-analysis recommendations in the What Works Clearinghouse Standards for single-case intervention designs; (b) the importance of visual analysis in single-case intervention research, along with the distinctive role that could be played by single-case effect-size measures; and (c) the elevated internal validity and statistical-conclusion validity afforded by the incorporation of various forms of randomization into basic single-case design structures. For the future, we envision more widespread application of quantitative analyses, as critical adjuncts to visual analysis, in both primary single-case intervention research studies and literature reviews in the behavioral, educational, and health sciences. Copyright © 2014 Society for the Study of School Psychology. Published by Elsevier Ltd. All rights reserved.
Statistical implications of adjustments of raw ILI (In-Line Inspection) data

Energy Technology Data Exchange (ETDEWEB)

Timashev, Svyatoslav A.; Bushinskaya, Anna V. [Russian Academy of Sciences, Ekaterinburg (Russian Federation). Ural Branch. Sciences and Engineering Center 'Reliability and Safety of Large Systems and Machines']

2009-07-01

The paper describes the implications and inferences that inevitably arise when deliberate 'adjustments' of raw MFL ILI data are made when delivering the final report on conducted ILI and/or performing defect sizing and reliability assessments needed for pipeline integrity management plans (IMPs). The root causes of data adjustments are discussed, main types of adjustments are classified and the consequences as related to pipeline residual life, reliability and safety are described. A comparison is performed between adjustment and the full statistical analysis (FSA), as applied to raw ILI and verification data. The consequences of defect data adjustment as related to pipeline reliability and POF and possible litigation issues are discussed. Case studies are presented which demonstrate the application of the FSA method to the results of ILI and verification measurements on pipelines that are located on three continents. Some assessments of the actual reliability of pipelines with defects are given. (author)
Linnorm: improved statistical analysis for single cell RNA-seq expression data.

Science.gov (United States)

Yip, Shun H; Wang, Panwen; Kocher, Jean-Pierre A; Sham, Pak Chung; Wang, Junwen

2017-12-15

Linnorm is a novel normalization and transformation method for the analysis of single cell RNA sequencing (scRNA-seq) data. Linnorm is developed to remove technical noises and simultaneously preserve biological variations in scRNA-seq data, such that existing statistical methods can be improved. Using real scRNA-seq data, we compared Linnorm with existing normalization methods, including NODES, SAMstrt, SCnorm, scran, DESeq and TMM. Linnorm shows advantages in speed, technical noise removal and preservation of cell heterogeneity, which can improve existing methods in the discovery of novel subtypes, pseudo-temporal ordering of cells, clustering analysis, etc. Linnorm also performs better than existing DEG analysis methods, including BASiCS, NODES, SAMstrt, Seurat and DESeq2, in false positive rate control and accuracy. © The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research.
A Statistical Primer: Understanding Descriptive and Inferential Statistics

OpenAIRE

Gillian Byrne

2007-01-01

As libraries and librarians move more towards evidence‐based decision making, the data being generated in libraries is growing. Understanding the basics of statistical analysis is crucial for evidence‐based practice (EBP), in order to correctly design and analyze research as well as to evaluate the research of others. This article covers the fundamentals of descriptive and inferential statistics, from hypothesis construction to sampling to common statistical techniques including chi‐square, co...
A methodology for spatial data selection for statistical downscaling purposes. A case study of precipitation in southwestern Europe

Energy Technology Data Exchange (ETDEWEB)

Woth, K. [GKSS-Forschungszentrum Geesthacht GmbH (Germany). Inst. fuer Kuestenforschung

2001-07-01

In this study, the sensitivity of the estimation of small-scale climate variables using the technique of statistical downscaling is investigated and one method to select the most suitable input data is presented. For the example of precipitation in southwest Europe, the input data are selected systematically by extracting those stations that show a strong statistical relation in time with North Atlantic sea level pressure (SLP). From these stations the sector of North Atlantic SLP is selected that best explains the dominant spatial pattern of regional precipitation. For comparison, one alternative, slightly different geographical box is used. For both sectors a statistical model for the estimation of future rainfall in the southwest of Europe is constructed. It is shown that the method of statistical downscaling is sensitive to small changes of the input data and that the estimations of future precipitation show remarkable differences for the two different Atlantic SLP sectors considered. Possible reasons are discussed. (orig.)
Advanced data analysis in neuroscience integrating statistical and computational models

CERN Document Server

Durstewitz, Daniel

2017-01-01

This book is intended for use in advanced graduate courses in statistics / machine learning, as well as for all experimental neuroscientists seeking to understand statistical methods at a deeper level, and theoretical neuroscientists with a limited background in statistics. It reviews almost all areas of applied statistics, from basic statistical estimation and test theory, linear and nonlinear approaches for regression and classification, to model selection and methods for dimensionality reduction, density estimation and unsupervised clustering. Its focus, however, is linear and nonlinear time series analysis from a dynamical systems perspective, based on which it aims to convey an understanding also of the dynamical mechanisms that could have generated observed time series. Further, it integrates computational modeling of behavioral and neural dynamics with statistical estimation and hypothesis testing. This way computational models in neuroscience are not only explanatory frameworks, but become powerfu...
Statistical test data selection for reliability evaluation of process computer software

International Nuclear Information System (INIS)

Volkmann, K.P.; Hoermann, H.; Ehrenberger, W.

1976-01-01

The paper presents a concept for converting knowledge about the characteristics of process states into practicable procedures for the statistical selection of test cases in testing process computer software. Process states are defined as vectors whose components consist of values of input variables lying in discrete positions or within given limits. Two approaches for test data selection, based on knowledge about cases of demand, are outlined referring to a purely probabilistic method and to the mathematics of stratified sampling. (orig.) [de]
Understanding advanced statistical methods

CERN Document Server

Westfall, Peter

2013-01-01

Introduction: Probability, Statistics, and Science; Reality, Nature, Science, and Models; Statistical Processes: Nature, Design and Measurement, and Data; Models; Deterministic Models; Variability; Parameters; Purely Probabilistic Statistical Models; Statistical Models with Both Deterministic and Probabilistic Components; Statistical Inference; Good and Bad Models; Uses of Probability Models; Random Variables and Their Probability Distributions; Introduction; Types of Random Variables: Nominal, Ordinal, and Continuous; Discrete Probability Distribution Functions; Continuous Probability Distribution Functions; Some Calculus-Derivatives and Least Squares; More Calculus-Integrals and Cumulative Distribution Functions; Probability Calculation and Simulation; Introduction; Analytic Calculations, Discrete and Continuous Cases; Simulation-Based Approximation; Generating Random Numbers; Identifying Distributions; Introduction; Identifying Distributions from Theory Alone; Using Data: Estimating Distributions via the Histogram; Quantiles: Theoretical and Data-Based Estimate...

An omnibus likelihood test statistic and its factorization for change detection in time series of polarimetric SAR data

DEFF Research Database (Denmark)

Nielsen, Allan Aasbjerg; Conradsen, Knut; Skriver, Henning

2016-01-01

Based on an omnibus likelihood ratio test statistic for the equality of several variance-covariance matrices following the complex Wishart distribution with an associated p-value and a factorization of this test statistic, change analysis in a short sequence of multilook, polarimetric SAR data in the covariance matrix representation is carried out. The omnibus test statistic and its factorization detect if and when change(s) occur. The technique is demonstrated on airborne EMISAR L-band data but may be applied to Sentinel-1, Cosmo-SkyMed, TerraSAR-X, ALOS and RadarSat-2 or other dual- and quad...
Change detection in a time series of polarimetric SAR data by an omnibus test statistic and its factorization

DEFF Research Database (Denmark)

Nielsen, Allan Aasbjerg; Conradsen, Knut; Skriver, Henning

2016-01-01

Based on an omnibus likelihood ratio test statistic for the equality of several variance-covariance matrices following the complex Wishart distribution with an associated p-value and a factorization of this test statistic, change analysis in a short sequence of multilook, polarimetric SAR data in the covariance matrix representation is carried out. The omnibus test statistic and its factorization detect if and when change(s) occur. The technique is demonstrated on airborne EMISAR L-band data but may be applied to Sentinel-1, Cosmo-SkyMed, TerraSAR-X, ALOS and RadarSat-2 or other dual- and quad...
Data analysis in high energy physics. A practical guide to statistical methods

International Nuclear Information System (INIS)

Behnke, Olaf; Schoerner-Sadenius, Thomas; Kroeninger, Kevin; Schott, Gregory

2013-01-01

This practical guide covers the essential tasks in statistical data analysis encountered in high energy physics and provides comprehensive advice for typical questions and problems. The basic methods for inferring results from data are presented as well as tools for advanced tasks such as improving the signal-to-background ratio, correcting detector effects, determining systematics and many others. Concrete applications are discussed in analysis walkthroughs. Each chapter is supplemented by numerous examples and exercises and by a list of literature and relevant links. The book targets a broad readership at all career levels - from students to senior researchers.
Modular reweighting software for statistical mechanical analysis of biased equilibrium data

Science.gov (United States)

Sindhikara, Daniel J.

2012-07-01

Here a simple, useful, modular approach and software suite designed for statistical reweighting and analysis of equilibrium ensembles is presented. Statistical reweighting is useful and sometimes necessary for analysis of equilibrium enhanced sampling methods, such as umbrella sampling or replica exchange, and also in experimental cases where biasing factors are explicitly known. Essentially, statistical reweighting allows extrapolation of data from one or more equilibrium ensembles to another. Here, the fundamental separable steps of statistical reweighting are broken up into modules, allowing for application to the general case and avoiding the black-box nature of some “all-inclusive” reweighting programs. Additionally, the programs included are, by design, written with few dependencies. The compilers required are either pre-installed on most systems, or freely available for download with minimal trouble. Examples of the use of this suite applied to umbrella sampling and replica exchange molecular dynamics simulations will be shown along with advice on how to apply it in the general case. New version program summary. Program title: Modular reweighting version 2; Catalogue identifier: AEJH_v2_0; Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AEJH_v2_0.html; Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland; Licensing provisions: GNU General Public License, version 3; No. of lines in distributed program, including test data, etc.: 179 118; No. of bytes in distributed program, including test data, etc.: 8 518 178; Distribution format: tar.gz; Programming language: C++, Python 2.6+, Perl 5+; Computer: Any; Operating system: Any; RAM: 50-500 MB; Supplementary material: An updated version of the original manuscript (Comput. Phys. Commun. 182 (2011) 2227) is available; Classification: 4.13; Catalogue identifier of previous version: AEJH_v1_0; Journal reference of previous version: Comput. Phys. Commun. 182 (2011) 2227; Does the new
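The core idea of statistical reweighting with a known bias can be sketched in a few lines. This is a generic single-ensemble illustration, not the API of the software suite described above: the function name is hypothetical, and the bias energies are assumed to be given in units where beta times energy is dimensionless.

```python
import math


def reweighted_average(observable, bias_energy, beta):
    """Estimate an unbiased ensemble average from samples drawn under a known
    bias potential. Each sample is weighted by exp(+beta * V_bias), which
    cancels the exp(-beta * V_bias) factor introduced by the biased sampling."""
    if len(observable) != len(bias_energy) or not observable:
        raise ValueError("need one bias energy per sample")
    # Subtract the largest exponent before exponentiating for numerical stability.
    shift = max(beta * v for v in bias_energy)
    weights = [math.exp(beta * v - shift) for v in bias_energy]
    norm = sum(weights)
    return sum(w * a for w, a in zip(weights, observable)) / norm


# Three hypothetical samples; the third was disfavoured by the bias (V = ln 2 at beta = 1),
# so it counts twice as much once the bias is removed.
values = [1.0, 2.0, 3.0]
unbiased = reweighted_average(values, [0.0, 0.0, math.log(2.0)], beta=1.0)
no_bias = reweighted_average(values, [0.0, 0.0, 0.0], beta=1.0)
print(unbiased, no_bias)
```

Combining several biased windows, as in umbrella sampling, additionally requires estimating the relative free energies of the windows, which this sketch does not attempt.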
Selecting the right statistical model for analysis of insect count data by using information theoretic measures.

Science.gov (United States)

Sileshi, G

2006-10-01

Researchers and regulatory agencies often make statistical inferences from insect count data using modelling approaches that assume homogeneous variance. Such models do not allow for formal appraisal of variability which in its different forms is the subject of interest in ecology. Therefore, the objectives of this paper were to (i) compare models suitable for handling variance heterogeneity and (ii) select optimal models to ensure valid statistical inferences from insect count data. The log-normal, standard Poisson, Poisson corrected for overdispersion, zero-inflated Poisson, the negative binomial distribution and zero-inflated negative binomial models were compared using six count datasets on foliage-dwelling insects and five families of soil-dwelling insects. Akaike's and Schwarz Bayesian information criteria were used for comparing the various models. Over 50% of the counts were zeros even in locally abundant species such as Ootheca bennigseni Weise, Mesoplatys ochroptera Stål and Diaecoderus spp. The Poisson model after correction for overdispersion and the standard negative binomial distribution model provided better description of the probability distribution of seven out of the 11 insects than the log-normal, standard Poisson, zero-inflated Poisson or zero-inflated negative binomial models. It is concluded that excess zeros and variance heterogeneity are common data phenomena in insect counts. If not properly modelled, these properties can invalidate the normal distribution assumptions resulting in biased estimation of ecological effects and jeopardizing the integrity of the scientific inferences. Therefore, it is recommended that statistical models appropriate for handling these data properties be selected using objective criteria to ensure efficient statistical inference.
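The model comparison described in this abstract can be illustrated with a small sketch that fits intercept-only Poisson and negative binomial models to zero-heavy counts and compares them by AIC and BIC. The counts are invented, the dispersion is found by a crude grid search rather than a proper optimiser, and the zero-inflated and log-normal models from the paper are omitted.

```python
import math


def poisson_loglik(counts, mu):
    """Log-likelihood of i.i.d. Poisson counts with mean mu."""
    return sum(y * math.log(mu) - mu - math.lgamma(y + 1) for y in counts)


def negbin_loglik(counts, mu, k):
    """Log-likelihood of i.i.d. negative binomial counts with mean mu and
    dispersion k (variance = mu + mu**2 / k)."""
    ll = 0.0
    for y in counts:
        ll += (math.lgamma(y + k) - math.lgamma(k) - math.lgamma(y + 1)
               + k * math.log(k / (k + mu)) + y * math.log(mu / (k + mu)))
    return ll


def aic(loglik, n_params):
    return 2 * n_params - 2 * loglik


def bic(loglik, n_params, n_obs):
    return n_params * math.log(n_obs) - 2 * loglik


# Hypothetical insect counts: 60% zeros and a long right tail.
counts = [0] * 12 + [1, 1, 2, 3, 5, 8, 13, 21]
n = len(counts)

# For an intercept-only model the sample mean is the ML estimate of the mean
# under both distributions; only the dispersion k needs a search.
mu_hat = sum(counts) / n
k_grid = [0.05 * i for i in range(1, 201)]
k_hat = max(k_grid, key=lambda k: negbin_loglik(counts, mu_hat, k))

ll_pois = poisson_loglik(counts, mu_hat)
ll_nb = negbin_loglik(counts, mu_hat, k_hat)
scores = {
    "Poisson": (aic(ll_pois, 1), bic(ll_pois, 1, n)),
    "negative binomial": (aic(ll_nb, 2), bic(ll_nb, 2, n)),
}
for name, (a, b) in scores.items():
    print(f"{name}: AIC={a:.1f}, BIC={b:.1f}")
```

Lower values indicate the preferred model. With overdispersed data like these, the extra dispersion parameter of the negative binomial is easily justified despite the penalty term.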
A new statistic for identifying batch effects in high-throughput genomic data that uses guided principal component analysis.

Science.gov (United States)

Reese, Sarah E; Archer, Kellie J; Therneau, Terry M; Atkinson, Elizabeth J; Vachon, Celine M; de Andrade, Mariza; Kocher, Jean-Pierre A; Eckel-Passow, Jeanette E

2013-11-15

Batch effects are due to probe-specific systematic variation between groups of samples (batches) resulting from experimental features that are not of biological interest. Principal component analysis (PCA) is commonly used as a visual tool to determine whether batch effects exist after applying a global normalization method. However, PCA yields linear combinations of the variables that contribute maximum variance and thus will not necessarily detect batch effects if they are not the largest source of variability in the data. We present an extension of PCA to quantify the existence of batch effects, called guided PCA (gPCA). We describe a test statistic that uses gPCA to test whether a batch effect exists. We apply our proposed test statistic derived using gPCA to simulated data and to two copy number variation case studies: the first study consisted of 614 samples from a breast cancer family study using Illumina Human 660 bead-chip arrays, whereas the second case study consisted of 703 samples from a family blood pressure study that used Affymetrix SNP Array 6.0. We demonstrate that our statistic has good statistical properties and is able to identify significant batch effects in two copy number variation case studies. We developed a new statistic that uses gPCA to identify whether batch effects exist in high-throughput genomic data. Although our examples pertain to copy number data, gPCA is general and can be used on other data types as well. The gPCA R package (available via CRAN) provides functionality and data to perform the methods in this article. Contact: reesese@vcu.edu
Landslide susceptibility mapping using GIS-based statistical models and Remote sensing data in tropical environment.

Science.gov (United States)

Shahabi, Himan; Hashim, Mazlan

2015-04-22

This research presents the results of the GIS-based statistical models for generation of landslide susceptibility mapping using geographic information system (GIS) and remote-sensing data for Cameron Highlands area in Malaysia. Ten factors including slope, aspect, soil, lithology, NDVI, land cover, distance to drainage, precipitation, distance to fault, and distance to road were extracted from SAR data, SPOT 5 and WorldView-1 images. The relationships between the detected landslide locations and these ten related factors were identified by using GIS-based statistical models including analytical hierarchy process (AHP), weighted linear combination (WLC) and spatial multi-criteria evaluation (SMCE) models. The landslide inventory map which has a total of 92 landslide locations was created based on numerous resources such as digital aerial photographs, AIRSAR data, WorldView-1 images, and field surveys. Then, 80% of the landslide inventory was used for training the statistical models and the remaining 20% was used for validation purpose. The validation results using the Relative landslide density index (R-index) and Receiver operating characteristic (ROC) demonstrated that the SMCE model (accuracy is 96%) is better in prediction than AHP (accuracy is 91%) and WLC (accuracy is 89%) models. These landslide susceptibility maps would be useful for hazard mitigation purpose and regional planning.
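Two of the ingredients named above, a weighted linear combination (WLC) score and the area under the ROC curve used for validation, can be sketched as follows. This is a hedged illustration only: the factor names, weights and scores are hypothetical, and the code does not reproduce the authors' GIS workflow.

```python
def wlc_score(factors, weights):
    """Weighted linear combination of standardised factor values (each in 0..1).
    Weights are normalised so the score also lies in 0..1."""
    total = sum(weights.values())
    return sum(weights[name] * factors[name] for name in weights) / total

def roc_auc(scores_pos, scores_neg):
    """Area under the ROC curve: the probability that a landslide cell receives a
    higher susceptibility score than a non-landslide cell (ties count one half)."""
    wins = 0.0
    for p in scores_pos:
        for q in scores_neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# Hypothetical cell: two standardised factors and assumed weights.
cell = {"slope": 1.0, "ndvi": 0.0}
assumed_weights = {"slope": 3.0, "ndvi": 1.0}
cell_score = wlc_score(cell, assumed_weights)
```

In an AHP-based variant the weights would come from a pairwise comparison matrix rather than being set directly; the scoring and validation steps stay the same.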
Multivariate two-part statistics for analysis of correlated mass spectrometry data from multiple biological specimens.

Science.gov (United States)

Taylor, Sandra L; Ruhaak, L Renee; Weiss, Robert H; Kelly, Karen; Kim, Kyoungmi

2017-01-01

High-throughput mass spectrometry (MS) is now being used to profile small molecular compounds across multiple biological sample types from the same subjects with the goal of leveraging information across biospecimens. Multivariate statistical methods that combine information from all biospecimens could be more powerful than the usual univariate analyses. However, missing values are common in MS data and imputation can impact between-biospecimen correlation and multivariate analysis results. We propose two multivariate two-part statistics that accommodate missing values and combine data from all biospecimens to identify differentially regulated compounds. Statistical significance is determined using a multivariate permutation null distribution. Relative to univariate tests, the multivariate procedures detected more significant compounds in three biological datasets. In a simulation study, we showed that multi-biospecimen testing procedures were more powerful than single-biospecimen methods when compounds are differentially regulated in multiple biospecimens, but univariate methods can be more powerful if compounds are differentially regulated in only one biospecimen. We provide R functions to implement and illustrate our method as supplementary information.
Partial discharge testing: a progress report. Statistical evaluation of PD data

International Nuclear Information System (INIS)

Warren, V.; Allan, J.

2005-01-01

It has long been known that comparing the partial discharge results obtained from a single machine is a valuable tool enabling companies to observe the gradual deterioration of a machine stator winding and thus plan appropriate maintenance for the machine. In 1998, at the annual Iris Rotating Machines Conference (IRMC), a paper was presented that compared thousands of PD test results to establish the criteria for comparing results from different machines and the expected PD levels. At subsequent annual Iris conferences, using similar analytical procedures, papers were presented that supported the previous criteria and: in 1999, established sensor location as an additional criterion; in 2000, evaluated the effect of insulation type and age on PD activity; in 2001, evaluated the effect of manufacturer on PD activity; in 2002, evaluated the effect of operating pressure for hydrogen-cooled machines; in 2003, evaluated the effect of insulation type and setting Trac alarms; in 2004, re-evaluated the effect of manufacturer on PD activity. Before going further in database analysis procedures, it would be prudent to statistically evaluate the anecdotal evidence observed to date. The goal was to determine which variables of machine conditions greatly influenced the PD results and which didn't. Therefore, this year's paper looks at the impact of operating voltage, machine type and winding type on the test results for air-cooled machines. Because of resource constraints, only data collected through 2003 was used; however, as before, it is still standardized for frequency bandwidth and pruned to include only full-load-hot (FLH) results collected for one sensor on operating machines. All questionable data, or data from off-line testing or unusual machine conditions was excluded, leaving 6824 results. Calibration of on-line PD test results is impractical; therefore, only results obtained using the same method of data collection and noise separation techniques are compared. For
Towards evidence-based computational statistics: lessons from clinical research on the role and design of real-data benchmark studies.

Science.gov (United States)

Boulesteix, Anne-Laure; Wilson, Rory; Hapfelmeier, Alexander

2017-09-09

The goal of medical research is to develop interventions that are in some sense superior, with respect to patient outcome, to interventions currently in use. Similarly, the goal of research in methodological computational statistics is to develop data analysis tools that are themselves superior to the existing tools. The methodology of the evaluation of medical interventions continues to be discussed extensively in the literature and it is now well accepted that medicine should be at least partly "evidence-based". Although we statisticians are convinced of the importance of unbiased, well-thought-out study designs and evidence-based approaches in the context of clinical research, we tend to ignore these principles when designing our own studies for evaluating statistical methods in the context of our methodological research. In this paper, we draw an analogy between clinical trials and real-data-based benchmarking experiments in methodological statistical science, with datasets playing the role of patients and methods playing the role of medical interventions. Through this analogy, we suggest directions for improvement in the design and interpretation of studies which use real data to evaluate statistical methods, in particular with respect to dataset inclusion criteria and the reduction of various forms of bias. More generally, we discuss the concept of "evidence-based" statistical research, its limitations and its impact on the design and interpretation of real-data-based benchmark experiments. We suggest that benchmark studies-a method of assessment of statistical methods using real-world datasets-might benefit from adopting (some) concepts from evidence-based medicine towards the goal of more evidence-based statistical research.
Statistical methods in regression and calibration analysis of chromosome aberration data

International Nuclear Information System (INIS)

Merkle, W.

1983-01-01

The method of iteratively reweighted least squares for the regression analysis of Poisson distributed chromosome aberration data is reviewed in the context of other fit procedures used in the cytogenetic literature. As an application of the resulting regression curves, methods for calculating confidence intervals on dose from aberration yield are described and compared, and, for the linear-quadratic model, a confidence interval is given. Emphasis is placed on the rational interpretation and the limitations of various methods from a statistical point of view. (orig./MG)
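A minimal sketch of iteratively reweighted least squares for a linear-quadratic dose response follows. It assumes the expected aberration count at dose D in n scored cells is n(alpha D + beta D^2) with Poisson variance, so each pass is a weighted least-squares fit with weights 1/mu. The function name and the toy data are hypothetical, and the report may use a different parameterisation.

```python
def irls_linear_quadratic(doses, cells, aberrations, iters=25):
    """Fit E[x_i] = n_i * (alpha * D_i + beta * D_i**2) by IRLS.

    With Poisson counts and an identity link, Fisher scoring reduces to
    repeated weighted least squares with weights 1 / mu_i.
    """
    # The expected count is linear in (alpha, beta) with these two columns.
    col_a = [n * d for n, d in zip(cells, doses)]
    col_b = [n * d * d for n, d in zip(cells, doses)]
    # Initial means from the observed counts; the offset avoids division by zero.
    mu = [x + 0.5 for x in aberrations]
    alpha = beta = 0.0
    for _ in range(iters):
        w = [1.0 / m for m in mu]
        saa = sum(wi * a * a for wi, a in zip(w, col_a))
        sab = sum(wi * a * b for wi, a, b in zip(w, col_a, col_b))
        sbb = sum(wi * b * b for wi, b in zip(w, col_b))
        sax = sum(wi * a * x for wi, a, x in zip(w, col_a, aberrations))
        sbx = sum(wi * b * x for wi, b, x in zip(w, col_b, aberrations))
        # Solve the 2x2 weighted normal equations by Cramer's rule.
        det = saa * sbb - sab * sab
        alpha = (sax * sbb - sab * sbx) / det
        beta = (saa * sbx - sab * sax) / det
        # Update the fitted means, keeping them strictly positive.
        mu = [max(alpha * a + beta * b, 1e-12) for a, b in zip(col_a, col_b)]
    return alpha, beta

# Hypothetical data constructed to lie exactly on the curve with
# alpha = 0.02 and beta = 0.06 per cell (1000 cells scored per dose).
toy_doses = [0.5, 1.0, 2.0, 3.0, 4.0]
toy_cells = [1000] * 5
toy_counts = [25, 80, 280, 600, 1040]
alpha_hat, beta_hat = irls_linear_quadratic(toy_doses, toy_cells, toy_counts)
```

Because the toy counts sit exactly on the model curve, the fit recovers the generating coefficients up to rounding. With real counts the weights change between passes until the estimates settle.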
Robust functional statistics applied to Probability Density Function shape screening of sEMG data.

Science.gov (United States)

Boudaoud, S; Rix, H; Al Harrach, M; Marin, F

2014-01-01

Recent studies pointed out possible shape modifications of the Probability Density Function (PDF) of surface electromyographical (sEMG) data according to several contexts like fatigue and muscle force increase. Following this idea, criteria have been proposed to monitor these shape modifications mainly using High Order Statistics (HOS) parameters like skewness and kurtosis. In experimental conditions, these parameters are confronted with small sample size in the estimation process. This small sample size induces errors in the estimated HOS parameters restraining real-time and precise sEMG PDF shape monitoring. Recently, a functional formalism, the Core Shape Model (CSM), has been used to analyse shape modifications of PDF curves. In this work, taking inspiration from CSM method, robust functional statistics are proposed to emulate both skewness and kurtosis behaviors. These functional statistics combine both kernel density estimation and PDF shape distances to evaluate shape modifications even in presence of small sample size. Then, the proposed statistics are tested, using Monte Carlo simulations, on both normal and Log-normal PDFs that mimic observed sEMG PDF shape behavior during muscle contraction. According to the obtained results, the functional statistics seem to be more robust than HOS parameters to small sample size effect and more accurate in sEMG PDF shape screening applications.
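For reference, the High Order Statistics baseline mentioned above (sample skewness and kurtosis) can be computed from central moments as in the sketch below. This shows only the moment-based estimators whose small-sample behaviour the abstract criticises. It is not the proposed functional statistics, and the function name is an assumption.

```python
def skewness_kurtosis(samples):
    """Moment-based skewness m3 / m2**1.5 and kurtosis m4 / m2**2.
    Kurtosis is not excess kurtosis: a normal distribution gives 3."""
    n = len(samples)
    mean = sum(samples) / n
    m2 = sum((v - mean) ** 2 for v in samples) / n
    m3 = sum((v - mean) ** 3 for v in samples) / n
    m4 = sum((v - mean) ** 4 for v in samples) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2
```

Both estimators depend on third and fourth powers of the deviations, so a single extreme value in a short window can move them substantially, which is the sensitivity the abstract refers to.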
Statistical near-real-time accountancy procedures applied to AGNS [Allied General Nuclear Services] minirun data using PROSA

International Nuclear Information System (INIS)

Beedgen, R.

1988-03-01

The computer program PROSA (PROgram for Statistical Analysis of near-real-time accountancy data) was developed as a tool to apply statistical test procedures to a sequence of materials balance results for detecting losses of material. First applications of PROSA to model facility data and real plant data showed that PROSA is also usable as a tool for process or measurement control. To deepen the experience for the application of PROSA to real data of bulk-handling facilities, we applied it to uranium data of the Allied General Nuclear Services miniruns, where accountancy data were collected on a near-real-time basis. Minirun 6 especially was considered, and the pulsed columns were chosen as materials balance area. The structure of the measurement models for flow sheet data and actual operation data are compared, and methods are studied to reduce the error for inventory measurements of the columns
Analyzing large gene expression and methylation data profiles using StatBicRM: statistical biclustering-based rule mining.

Directory of Open Access Journals (Sweden)

Ujjwal Maulik

Microarray and beadchip are two most efficient techniques for measuring gene expression and methylation data in bioinformatics. Biclustering deals with the simultaneous clustering of genes and samples. In this article, we propose a computational rule mining framework, StatBicRM (i.e., statistical biclustering-based rule mining) to identify special type of rules and potential biomarkers using integrated approaches of statistical and binary inclusion-maximal biclustering techniques from the biological datasets. At first, a novel statistical strategy has been utilized to eliminate the insignificant/low-significant/redundant genes in such way that significance level must satisfy the data distribution property (viz., either normal distribution or non-normal distribution). The data is then discretized and post-discretized, consecutively. Thereafter, the biclustering technique is applied to identify maximal frequent closed homogeneous itemsets. Corresponding special type of rules are then extracted from the selected itemsets. Our proposed rule mining method performs better than the other rule mining algorithms as it generates maximal frequent closed homogeneous itemsets instead of frequent itemsets. Thus, it saves elapsed time, and can work on big dataset. Pathway and Gene Ontology analyses are conducted on the genes of the evolved rules using David database. Frequency analysis of the genes appearing in the evolved rules is performed to determine potential biomarkers. Furthermore, we also classify the data to know how much the evolved rules are able to describe accurately the remaining test (unknown) data. Subsequently, we also compare the average classification accuracy, and other related factors with other rule-based classifiers. Statistical significance tests are also performed for verifying the statistical relevance of the comparative results. Here, each of the other rule mining methods or rule-based classifiers is also starting with the same post
Analyzing large gene expression and methylation data profiles using StatBicRM: statistical biclustering-based rule mining.

Science.gov (United States)

Maulik, Ujjwal; Mallik, Saurav; Mukhopadhyay, Anirban; Bandyopadhyay, Sanghamitra

2015-01-01

Microarray and beadchip are two most efficient techniques for measuring gene expression and methylation data in bioinformatics. Biclustering deals with the simultaneous clustering of genes and samples. In this article, we propose a computational rule mining framework, StatBicRM (i.e., statistical biclustering-based rule mining) to identify special type of rules and potential biomarkers using integrated approaches of statistical and binary inclusion-maximal biclustering techniques from the biological datasets. At first, a novel statistical strategy has been utilized to eliminate the insignificant/low-significant/redundant genes in such way that significance level must satisfy the data distribution property (viz., either normal distribution or non-normal distribution). The data is then discretized and post-discretized, consecutively. Thereafter, the biclustering technique is applied to identify maximal frequent closed homogeneous itemsets. Corresponding special type of rules are then extracted from the selected itemsets. Our proposed rule mining method performs better than the other rule mining algorithms as it generates maximal frequent closed homogeneous itemsets instead of frequent itemsets. Thus, it saves elapsed time, and can work on big dataset. Pathway and Gene Ontology analyses are conducted on the genes of the evolved rules using David database. Frequency analysis of the genes appearing in the evolved rules is performed to determine potential biomarkers. Furthermore, we also classify the data to know how much the evolved rules are able to describe accurately the remaining test (unknown) data. Subsequently, we also compare the average classification accuracy, and other related factors with other rule-based classifiers. Statistical significance tests are also performed for verifying the statistical relevance of the comparative results. Here, each of the other rule mining methods or rule-based classifiers is also starting with the same post-discretized data
Making Women Count: Gender-Typing, Technology and Path Dependencies in Dutch Statistical Data Processing

NARCIS (Netherlands)

van den Ende, Jan; van Oost, Elizabeth C.J.

2001-01-01

This article is a longitudinal analysis of the relation between gendered labour divisions and new data processing technologies at the Dutch Central Bureau of Statistics (CBS). Following social-constructivist and evolutionary economic approaches, the authors hold that the relation between technology
Statistical Analysis Of Reconnaissance Geochemical Data From ...

African Journals Online (AJOL)

, Co, Mo, Hg, Sb, Tl, Sc, Cr, Ni, La, W, V, U, Th, Bi, Sr and Ga in 56 stream sediment samples collected from Orle drainage system were subjected to univariate and multivariate statistical analyses. The univariate methods used include ...
The International Coal Statistics Data Base program maintenance guide

International Nuclear Information System (INIS)

1991-06-01

The International Coal Statistics Data Base (ICSD) is a microcomputer-based system which contains information related to international coal trade. This includes coal production, consumption, imports and exports information. The ICSD is a secondary data base, meaning that information contained therein is derived entirely from other primary sources. It uses dBase III+ and Lotus 1-2-3 to locate, report and display data. The system is used for analysis in preparing the Annual Prospects for World Coal Trade (DOE/EIA-0363) publication. The ICSD system is menu driven and also permits the user who is familiar with dBase and Lotus operations to leave the menu structure to perform independent queries. Documentation for the ICSD consists of three manuals -- the User's Guide, the Operations Manual, and the Program Maintenance Manual. This Program Maintenance Manual provides the information necessary to maintain and update the ICSD system. Two major types of program maintenance documentation are presented in this manual. The first is the source code for the dBase III+ routines and related non-dBase programs used in operating the ICSD. The second is listings of the major component database field structures. A third important consideration for dBase programming, the structure of index files, is presented in the listing of source code for the index maintenance program. 1 fig
Statistical inference for imperfect maintenance models with missing data

International Nuclear Information System (INIS)

Dijoux, Yann; Fouladirad, Mitra; Nguyen, Dinh Tuan

2016-01-01

The paper considers complex industrial systems with incomplete maintenance history. A corrective maintenance is performed after the occurrence of a failure and its efficiency is assumed to be imperfect. In maintenance analysis, the databases are not necessarily complete. Specifically, the observations are assumed to be window-censored. This situation arises relatively frequently after the purchase of a second-hand unit or in the absence of maintenance record during the burn-in phase. The joint assessment of the wear-out of the system and the maintenance efficiency is investigated under missing data. A review along with extensions of statistical inference procedures from an observation window are proposed in the case of perfect and minimal repair using the renewal and Poisson theories, respectively. Virtual age models are employed to model imperfect repair. In this framework, new estimation procedures are developed. In particular, maximum likelihood estimation methods are derived for the most classical virtual age models. The benefits of the new estimation procedures are highlighted by numerical simulations and an application to a real data set. - Highlights: • New estimation procedures for window-censored observations and imperfect repair. • Extensions of inference methods for perfect and minimal repair with missing data. • Overview of maximum likelihood method with complete and incomplete observations. • Benefits of the new procedures highlighted by simulation studies and real application.
Applying Statistical Process Control to Clinical Data: An Illustration.

Science.gov (United States)

Pfadt, Al; And Others

1992-01-01

Principles of statistical process control are applied to a clinical setting through the use of control charts to detect changes, as part of treatment planning and clinical decision-making processes. The logic of control chart analysis is derived from principles of statistical inference. Sample charts offer examples of evaluating baselines and…

Statistical Validation of Calibrated Wind Data Collected From NOAA's Hurricane Hunter Aircraft

Science.gov (United States)

Graham, K.; Sears, I. T.; Holmes, M.; Henning, R. G.; Damiano, A. B.; Parrish, J. R.; Flaherty, P. T.

2015-12-01

Obtaining accurate in situ meteorological measurements from the NOAA G-IV Hurricane Hunter Aircraft currently requires annual wind calibration flights. This project attempts to demonstrate whether an alternative to wind calibration flights can be implemented using data collected from many previous hurricane, winter storm, and surveying flights. Wind derivations require using airplane attack and slip angles, airplane pitch, pressure differentials, dynamic pressures, ground speeds, true air speeds, and several other variables measured by instruments on the aircraft. Through the use of linear regression models, future wind measurements may be fit to past statistical models. This method of wind calibration could replace the need for annual wind calibration flights, decreasing NOAA expenses and providing more accurate data. This would help to ensure all data users have reliable data and ultimately contribute to NOAA's goal of building a Weather-Ready Nation.
Statistical methods

CERN Document Server

Szulc, Stefan

1965-01-01

Statistical Methods provides a discussion of the principles of the organization and technique of research, with emphasis on its application to the problems in social statistics. This book discusses branch statistics, which aims to develop practical ways of collecting and processing numerical data and to adapt general statistical methods to the objectives in a given field. Organized into five parts encompassing 22 chapters, this book begins with an overview of how to organize the collection of such information on individual units, primarily as accomplished by government agencies. This text then
Statistics without Tears: Complex Statistics with Simple Arithmetic

Science.gov (United States)

Smith, Brian

2011-01-01

One of the often overlooked aspects of modern statistics is the analysis of time series data. Modern introductory statistics courses tend to rush to probabilistic applications involving risk and confidence. Rarely does the first level course linger on such useful and fascinating topics as time series decomposition, with its practical applications…
Statistical Reporting Errors and Collaboration on Statistical Analyses in Psychological Science.

Science.gov (United States)

Veldkamp, Coosje L S; Nuijten, Michèle B; Dominguez-Alvarez, Linda; van Assen, Marcel A L M; Wicherts, Jelte M

2014-01-01

Statistical analysis is error prone. A best practice for researchers using statistics would therefore be to share data among co-authors, allowing double-checking of executed tasks just as co-pilots do in aviation. To document the extent to which this 'co-piloting' currently occurs in psychology, we surveyed the authors of 697 articles published in six top psychology journals and asked them whether they had collaborated on four aspects of analyzing data and reporting results, and whether the described data had been shared between the authors. We acquired responses for 49.6% of the articles and found that co-piloting on statistical analysis and reporting results is quite uncommon among psychologists, while data sharing among co-authors seems reasonably but not completely standard. We then used an automated procedure to study the prevalence of statistical reporting errors in the articles in our sample and examined the relationship between reporting errors and co-piloting. Overall, 63% of the articles contained at least one p-value that was inconsistent with the reported test statistic and the accompanying degrees of freedom, and 20% of the articles contained at least one p-value that was inconsistent to such a degree that it may have affected decisions about statistical significance. Overall, the probability that a given p-value was inconsistent was over 10%. Co-piloting was not found to be associated with reporting errors.
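The idea of an automated consistency check can be illustrated for the simplest case, a two-sided z-test, where the p-value follows from the test statistic alone. This is a simplified sketch and not the procedure used in the study: the function names, the two-decimal rounding rule and the 0.05 threshold are assumptions, and t, F or chi-square tests would also need the degrees of freedom.

```python
import math

def two_sided_p_from_z(z):
    """Two-sided p-value of a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))

def check_reported_p(z, reported_p, decimals=2, alpha=0.05):
    """Recompute p from z and compare it with the reported value.

    Returns (recomputed_p, inconsistent, decision_error), where
    decision_error flags an inconsistency that flips significance at alpha.
    """
    recomputed = two_sided_p_from_z(z)
    inconsistent = round(recomputed, decimals) != round(reported_p, decimals)
    decision_error = inconsistent and ((recomputed < alpha) != (reported_p < alpha))
    return recomputed, inconsistent, decision_error
```

The two flags correspond to the two rates quoted above: any inconsistent p-value, and the subset of inconsistencies large enough to change a significance decision.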
Assessing thermal comfort and energy efficiency in buildings by statistical quality control for autocorrelated data

International Nuclear Information System (INIS)

Barbeito, Inés; Zaragoza, Sonia; Tarrío-Saavedra, Javier; Naya, Salvador

2017-01-01

Highlights: • Intelligent web platform development for energy efficiency management in buildings. • Controlling and supervising thermal comfort and energy consumption in buildings. • Statistical quality control procedure to deal with autocorrelated data. • Open source alternative using R software. - Abstract: In this paper, a case study of performing a reliable statistical procedure to evaluate the quality of HVAC systems in buildings using data retrieved from an ad hoc big data web energy platform is presented. The proposed methodology based on statistical quality control (SQC) is used to analyze the real state of thermal comfort and energy efficiency of the offices of the company FRIDAMA (Spain) in a reliable way. Non-conformities or alarms, and the actual assignable causes of these out-of-control states, are detected. The capability to meet specification requirements is also analyzed. Tools and packages implemented in the open-source R software are employed to apply the different procedures. First, this study proposes to fit ARIMA time series models to CTQ variables. Then, the application of Shewhart and EWMA control charts to the time series residuals is proposed to control and monitor thermal comfort and energy consumption in buildings. Once thermal comfort and consumption variability are estimated, the implementation of capability indexes for autocorrelated variables is proposed to calculate the degree to which standard specifications are met. According to the case study results, the proposed methodology has detected real anomalies in the HVAC installation, helping to detect assignable causes and to make appropriate decisions. One of the goals is to perform and describe this statistical procedure step by step so that practitioners can replicate it more easily.
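The two-stage idea described above (model the autocorrelation first, then chart the residuals) can be sketched with an AR(1) model standing in for the ARIMA fit. The paper uses R packages; the code below is an independent illustration, with hypothetical function names and the commonly used EWMA settings lambda = 0.2 and L = 3 as assumptions.

```python
import math

def ar1_residuals(series):
    """Least-squares fit of x_t = c + phi * x_{t-1} + e_t.
    Returns (c, phi, residuals)."""
    prev, nxt = series[:-1], series[1:]
    n = len(prev)
    mean_prev = sum(prev) / n
    mean_next = sum(nxt) / n
    sxx = sum((p - mean_prev) ** 2 for p in prev)
    sxy = sum((p - mean_prev) * (q - mean_next) for p, q in zip(prev, nxt))
    phi = sxy / sxx
    c = mean_next - phi * mean_prev
    residuals = [q - (c + phi * p) for p, q in zip(prev, nxt)]
    return c, phi, residuals

def ewma_chart(residuals, lam=0.2, L=3.0, sigma=None):
    """EWMA chart on residuals with target 0.
    Returns a list of (ewma, lower_limit, upper_limit, out_of_control)."""
    if sigma is None:
        # Residuals of a least-squares fit with intercept have zero mean.
        sigma = math.sqrt(sum(r * r for r in residuals) / (len(residuals) - 1))
    z = 0.0
    points = []
    for t, r in enumerate(residuals, start=1):
        z = lam * r + (1.0 - lam) * z
        # Time-varying limits: L * sigma * sqrt(lam/(2-lam) * (1 - (1-lam)^(2t))).
        half = L * sigma * math.sqrt(lam / (2.0 - lam) * (1.0 - (1.0 - lam) ** (2 * t)))
        points.append((z, -half, half, abs(z) > half))
    return points
```

Charting residuals rather than raw readings matters because autocorrelated measurements violate the independence assumption behind the usual control limits and would otherwise produce frequent false alarms.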
Sparse Power-Law Network Model for Reliable Statistical Predictions Based on Sampled Data

Directory of Open Access Journals (Sweden)

Alexander P. Kartun-Giles

2018-04-01

A projective network model is a model that enables predictions to be made based on a subsample of the network data, with the predictions remaining unchanged if a larger sample is taken into consideration. An exchangeable model is a model that does not depend on the order in which nodes are sampled. Despite a large variety of non-equilibrium (growing) and equilibrium (static) sparse complex network models that are widely used in network science, how to reconcile sparseness (constant average degree) with the desired statistical properties of projectivity and exchangeability is currently an outstanding scientific problem. Here we propose a network process with hidden variables which is projective and can generate sparse power-law networks. Despite the model not being exchangeable, it can be closely related to exchangeable uncorrelated networks as indicated by its information theory characterization and its network entropy. The use of the proposed network process as a null model is here tested on real data, indicating that the model offers a promising avenue for statistical network modelling.
Statistical Modelling of Wind Profiles - Data Analysis and Modelling

DEFF Research Database (Denmark)

Jónsson, Tryggvi; Pinson, Pierre

The aim of the analysis presented in this document is to investigate whether statistical models can be used to make very short-term predictions of wind profiles.
Towards evidence-based computational statistics: lessons from clinical research on the role and design of real-data benchmark studies

Directory of Open Access Journals (Sweden)

Anne-Laure Boulesteix

2017-09-01

Background: The goal of medical research is to develop interventions that are in some sense superior, with respect to patient outcome, to interventions currently in use. Similarly, the goal of research in methodological computational statistics is to develop data analysis tools that are themselves superior to the existing tools. The methodology of the evaluation of medical interventions continues to be discussed extensively in the literature and it is now well accepted that medicine should be at least partly “evidence-based”. Although we statisticians are convinced of the importance of unbiased, well-thought-out study designs and evidence-based approaches in the context of clinical research, we tend to ignore these principles when designing our own studies for evaluating statistical methods in the context of our methodological research. Main message: In this paper, we draw an analogy between clinical trials and real-data-based benchmarking experiments in methodological statistical science, with datasets playing the role of patients and methods playing the role of medical interventions. Through this analogy, we suggest directions for improvement in the design and interpretation of studies which use real data to evaluate statistical methods, in particular with respect to dataset inclusion criteria and the reduction of various forms of bias. More generally, we discuss the concept of “evidence-based” statistical research, its limitations and its impact on the design and interpretation of real-data-based benchmark experiments. Conclusion: We suggest that benchmark studies, a method of assessment of statistical methods using real-world datasets, might benefit from adopting (some) concepts from evidence-based medicine towards the goal of more evidence-based statistical research.
Spatial Statistics and Spatio-Temporal Data Covariance Functions and Directional Properties

CERN Document Server

Sherman, Michael

2010-01-01

In the spatial or space-time context, specifying the correct covariance function is important to obtain efficient predictions and to understand the underlying physical process of interest. There have been several books in recent years in the general area of spatial statistics. This book focuses on covariance and variogram functions, their role in prediction, and the proper choice of these functions in data applications. Presenting recent methods from 2004-2007 alongside more established methodology of assessing the usual assumptions on such functions such as isotropy, separability and symmetry
Statistical process control for serially correlated data

NARCIS (Netherlands)

Wieringa, Jakob Edo

1999-01-01

Statistical Process Control (SPC) aims at quality improvement through reduction of variation. The best known tool of SPC is the control chart. Over the years, the control chart has proved to be a successful practical technique for monitoring process measurements. However, its usefulness in practice
Low-cost data acquisition systems for photovoltaic system monitoring and usage statistics

Science.gov (United States)

Fanourakis, S.; Wang, K.; McCarthy, P.; Jiao, L.

2017-11-01

This paper presents the design of a low-cost data acquisition system for monitoring a photovoltaic system’s electrical quantities, battery temperatures, and state of charge of the battery. The electrical quantities are the voltages and currents of the solar panels, the battery, and the system loads. The system uses an Atmega328p microcontroller to acquire data from the photovoltaic system’s charge controller. It also records individual load information using current sensing resistors along with a voltage amplification circuit and an analog to digital converter. The system is used in conjunction with a wall power data acquisition system for the recording of regional power outages. Both data acquisition systems record data in micro SD cards. The data has been successfully acquired from both systems and has been used to monitor the status of the PV system and the local power grid. As more data is gathered it can be used for the maintenance and improvement of the photovoltaic system through analysis of the photovoltaic system’s parameters and usage statistics.
Analysis of stress corrosion data by means of the statistic of extreme values

International Nuclear Information System (INIS)

Imarisio, G.; Lanza, F.

1978-01-01

The possibility of examining stress corrosion by means of extreme value statistics was proposed. A series of tests in boiling MgCl2 was performed on samples made of AISI 304. The evolution of crack dimensions and the lifetime of the samples were followed. It has been shown that the dimensions of the maximum cracks on samples corroded for different times can be described by extreme value statistics. The lifetime of the samples can be treated in the same way. A confirmation has been obtained using data taken from the literature. Possible uses of predictions obtained with this type of analysis are highlighted. An extension toward less corrosive media and samples of several volumes is suggested to check the validity of the method.
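As an illustration of treating maximum crack dimensions with extreme value statistics, the sketch below fits a Gumbel (type I extreme value) distribution by the method of moments and reads off an upper quantile. The abstract does not say which extreme value distribution or fitting procedure was used, so both are assumptions here, and the crack depths are made-up numbers in arbitrary units.

```python
import math
import statistics

EULER_GAMMA = 0.5772156649015329


def fit_gumbel_moments(maxima):
    """Method-of-moments estimates (location mu, scale beta) for a Gumbel law."""
    beta = statistics.stdev(maxima) * math.sqrt(6) / math.pi
    mu = statistics.mean(maxima) - EULER_GAMMA * beta
    return mu, beta


def gumbel_cdf(x, mu, beta):
    """Probability that the maximum does not exceed x."""
    return math.exp(-math.exp(-(x - mu) / beta))


def gumbel_quantile(p, mu, beta):
    """Value that the maximum stays below with probability p."""
    return mu - beta * math.log(-math.log(p))


# Hypothetical maximum crack depths, one per corroded sample
max_cracks = [0.41, 0.47, 0.52, 0.55, 0.58, 0.63, 0.70, 0.82]
mu, beta = fit_gumbel_moments(max_cracks)
x95 = gumbel_quantile(0.95, mu, beta)
print(mu, beta, x95)
```

The quantile is the kind of prediction the abstract alludes to: a crack size that the deepest crack on a comparable sample would exceed only 5% of the time, provided the Gumbel assumption holds.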
A New Statistical Method to Determine the Degree of Validity of Health Economic Model Outcomes against Empirical Data.

NARCIS (Netherlands)

Corro Ramos, Isaac; van Voorn, George A K; Vemer, Pepijn; Feenstra, Talitha L; Al, Maiwenn J

2017-01-01

The validation of health economic (HE) model outcomes against empirical data is of key importance. Although statistical testing seems applicable, guidelines for the validation of HE models lack guidance on statistical validation, and actual validation efforts often present subjective judgment of
Basic elements of computational statistics

CERN Document Server

Härdle, Wolfgang Karl; Okhrin, Yarema

2017-01-01

This textbook on computational statistics presents tools and concepts of univariate and multivariate statistical data analysis with a strong focus on applications and implementations in the statistical software R. It covers mathematical, statistical as well as programming problems in computational statistics and contains a wide variety of practical examples. In addition to the numerous R sniplets presented in the text, all computer programs (quantlets) and data sets to the book are available on GitHub and referred to in the book. This enables the reader to fully reproduce as well as modify and adjust all examples to their needs. The book is intended for advanced undergraduate and first-year graduate students as well as for data analysts new to the job who would like a tour of the various statistical tools in a data analysis workshop. The experienced reader with a good knowledge of statistics and programming might skip some sections on univariate models and enjoy the various mathematical roots of multivariate ...
Analysis of radiation monitoring data by distribution-free statistical methods (a case of river system Techa-Iset'-Tobol-Irtysh contamination)

International Nuclear Information System (INIS)

Luneva, K.V.; Kryshev, A.I.; Nikitin, A.I.; Kryshev, I.I.

2010-01-01

The article presents the results of a statistical analysis of radiation monitoring data on contamination of the Techa-Iset'-Tobol-Irtysh river system. The analyzed data and the territory under consideration are briefly described. The distribution-free statistical methods used for the comparative analysis are described, along with the reasons for selecting them and the particulars of their application. A comparative analysis of the data using traditional statistical methods is also presented. A reliable decrease in 90Sr specific activity from one object of the river system to the next was established, which is evidence of radionuclide transport through the Techa-Iset'-Tobol-Irtysh river system.
Statistical methods to detect novel genetic variants using publicly available GWAS summary data.

Science.gov (United States)

Guo, Bin; Wu, Baolin

2018-03-01

We propose statistical methods to detect novel genetic variants using only genome-wide association studies (GWAS) summary data without access to raw genotype and phenotype data. With more and more summary data being posted for public access in the post-GWAS era, the proposed methods are of practical use for identifying additional interesting genetic variants and shedding light on the underlying disease mechanism. We illustrate the utility of our proposed methods with application to GWAS meta-analysis results of fasting glucose from the international MAGIC consortium. We found several novel genome-wide significant loci that are worth further study. Copyright © 2018 Elsevier Ltd. All rights reserved.
Criminal victimization in Ukraine: analysis of statistical data

Directory of Open Access Journals (Sweden)

Serhiy Nezhurbida

2007-12-01

The article is based on the analysis of statistical data provided by law-enforcement, judicial and other bodies of Ukraine. This analysis allows us to give an accurate quantitative picture of the current state of crime victimization in Ukraine and to characterize its basic features (level, rate, structure, dynamics, etc.).
Computational statistics handbook with Matlab

CERN Document Server

Martinez, Wendy L

2007-01-01

Prefaces. Introduction: What Is Computational Statistics?; An Overview of the Book. Probability Concepts: Introduction; Probability; Conditional Probability and Independence; Expectation; Common Distributions. Sampling Concepts: Introduction; Sampling Terminology and Concepts; Sampling Distributions; Parameter Estimation; Empirical Distribution Function. Generating Random Variables: Introduction; General Techniques for Generating Random Variables; Generating Continuous Random Variables; Generating Discrete Random Variables. Exploratory Data Analysis: Introduction; Exploring Univariate Data; Exploring Bivariate and Trivariate Data; Exploring Multidimensional Data. Finding Structure: Introduction; Projecting Data; Principal Component Analysis; Projection Pursuit EDA; Independent Component Analysis; Grand Tour; Nonlinear Dimensionality Reduction. Monte Carlo Methods for Inferential Statistics: Introduction; Classical Inferential Statistics; Monte Carlo Methods for Inferential Statist...
Statistical method evaluation for differentially methylated CpGs in base resolution next-generation DNA sequencing data.

Science.gov (United States)

Zhang, Yun; Baheti, Saurabh; Sun, Zhifu

2018-05-01

High-throughput bisulfite methylation sequencing such as reduced representation bisulfite sequencing (RRBS), Agilent SureSelect Human Methyl-Seq (Methyl-seq) or whole-genome bisulfite sequencing is commonly used for base resolution methylome research. These data are represented either by the ratio of methylated cytosines to total coverage at a CpG site or by the numbers of methylated and unmethylated cytosines. Multiple statistical methods can be used to detect differentially methylated CpGs (DMCs) between conditions, and these methods are often the basis for the next step of differentially methylated region identification. The ratio data can flexibly be fitted with many linear models, whereas the raw count data take coverage information into account. There is an array of options in each data type for DMC detection; however, it is not clear which statistical method is optimal. In this study, we systematically evaluated four statistical methods on methylation ratio data and four methods on count-based data and compared their performance with regard to type I error control, sensitivity and specificity of DMC detection, and computational resource demands using real RRBS data along with simulation. Our results show that the ratio-based tests are generally more conservative (less sensitive) than the count-based tests. However, some count-based methods have high false-positive rates and should be avoided. The beta-binomial model gives a good balance between sensitivity and specificity and is the preferred method. Selection of methods in different settings, signal versus noise, and sample size estimation are also discussed.
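The two data representations contrasted in this abstract, and the beta-binomial model it favors, can be sketched in a few lines. This is an illustration of the general model only, not the software the authors evaluated: the function names and the shape parameters `alpha` and `beta` are hypothetical, and a real analysis would estimate those parameters per condition from replicate samples.

```python
import math


def methylation_ratio(methylated, unmethylated):
    """Ratio representation: methylated reads over total coverage."""
    total = methylated + unmethylated
    if total == 0:
        raise ValueError("no coverage at this CpG site")
    return methylated / total


def log_beta(a, b):
    """Natural log of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def betabinom_logpmf(k, n, alpha, beta):
    """Log-probability of k methylated reads out of n under a beta-binomial."""
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_choose + log_beta(k + alpha, n - k + beta) - log_beta(alpha, beta)


# Same ratio, very different coverage: only the counts keep that information
print(methylation_ratio(8, 2), methylation_ratio(80, 20))
print(betabinom_logpmf(8, 10, 2.0, 3.0), betabinom_logpmf(80, 100, 2.0, 3.0))
```

Both sites have the same ratio, but the count-based likelihood differs between them, which is the sense in which count data account for coverage.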
Statistical dynamic imaging of RI-labeled tracer from list-mode PET data

International Nuclear Information System (INIS)

Tanimoto, Michiaki; Kuroda, Yoshihiro; Oshiro, Osamu; Watabe, Hiroshi; Kuroda, Tomohiro

2009-01-01

Positron emission tomography (PET) can be used in physiological analysis to illustrate physiological states by visualizing the accumulation of radioisotope (RI)-labeled tracer in specific organs or tissues. PET obtains spatio-temporal statistics in the form of list-mode data. However, conventional imaging techniques, which sum up list-mode data over a given time period, cannot depict detailed temporal dynamics of the RI-labeled tracer. In this study, a spatio-temporal analysis approach was employed to visualize the temporal flow dynamics of RI-labeled tracer from the obtained list-mode data. Experiments to assess the visualization of simulated RI-labeled tracer dynamics as well as RI-labeled tracer dynamics in a vascular phantom showed that the proposed method successfully depicted detailed temporal flow dynamics that could not be visualized using conventional methods. (author)
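The limitation of conventional imaging noted in this abstract, summing list-mode events over a fixed period, can be shown with a toy example. The sketch covers only the conventional framing step, not the proposed spatio-temporal method, which the abstract does not detail. Representing an event as a `(timestamp, bin_id)` pair is a simplifying assumption, and the events are made up.

```python
from collections import Counter


def frame_counts(events, frame_length):
    """Count list-mode events per (frame index, bin_id) pair.

    events: iterable of (timestamp_seconds, bin_id) tuples (hypothetical layout).
    """
    counts = Counter()
    for t, bin_id in events:
        counts[(int(t // frame_length), bin_id)] += 1
    return counts


# Made-up events: activity appears in bin "a" first and in bin "b" later
events = [(0.2, "a"), (0.9, "a"), (1.4, "b"), (2.7, "b"), (2.8, "b")]
summed = frame_counts(events, frame_length=3.0)  # one long frame
fine = frame_counts(events, frame_length=1.0)    # three short frames
print(dict(summed))
print(dict(fine))
```

The single long frame reports only totals per bin, while the shorter frames keep the order in which activity reached each bin. That ordering is the temporal information lost by summation.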
