WorldWideScience

Sample records for statistical data

  1. Tuberculosis Data and Statistics

    Science.gov (United States)

    ... Advisory Groups Federal TB Task Force Data and Statistics ... Set) Mortality and Morbidity Weekly Reports Data and Statistics Decrease in Reported Tuberculosis Cases MMWR 2010; 59 ( ...

  2. School Violence: Data & Statistics

    Science.gov (United States)

    ... Social Media Publications Injury Center School Violence: Data & Statistics The first ... Vehicle Safety Traumatic Brain Injury Injury Response Data & Statistics (WISQARS) Funded Programs Press Room Social Media Publications ...

  3. Statistical data analysis handbook

    National Research Council Canada - National Science Library

    Wall, Francis J

    1986-01-01

    It must be emphasized that this is not a textbook on statistics. Instead it is a working tool that presents data analysis in clear, concise terms which can be readily understood even by those without formal training in statistics...

  4. Data and Statistics

    Science.gov (United States)

    ... About Us Information For… Media Policy Makers Data & Statistics Sickle cell ... 1999 through 2002. This drop coincided with the introduction in 2000 of a vaccine that protects against ...

  5. Cancer Data and Statistics Tools

    Science.gov (United States)

    ... Educational Campaigns Initiatives Stay Informed Cancer Data and Statistics Tools Cancer Statistics Tools United States Cancer Statistics: Data Visualizations The ...

  6. Hemophilia Data and Statistics

    Science.gov (United States)

    ... View public health webinars on blood disorders Data & Statistics ... genetic testing is done to diagnose hemophilia before birth. For the one-third ... rates and hospitalization rates for bleeding complications from hemophilia ...

  7. UN Data- Environmental Statistics: Waste

    Data.gov (United States)

    World Wide Human Geography Data Working Group — The Environment Statistics Database contains selected water and waste statistics by country. Statistics on water and waste are based on official statistics supplied...

  8. UN Data: Environment Statistics: Waste

    Data.gov (United States)

    World Wide Human Geography Data Working Group — The Environment Statistics Database contains selected water and waste statistics by country. Statistics on water and waste are based on official statistics supplied...

  9. Beginning statistics with data analysis

    CERN Document Server

    Mosteller, Frederick; Rourke, Robert EK

    2013-01-01

    This introduction to the world of statistics covers exploratory data analysis, methods for collecting data, formal statistical inference, and techniques of regression and analysis of variance. 1983 edition.

  10. Baseline Statistics of Linked Statistical Data

    NARCIS (Netherlands)

    Scharnhorst, Andrea; Meroño-Peñuela, Albert; Guéret, Christophe

    2014-01-01

    We are surrounded by an ever-increasing ocean of information; everybody will agree to that. We build sophisticated strategies to govern this information: design data models, develop infrastructures for data sharing, build tools for data analysis. Statistical datasets curated by National

  11. Muscular Dystrophy: Data and Statistics

    Science.gov (United States)

    ... For… Media Policy Makers MD STARnet Data and Statistics The following data and statistics come from MD STARnet. Data from the MD ...

  12. Data and Statistics: Heart Failure

    Science.gov (United States)

    ... Summary Coverdell Program 2012-2015 State Summaries Data & Statistics Fact Sheets Heart Disease and Stroke Fact Sheets ... Roadmap for State Planning Other Data Resources Other Statistic Resources Grantee Information Cross-Program Information Online Tools ...

  13. Statistical data analysis

    International Nuclear Information System (INIS)

    Hahn, A.A.

    1994-11-01

    The complexity of instrumentation sometimes requires data analysis to be done before the result is presented to the control room. This tutorial reviews some of the theoretical assumptions underlying the more popular forms of data analysis and presents simple examples to illuminate the advantages and hazards of different techniques.

  14. Spina Bifida Data and Statistics

    Science.gov (United States)

    ... Us Information For… Media Policy Makers Data and Statistics Spina bifida ... the spine. Read below for the latest national statistics on spina bifida in the United States. In ...

  15. Birth Defects Data and Statistics

    Science.gov (United States)

    ... Information For… Media Policy Makers Data & Statistics On This ... and critical. Read below for the latest national statistics on the occurrence of birth defects in the ...

  16. Energy statistical data. Europe

    International Nuclear Information System (INIS)

    2002-04-01

    This report summarizes in a series of tables the key energy data of 1999 for 9 European countries (Germany, Belgium, Denmark, Spain, France, Italy, Netherlands, UK, Sweden). Data concern: the energy intensity, the share of renewable energy sources in the total primary consumption, the structure of power production, the CO2 emissions and their structure, and the end-use, primary consumption and energy prices per energy source. (J.S.)

  17. Statistical methods for ranking data

    CERN Document Server

    Alvo, Mayer

    2014-01-01

    This book introduces advanced undergraduate, graduate students and practitioners to statistical methods for ranking data. An important aspect of nonparametric statistics is oriented towards the use of ranking data. Rank correlation is defined through the notion of distance functions and the notion of compatibility is introduced to deal with incomplete data. Ranking data are also modeled using a variety of modern tools such as CART, MCMC, EM algorithm and factor analysis. This book deals with statistical methods used for analyzing such data and provides a novel and unifying approach for hypotheses testing. The techniques described in the book are illustrated with examples and the statistical software is provided on the authors’ website.

  18. Data Literacy is Statistical Literacy

    Science.gov (United States)

    Gould, Robert

    2017-01-01

    Past definitions of statistical literacy should be updated in order to account for the greatly amplified role that data now play in our lives. Experience working with high-school students in an innovative data science curriculum has shown that teaching statistical literacy, augmented by data literacy, can begin early.

  19. Statistical modeling for degradation data

    CERN Document Server

    Lio, Yuhlong; Ng, Hon; Tsai, Tzong-Ru

    2017-01-01

    This book focuses on the statistical aspects of the analysis of degradation data. In recent years, degradation data analysis has come to play an increasingly important role in different disciplines such as reliability, public health sciences, and finance. For example, information on products’ reliability can be obtained by analyzing degradation data. In addition, statistical modeling and inference techniques have been developed on the basis of different degradation measures. The book brings together experts engaged in statistical modeling and inference, presenting and discussing important recent advances in degradation data analysis and related applications. The topics covered are timely and have considerable potential to impact both statistics and reliability engineering.

  20. [Big data in official statistics].

    Science.gov (United States)

    Zwick, Markus

    2015-08-01

    The concept of "big data" stands to change the face of official statistics over the coming years, having an impact on almost all aspects of data production. The tasks of future statisticians will not necessarily be to produce new data, but rather to identify and make use of existing data to adequately describe social and economic phenomena. Until big data can be used correctly in official statistics, a lot of questions need to be answered and problems solved: the quality of data, data protection, privacy, and the sustainable availability are some of the more pressing issues to be addressed. The essential skills of official statisticians will undoubtedly change, and this implies a number of challenges to be faced by statistical education systems, in universities, and inside the statistical offices. The national statistical offices of the European Union have concluded a concrete strategy for exploring the possibilities of big data for official statistics, by means of the Big Data Roadmap and Action Plan 1.0. This is an important first step and will have a significant influence on implementing the concept of big data inside the statistical offices of Germany.

  1. Official statistics and Big Data

    Directory of Open Access Journals (Sweden)

    Peter Struijs

    2014-07-01

    The rise of Big Data changes the context in which organisations producing official statistics operate. Big Data provides opportunities, but in order to make optimal use of Big Data, a number of challenges have to be addressed. This stimulates increased collaboration between National Statistical Institutes, Big Data holders, businesses and universities. In time, this may lead to a shift in the role of statistical institutes in the provision of high-quality and impartial statistical information to society. In this paper, the changes in context, the opportunities, the challenges and the way to collaborate are addressed. The collaboration between the various stakeholders will involve each partner building on and contributing different strengths. For national statistical offices, traditional strengths include, on the one hand, the ability to collect data and combine data sources with statistical products and, on the other hand, their focus on quality, transparency and sound methodology. In the Big Data era of competing and multiplying data sources, they continue to have a unique knowledge of official statistical production methods. And their impartiality and respect for privacy as enshrined in law uniquely position them as a trusted third party. Based on this, they may advise on the quality and validity of information of various sources. By thus positioning themselves, they will be able to play their role as key information providers in a changing society.

  2. Statistical analysis and data management

    International Nuclear Information System (INIS)

    Anon.

    1981-01-01

    This report provides an overview of the history of the WIPP Biology Program. The recommendations of the American Institute of Biological Sciences (AIBS) for the WIPP biology program are summarized. The data sets available for statistical analyses and problems associated with these data sets are also summarized. Biological studies base maps are presented. A statistical model is presented to evaluate any correlation between climatological data and small mammal captures. No statistically significant relationship between variance in small mammal captures on Dr. Gennaro's 90m x 90m grid and precipitation records from the Duval Potash Mine was found.

  3. Statistical Methods for Fuzzy Data

    CERN Document Server

    Viertl, Reinhard

    2011-01-01

    Statistical data are not always precise numbers, or vectors, or categories. Real data are frequently what is called fuzzy. Examples where this fuzziness is obvious are quality of life data, environmental, biological, medical, sociological and economics data. Also the results of measurements can be best described by using fuzzy numbers and fuzzy vectors respectively. Statistical analysis methods have to be adapted for the analysis of fuzzy data. In this book, the foundations of the description of fuzzy data are explained, including methods on how to obtain the characterizing function of fuzzy m

  4. Equivalent statistics and data interpretation.

    Science.gov (United States)

    Francis, Gregory

    2017-08-01

    Recent reform efforts in psychological science have led to a plethora of choices for scientists to analyze their data. A scientist making an inference about their data must now decide whether to report a p value, summarize the data with a standardized effect size and its confidence interval, report a Bayes Factor, or use other model comparison methods. To make good choices among these options, it is necessary for researchers to understand the characteristics of the various statistics used by the different analysis frameworks. Toward that end, this paper makes two contributions. First, it shows that for the case of a two-sample t test with known sample sizes, many different summary statistics are mathematically equivalent in the sense that they are based on the very same information in the data set. When the sample sizes are known, the p value provides as much information about a data set as the confidence interval of Cohen's d or a JZS Bayes factor. Second, this equivalence means that different analysis methods differ only in their interpretation of the empirical data. At first glance, it might seem that mathematical equivalence of the statistics suggests that it does not matter much which statistic is reported, but the opposite is true because the appropriateness of a reported statistic is relative to the inference it promotes. Accordingly, scientists should choose an analysis method appropriate for their scientific investigation. A direct comparison of the different inferential frameworks provides some guidance for scientists to make good choices and improve scientific practice.
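
    The equivalence described in this abstract can be illustrated for one pair of statistics. The following is a minimal sketch, assuming an equal-variance (pooled) two-sample t test; the function names and the sample values are hypothetical and do not come from the paper. It shows only that, once the sample sizes are known, the t statistic and Cohen's d can each be recovered from the other.

```python
import math
from statistics import mean, variance


def two_sample_summary(x, y):
    """Return (t, d) for a pooled-variance two-sample comparison."""
    n1, n2 = len(x), len(y)
    # Pooled variance built from the two unbiased sample variances.
    pooled_var = ((n1 - 1) * variance(x) + (n2 - 1) * variance(y)) / (n1 + n2 - 2)
    pooled_sd = math.sqrt(pooled_var)
    diff = mean(x) - mean(y)
    t = diff / (pooled_sd * math.sqrt(1 / n1 + 1 / n2))
    d = diff / pooled_sd  # Cohen's d with the pooled standard deviation
    return t, d


def d_from_t(t, n1, n2):
    """Recover Cohen's d from t when the sample sizes are known."""
    return t * math.sqrt(1 / n1 + 1 / n2)


def t_from_d(d, n1, n2):
    """Recover t from Cohen's d when the sample sizes are known."""
    return d / math.sqrt(1 / n1 + 1 / n2)


# Hypothetical measurements, for illustration only.
group_a = [5.1, 4.9, 6.2, 5.8, 5.5, 6.0]
group_b = [4.2, 4.8, 5.0, 4.4, 4.9]

t, d = two_sample_summary(group_a, group_b)
print("t =", t, "d =", d)
print("d recovered from t:", d_from_t(t, len(group_a), len(group_b)))
print("t recovered from d:", t_from_d(d, len(group_a), len(group_b)))
```

    Because the conversion is one-to-one, reporting either statistic together with the sample sizes conveys the same information; what differs is the inference each one invites.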

  5. Statistical analysis of environmental data

    International Nuclear Information System (INIS)

    Beauchamp, J.J.; Bowman, K.O.; Miller, F.L. Jr.

    1975-10-01

    This report summarizes the analyses of data obtained by the Radiological Hygiene Branch of the Tennessee Valley Authority from samples taken around the Browns Ferry Nuclear Plant located in Northern Alabama. The data collection was begun in 1968 and a wide variety of types of samples have been gathered on a regular basis. The statistical analysis of environmental data involving very low levels of radioactivity is discussed. Applications of computer calculations for data processing are described.

  6. Statistically significant relational data mining

    Energy Technology Data Exchange (ETDEWEB)

    Berry, Jonathan W.; Leung, Vitus Joseph; Phillips, Cynthia Ann; Pinar, Ali; Robinson, David Gerald; Berger-Wolf, Tanya; Bhowmick, Sanjukta; Casleton, Emily; Kaiser, Mark; Nordman, Daniel J.; Wilson, Alyson G.

    2014-02-01

    This report summarizes the work performed under the project "Statistically significant relational data mining." The goal of the project was to add more statistical rigor to the fairly ad hoc area of data mining on graphs. Our goal was to develop better algorithms and better ways to evaluate algorithm quality. We concentrated on algorithms for community detection, approximate pattern matching, and graph similarity measures. Approximate pattern matching involves finding an instance of a relatively small pattern, expressed with tolerance, in a large graph of data observed with uncertainty. This report gathers the abstracts and references for the eight refereed publications that have appeared as part of this work. We then archive three pieces of research that have not yet been published. The first is theoretical and experimental evidence that a popular statistical measure for comparison of community assignments favors over-resolved communities over approximations to a ground truth. The second is a set of statistically motivated methods for measuring the quality of an approximate match of a small pattern in a large graph. The third is a new probabilistic random graph model. Statisticians favor these models for graph analysis. The new local structure graph model overcomes some of the issues with popular models such as exponential random graph models and latent variable models.

  7. Spatial Statistical Data Fusion (SSDF)

    Science.gov (United States)

    Braverman, Amy J.; Nguyen, Hai M.; Cressie, Noel

    2013-01-01

    As remote sensing for scientific purposes has transitioned from an experimental technology to an operational one, the selection of instruments has become more coordinated, so that the scientific community can exploit complementary measurements. However, technological and scientific heterogeneity across devices means that the statistical characteristics of the data they collect are different. The challenge addressed here is how to combine heterogeneous remote sensing data sets in a way that yields optimal statistical estimates of the underlying geophysical field, and provides rigorous uncertainty measures for those estimates. Different remote sensing data sets may have different spatial resolutions, different measurement error biases and variances, and other disparate characteristics. A state-of-the-art spatial statistical model was used to relate the true, but not directly observed, geophysical field to noisy, spatial aggregates observed by remote sensing instruments. The spatial covariances of the true field and the covariances of the true field with the observations were modeled. The observations are spatial averages of the true field values, over pixels, with different measurement noise superimposed. A kriging framework is used to infer optimal (minimum mean squared error and unbiased) estimates of the true field at point locations from pixel-level, noisy observations. A key feature of the spatial statistical model is the spatial mixed effects model that underlies it. The approach models the spatial covariance function of the underlying field using linear combinations of basis functions of fixed size. Approaches based on kriging require the inversion of very large spatial covariance matrices, and this is usually done by making simplifying assumptions about spatial covariance structure that simply do not hold for geophysical variables. In contrast, this method does not require these assumptions, and is also computationally much faster. This method is

  8. Statistical processing of experimental data

    OpenAIRE

    NAVRÁTIL, Pavel

    2012-01-01

    This thesis covers the theory of probability and statistical sets. It includes solved and unsolved problems on probability, random variables and their distributions, random vectors, statistical sets, and regression and correlation analysis. Solutions to the unsolved problems are included.

  9. Does environmental data collection need statistics?

    NARCIS (Netherlands)

    Pulles, M.P.J.

    1998-01-01

    The term 'statistics' with reference to environmental science and policymaking might mean different things: the development of statistical methodology, the methodology developed by statisticians to interpret and analyse such data, or the statistical data that are needed to understand environmental

  10. Statistical analysis of management data

    CERN Document Server

    Gatignon, Hubert

    2013-01-01

    This book offers a comprehensive approach to multivariate statistical analyses. It provides theoretical knowledge of the concepts underlying the most important multivariate techniques and an overview of actual applications.

  11. Statistical interpretation of geochemical data

    International Nuclear Information System (INIS)

    Carambula, M.

    1990-01-01

    Statistical results were obtained from geochemical research covering the following four aerial photographs: Zapican, Carape, Las Canias, Alferez. In total, 3020 samples were analyzed for 22 chemical elements using plasma emission spectrometry methods.

  12. Statistical data analysis using SAS intermediate statistical methods

    CERN Document Server

    Marasinghe, Mervyn G

    2018-01-01

    The aim of this textbook (previously titled SAS for Data Analytics) is to teach the use of SAS for statistical analysis of data for advanced undergraduate and graduate students in statistics, data science, and disciplines involving analyzing data. The book begins with an introduction beyond the basics of SAS, illustrated with non-trivial, real-world, worked examples. It proceeds to SAS programming and applications, SAS graphics, statistical analysis of regression models, analysis of variance models, analysis of variance with random and mixed effects models, and then takes the discussion beyond regression and analysis of variance to conclude. Pedagogically, the authors introduce theory and methodological basis topic by topic, present a problem as an application, followed by a SAS analysis of the data provided and a discussion of results. The text focuses on applied statistical problems and methods. Key features include: end of chapter exercises, downloadable SAS code and data sets, and advanced material suitab...

  13. Data and Statistics: Women and Heart Disease

    Science.gov (United States)

    ... Summary Coverdell Program 2012-2015 State Summaries Data & Statistics Fact Sheets Heart Disease and Stroke Fact Sheets ... Roadmap for State Planning Other Data Resources Other Statistic Resources Grantee Information Cross-Program Information Online Tools ...

  14. DATA ON YOUTH, 1967, A STATISTICAL DOCUMENT.

    Science.gov (United States)

    SCHEIDER, GEORGE

    The data in this report are statistics on youth throughout the United States and in New York State. Included are data on population, school statistics, employment, family income, juvenile delinquency and youth crime (including New York City figures), and traffic accidents. The statistics are presented in the text and in tables and charts. (NH)

  15. Advanced statistical methods in data science

    CERN Document Server

    Chen, Jiahua; Lu, Xuewen; Yi, Grace; Yu, Hao

    2016-01-01

    This book gathers invited presentations from the 2nd Symposium of the ICSA-CANADA Chapter held at the University of Calgary from August 4-6, 2015. The aim of this Symposium was to promote advanced statistical methods in big-data sciences and to allow researchers to exchange ideas on statistics and data science and to embrace the challenges and opportunities of statistics and data science in the modern world. It addresses diverse themes in advanced statistical analysis in big-data sciences, including methods for administrative data analysis, survival data analysis, missing data analysis, high-dimensional and genetic data analysis, longitudinal and functional data analysis, the design and analysis of studies with response-dependent and multi-phase designs, time series and robust statistics, statistical inference based on likelihood, empirical likelihood and estimating functions. The editorial group selected 14 high-quality presentations from this successful symposium and invited the presenters to prepare a fu...

  16. Statistical methods for astronomical data analysis

    CERN Document Server

    Chattopadhyay, Asis Kumar

    2014-01-01

    This book introduces “Astrostatistics” as a subject in its own right with rewarding examples, including work by the authors with galaxy and Gamma Ray Burst data to engage the reader. This includes a comprehensive blending of Astrophysics and Statistics. The first chapter’s coverage of preliminary concepts and terminologies for astronomical phenomena will appeal to both Statistics and Astrophysics readers as helpful context. Statistics concepts covered in the book provide a methodological framework. A unique feature is the inclusion of different possible sources of astronomical data, as well as software packages for converting the raw data into appropriate forms for data analysis. Readers can then use the appropriate statistical packages for their particular data analysis needs. The ideas of statistical inference discussed in the book help readers determine how to apply statistical tests. The authors cover different applications of statistical techniques already developed or specifically introduced for ...

  17. Statistical Models and Methods for Lifetime Data

    CERN Document Server

    Lawless, Jerald F

    2011-01-01

    Praise for the First Edition:
    "An indispensable addition to any serious collection on lifetime data analysis and . . . a valuable contribution to the statistical literature. Highly recommended . . ." -Choice
    "This is an important book, which will appeal to statisticians working on survival analysis problems." -Biometrics
    "A thorough, unified treatment of statistical models and methods used in the analysis of lifetime data . . . this is a highly competent and agreeable statistical textbook." -Statistics in Medicine
    The statistical analysis of lifetime or response time data is a key tool in engineering,

  18. Powerful Statistical Inference for Nested Data Using Sufficient Summary Statistics

    Science.gov (United States)

    Dowding, Irene; Haufe, Stefan

    2018-01-01

    Hierarchically-organized data arise naturally in many psychology and neuroscience studies. As the standard assumption of independent and identically distributed samples does not hold for such data, two important problems are to accurately estimate group-level effect sizes, and to obtain powerful statistical tests against group-level null hypotheses. A common approach is to summarize subject-level data by a single quantity per subject, which is often the mean or the difference between class means, and treat these as samples in a group-level t-test. This “naive” approach is, however, suboptimal in terms of statistical power, as it ignores information about the intra-subject variance. To address this issue, we review several approaches to deal with nested data, with a focus on methods that are easy to implement. With what we call the sufficient-summary-statistic approach, we highlight a computationally efficient technique that can improve statistical power by taking into account within-subject variances, and we provide step-by-step instructions on how to apply this approach to a number of frequently-used measures of effect size. The properties of the reviewed approaches and the potential benefits over a group-level t-test are quantitatively assessed on simulated data and demonstrated on EEG data from a simulated-driving experiment. PMID:29615885
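
    The idea of taking within-subject variances into account can be sketched with inverse-variance weighting. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the effect of interest is each subject's mean, that all subjects share one common true effect (a fixed-effect model), and that the sample values below are hypothetical. The function names are likewise invented for this sketch.

```python
import math
from statistics import mean, variance


def subject_summary(samples):
    """Return a subject's effect estimate (mean) and the variance of that estimate."""
    return mean(samples), variance(samples) / len(samples)


def inverse_variance_pool(summaries):
    """Pool (effect, variance) pairs with weights 1/variance.

    Returns the pooled effect and a z statistic against a null effect of zero.
    """
    weights = [1.0 / var for _, var in summaries]
    total = sum(weights)
    pooled = sum(w * effect for w, (effect, _) in zip(weights, summaries)) / total
    pooled_se = math.sqrt(1.0 / total)
    return pooled, pooled / pooled_se


# Hypothetical per-subject measurements; the second subject is much noisier.
subjects = [
    [0.9, 1.1, 1.0, 1.2, 0.8],
    [2.5, -1.5, 3.0, -0.5, 1.0, 0.5],
    [1.1, 0.9, 1.3, 0.7],
]

summaries = [subject_summary(s) for s in subjects]
effects = [effect for effect, _ in summaries]

naive = mean(effects)  # the "naive" approach: every subject counts equally
pooled, z = inverse_variance_pool(summaries)

print("naive group mean:", naive)
print("inverse-variance pooled effect:", pooled)
print("z statistic:", z)
```

    In this sketch the noisy subject receives a small weight, so the pooled estimate stays close to the two precise subjects, whereas the naive mean is pulled toward the noisy one.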

  19. Statistical Literacy: Data Tell a Story

    Science.gov (United States)

    Sole, Marla A.

    2016-01-01

    Every day, students collect, organize, and analyze data to make decisions. In this data-driven world, people need to assess how much trust they can place in summary statistics. The results of every survey and the safety of every drug that undergoes a clinical trial depend on the correct application of appropriate statistics. Recognizing the…

  20. Tourette Syndrome (TS): Data and Statistics

    Science.gov (United States)

    ... Information For… Media Policy Makers Data & Statistics * The data ... Behavioral or conduct problems, 26%; Anxiety problems, 49%; Depression, 25%; Autism spectrum disorder, 35%; Learning disability, 47%; ...

  1. Collecting operational event data for statistical analysis

    International Nuclear Information System (INIS)

    Atwood, C.L.

    1994-09-01

    This report gives guidance for collecting operational data to be used for statistical analysis, especially analysis of event counts. It discusses how to define the purpose of the study, the unit (system, component, etc.) to be studied, events to be counted, and demand or exposure time. Examples are given of classification systems for events in the data sources. A checklist summarizes the essential steps in data collection for statistical analysis.
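
    The quantities named in this abstract (unit, events counted, demand or exposure time) map naturally onto a small record structure. The sketch below is an assumption-laden illustration, not taken from the report: the record fields, unit names, and counts are hypothetical, and the simple ratio estimates (events per exposure hour, failures per demand) are the usual point estimates rather than anything the report prescribes.

```python
from dataclasses import dataclass


@dataclass
class EventRecord:
    unit: str                    # system or component being studied
    events: int                  # number of counted events
    exposure_hours: float = 0.0  # for time-related events
    demands: int = 0             # for demand-related events


def rate_per_hour(records):
    """Point estimate of the event rate: total events / total exposure time."""
    exposure = sum(r.exposure_hours for r in records)
    if exposure <= 0:
        raise ValueError("no exposure time recorded")
    return sum(r.events for r in records) / exposure


def probability_per_demand(records):
    """Point estimate of failure on demand: total events / total demands."""
    demands = sum(r.demands for r in records)
    if demands <= 0:
        raise ValueError("no demands recorded")
    return sum(r.events for r in records) / demands


# Hypothetical data, for illustration only.
running = [
    EventRecord("pump A", events=2, exposure_hours=1000.0),
    EventRecord("pump B", events=1, exposure_hours=3000.0),
]
standby = [
    EventRecord("valve A", events=1, demands=50),
    EventRecord("valve B", events=3, demands=150),
]

print("events per hour:", rate_per_hour(running))
print("failures per demand:", probability_per_demand(standby))
```

    Keeping exposure time and demands as separate fields forces the analyst to decide, per event type, which denominator applies before any counts are pooled.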

  2. Statistical data filtration in neutron coincidence counting

    International Nuclear Information System (INIS)

    Beddingfield, D.H.; Menlove, H.O.

    1992-11-01

    We assessed the effectiveness of statistical data filtration to minimize the contribution of matrix materials in 200-L drums to the nondestructive assay of plutonium. These matrices were examined: polyethylene, concrete, aluminum, iron, cadmium, and lead. Statistical filtration of neutron coincidence data improved the low-end sensitivity of coincidence counters. Spurious data arising from electrical noise, matrix spallation, and geometric effects were smoothed in a predictable fashion by the statistical filter. The filter effectively lowers the minimum detectable mass limit that can be achieved for plutonium assay using passive neutron coincidence counting.

  3. Statistical data fusion for cross-tabulation

    NARCIS (Netherlands)

    Kamakura, W.A.; Wedel, M.

    The authors address the situation in which a researcher wants to cross-tabulate two sets of discrete variables collected in independent samples, but a subset of the variables is common to both samples. The authors propose a statistical data-fusion model that allows for statistical tests of

  4. Statistical Literacy in the Data Science Workplace

    Science.gov (United States)

    Grant, Robert

    2017-01-01

    Statistical literacy, the ability to understand and make use of statistical information including methods, has particular relevance in the age of data science, when complex analyses are undertaken by teams from diverse backgrounds. Not only is it essential to communicate to the consumers of information but also within the team. Writing from the…

  5. Workshop statistics discovery with data and Minitab

    CERN Document Server

    Rossman, Allan J

    1998-01-01

    Shorn of all subtlety and led naked out of the protective fold of educational research literature, there comes a sheepish little fact: lectures don't work nearly as well as many of us would like to think. -George Cobb (1992) This book contains activities that guide students to discover statistical concepts, explore statistical principles, and apply statistical techniques. Students work toward these goals through the analysis of genuine data and through interaction with one another, with their instructor, and with technology. Providing a one-semester introduction to fundamental ideas of statistics for college and advanced high school students, Workshop Statistics is designed for courses that employ an interactive learning environment by replacing lectures with hands-on activities. The text contains enough expository material to stand alone, but it can also be used to supplement a more traditional textbook. Some distinguishing features of Workshop Statistics are its emphases on active learning, conceptu...

  6. Topology for statistical modeling of petascale data.

    Energy Technology Data Exchange (ETDEWEB)

    Pascucci, Valerio (University of Utah, Salt Lake City, UT); Mascarenhas, Ajith Arthur; Rusek, Korben (Texas A&M University, College Station, TX); Bennett, Janine Camille; Levine, Joshua (University of Utah, Salt Lake City, UT); Pebay, Philippe Pierre; Gyulassy, Attila (University of Utah, Salt Lake City, UT); Thompson, David C.; Rojas, Joseph Maurice (Texas A&M University, College Station, TX)

    2011-07-01

    This document presents current technical progress and dissemination of results for the Mathematics for Analysis of Petascale Data (MAPD) project titled 'Topology for Statistical Modeling of Petascale Data', funded by the Office of Science Advanced Scientific Computing Research (ASCR) Applied Math program. Many commonly used algorithms for mathematical analysis do not scale well enough to accommodate the size or complexity of petascale data produced by computational simulations. The primary goal of this project is thus to develop new mathematical tools that address both the petascale size and uncertain nature of current data. At a high level, our approach is based on the complementary techniques of combinatorial topology and statistical modeling. In particular, we use combinatorial topology to filter out spurious data that would otherwise skew statistical modeling techniques, and we employ advanced algorithms from algebraic statistics to efficiently find globally optimal fits to statistical models. This document summarizes the technical advances we have made to date that were made possible in whole or in part by MAPD funding. These technical contributions can be divided loosely into three categories: (1) advances in the field of combinatorial topology, (2) advances in statistical modeling, and (3) new integrated topological and statistical methods.

  7. Classification, (big) data analysis and statistical learning

    CERN Document Server

    Conversano, Claudio; Vichi, Maurizio

    2018-01-01

    This edited book focuses on the latest developments in classification, statistical learning, data analysis and related areas of data science, including statistical analysis of large datasets, big data analytics, time series clustering, integration of data from different sources, as well as social networks. It covers both methodological aspects as well as applications to a wide range of areas such as economics, marketing, education, social sciences, medicine, environmental sciences and the pharmaceutical industry. In addition, it describes the basic features of the software behind the data analysis results, and provides links to the corresponding codes and data sets where necessary. This book is intended for researchers and practitioners who are interested in the latest developments and applications in the field. The peer-reviewed contributions were presented at the 10th Scientific Meeting of the Classification and Data Analysis Group (CLADAG) of the Italian Statistical Society, held in Santa Margherita di Pul...

  8. Hearing Loss in Children: Data and Statistics

    Science.gov (United States)

    ... 5 Chapter 6 EHDI-IS Functional Standards EHDI Electronic Health Records EHDI Data Analysis and Statistical Hub (DASH) Articles & ...

  9. Challenges in computational statistics and data mining

    CERN Document Server

    Mielniczuk, Jan

    2016-01-01

    This volume contains nineteen research papers belonging to the areas of computational statistics, data mining, and their applications. Those papers, all written specifically for this volume, are their authors’ contributions to honour and celebrate Professor Jacek Koronacki on the occasion of his 70th birthday. The book’s related and often interconnected topics represent Jacek Koronacki’s research interests and their evolution. They also clearly indicate how close the areas of computational statistics and data mining are.

  10. Advances in statistical models for data analysis

    CERN Document Server

    Minerva, Tommaso; Vichi, Maurizio

    2015-01-01

    This edited volume focuses on recent research results in classification, multivariate statistics and machine learning and highlights advances in statistical models for data analysis. The volume provides both methodological developments and contributions to a wide range of application areas such as economics, marketing, education, social sciences and environment. The papers in this volume were first presented at the 9th biannual meeting of the Classification and Data Analysis Group (CLADAG) of the Italian Statistical Society, held in September 2013 at the University of Modena and Reggio Emilia, Italy.

  11. Statistical treatment of fatigue test data

    International Nuclear Information System (INIS)

    Raske, D.T.

    1980-01-01

    This report discusses several aspects of fatigue data analysis in order to provide a basis for the development of statistically sound design curves. Included is a discussion on the choice of the dependent variable, the assumptions associated with least squares regression models, the variability of fatigue data, the treatment of data from suspended tests and outlying observations, and various strain-life relations.

  12. Statistical analysis of next generation sequencing data

    CERN Document Server

    Nettleton, Dan

    2014-01-01

    Next Generation Sequencing (NGS) is the latest high throughput technology to revolutionize genomic research. NGS generates massive genomic datasets that play a key role in the big data phenomenon that surrounds us today. To extract signals from high-dimensional NGS data and make valid statistical inferences and predictions, novel data analytic and statistical techniques are needed. This book contains 20 chapters written by prominent statisticians working with NGS data. The topics range from basic preprocessing and analysis with NGS data to more complex genomic applications such as copy number variation and isoform expression detection. Research statisticians who want to learn about this growing and exciting area will find this book useful. In addition, many chapters from this book could be included in graduate-level classes in statistical bioinformatics for training future biostatisticians who will be expected to deal with genomic data in basic biomedical research, genomic clinical trials and personalized med...

  13. Statistical Inference for Data Adaptive Target Parameters.

    Science.gov (United States)

    Hubbard, Alan E; Kherad-Pajouh, Sara; van der Laan, Mark J

    2016-05-01

    Suppose one observes n i.i.d. copies of a random variable with a probability distribution that is known to be an element of a particular statistical model. In order to define our statistical target we partition the sample into V equal-size sub-samples, and use this partitioning to define V splits into an estimation sample (one of the V sub-samples) and a corresponding complementary parameter-generating sample. For each of the V parameter-generating samples, we apply an algorithm that maps the sample to a statistical target parameter. We define our sample-split data adaptive statistical target parameter as the average of these V sample-specific target parameters. We present an estimator (and corresponding central limit theorem) of this type of data adaptive target parameter. This general methodology for generating data adaptive target parameters is demonstrated with a number of practical examples that highlight new opportunities for statistical learning from data. This new framework provides a rigorous statistical methodology for both exploratory and confirmatory analysis within the same data. Given that more research is becoming "data-driven", the theory developed within this paper provides a new impetus for a greater involvement of statistical inference in problems that are being increasingly addressed by clever, yet ad hoc pattern finding methods. To suggest such potential, and to verify the predictions of the theory, extensive simulation studies, along with a data analysis based on adaptively determined intervention rules, are shown and give insight into how to structure such an approach. The results show that the data adaptive target parameter approach provides a general framework and resulting methodology for data-driven science.
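
    The sample-splitting scheme described in this abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' estimator: the function names, the toy "pick the column with the largest mean" rule, and the simulated data are all hypothetical, and no central limit theorem or inference step is implemented.

```python
import random
from statistics import mean


def sample_split_estimate(rows, n_folds, choose_target, estimate):
    """Average the fold-specific estimates of a data-adaptively chosen target.

    For each fold v, the target is chosen on the complement of fold v
    (the parameter-generating sample) and estimated on fold v itself
    (the estimation sample).
    """
    folds = [rows[v::n_folds] for v in range(n_folds)]
    estimates = []
    for v in range(n_folds):
        estimation = folds[v]
        generating = [r for u, fold in enumerate(folds) if u != v for r in fold]
        target = choose_target(generating)  # never sees the estimation fold
        estimates.append(estimate(estimation, target))
    return mean(estimates)


def choose_largest_mean_column(rows):
    """Hypothetical adaptive rule: index of the column with the largest mean."""
    n_cols = len(rows[0])
    return max(range(n_cols), key=lambda j: mean(r[j] for r in rows))


def column_mean(rows, j):
    """Plug-in estimate of the chosen target: the mean of column j."""
    return mean(r[j] for r in rows)


# Simulated data (an assumption for illustration): three noisy columns.
rng = random.Random(0)
data = [(rng.gauss(0.0, 1.0), rng.gauss(0.5, 1.0), rng.gauss(0.2, 1.0))
        for _ in range(300)]
psi_hat = sample_split_estimate(data, 5, choose_largest_mean_column, column_mean)
```

    The point of the design is that the rule defining the target and the data used to estimate it never overlap within a fold, which is what allows exploratory and confirmatory steps to share one data set.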

  14. Statistics and analysis of scientific data

    CERN Document Server

    Bonamente, Massimiliano

    2013-01-01

    Statistics and Analysis of Scientific Data covers the foundations of probability theory and statistics, and a number of numerical and analytical methods that are essential for the present-day analyst of scientific data. Topics covered include probability theory, distribution functions of statistics, fits to two-dimensional datasets and parameter estimation, Monte Carlo methods and Markov chains. Equal attention is paid to the theory and its practical application, and results from classic experiments in various fields are used to illustrate the importance of statistics in the analysis of scientific data. The main pedagogical method is a theory-then-application approach, where emphasis is placed first on a sound understanding of the underlying theory of a topic, which becomes the basis for an efficient and proactive use of the material for practical applications. The level is appropriate for undergraduates and beginning graduate students, and as a reference for the experienced researcher. Basic calculus is us...

  15. Statistics and Data Interpretation for Social Work

    CERN Document Server

    Rosenthal, James

    2011-01-01

    "Without question, this text will be the most authoritative source of information on statistics in the human services. From my point of view, it is a definitive work that combines a rigorous pedagogy with a down to earth (commonsense) exploration of the complex and difficult issues in data analysis (statistics) and interpretation. I welcome its publication." -Praise for the First Edition. Written by a social worker for social work students, this is a nuts and bolts guide to statistics that presents complex calculations and concepts in clear, easy-to-understand language. It includes

  16. Statistical Methods for Unusual Count Data

    DEFF Research Database (Denmark)

    Guthrie, Katherine A.; Gammill, Hilary S.; Kamper-Jørgensen, Mads

    2016-01-01

    microchimerism data present challenges for statistical analysis, including a skewed distribution, excess zero values, and occasional large values. Methods for comparing microchimerism levels across groups while controlling for covariates are not well established. We compared statistical models for quantitative...... microchimerism values, applied to simulated data sets and 2 observed data sets, to make recommendations for analytic practice. Modeling the level of quantitative microchimerism as a rate via Poisson or negative binomial model with the rate of detection defined as a count of microchimerism genome equivalents per...

  17. Topology for Statistical Modeling of Petascale Data

    Energy Technology Data Exchange (ETDEWEB)

    Bennett, Janine Camille [Sandia National Lab. (SNL-CA), Livermore, CA (United States); Pebay, Philippe Pierre [Sandia National Lab. (SNL-CA), Livermore, CA (United States); Pascucci, Valerio [Univ. of Utah, Salt Lake City, UT (United States); Levine, Joshua [Univ. of Utah, Salt Lake City, UT (United States); Gyulassy, Attila [Univ. of Utah, Salt Lake City, UT (United States); Rojas, Maurice [Texas A & M Univ., College Station, TX (United States)

    2014-07-01

    This document presents current technical progress and dissemination of results for the Mathematics for Analysis of Petascale Data (MAPD) project titled "Topology for Statistical Modeling of Petascale Data", funded by the Office of Science Advanced Scientific Computing Research (ASCR) Applied Math program.

  18. A Statistical Toolkit for Data Analysis

    International Nuclear Information System (INIS)

    Donadio, S.; Guatelli, S.; Mascialino, B.; Pfeiffer, A.; Pia, M.G.; Ribon, A.; Viarengo, P.

    2006-01-01

    The present project aims to develop an open-source and object-oriented software Toolkit for statistical data analysis. Its statistical testing component contains a variety of Goodness-of-Fit tests, from Chi-squared to Kolmogorov-Smirnov, to lesser-known but generally much more powerful tests such as Anderson-Darling, Goodman, Fisz-Cramer-von Mises, Kuiper, Tiku. Thanks to the component-based design and the usage of the standard abstract interfaces for data analysis, this tool can be used by other data analysis systems or integrated in experimental software frameworks. This Toolkit has been released and is downloadable from the web. In this paper we describe the statistical details of the algorithms, the computational features of the Toolkit and the code validation.
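
    As a rough illustration of the kind of goodness-of-fit test listed in this abstract, the one-sample Kolmogorov-Smirnov statistic can be computed as in the sketch below. The function names are hypothetical and this is not the Toolkit's interface; only the distance D is computed, with no p-value.

```python
import math


def ks_statistic(sample, cdf):
    """One-sample Kolmogorov-Smirnov distance D = sup |F_n(x) - F(x)|.

    The empirical CDF F_n jumps at each sorted observation, so the
    supremum is attained just before or just after one of the jumps.
    """
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        d = max(d, i / n - f, f - (i - 1) / n)
    return d


def standard_normal_cdf(x):
    """CDF of the standard normal distribution, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def uniform_cdf(x):
    """CDF of the uniform distribution on [0, 1]."""
    return min(1.0, max(0.0, x))


# Distance between a small made-up sample and the standard normal CDF.
d_normal = ks_statistic([-1.2, -0.4, 0.1, 0.3, 0.9, 1.7], standard_normal_cdf)
```

    Tests such as Anderson-Darling or Cramer-von Mises replace the supremum with a weighted integral of the squared difference, which is one reason they can be more sensitive in the tails.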

  19. Statistical Analysis of Big Data on Pharmacogenomics

    Science.gov (United States)

    Fan, Jianqing; Liu, Han

    2013-01-01

    This paper discusses statistical methods for estimating complex correlation structure from large pharmacogenomic datasets. We selectively review several prominent statistical methods for estimating large covariance matrices for understanding correlation structure, inverse covariance matrices for network modeling, large-scale simultaneous tests for selecting significantly differently expressed genes and proteins and genetic markers for complex diseases, and high dimensional variable selection for identifying important molecules for understanding molecular mechanisms in pharmacogenomics. Their applications to gene network estimation and biomarker selection are used to illustrate the methodological power. Several new challenges of big data analysis, including complex data distribution, missing data, measurement error, spurious correlation, endogeneity, and the need for robust statistical methods, are also discussed. PMID:23602905

  20. Statistics and analysis of scientific data

    CERN Document Server

    Bonamente, Massimiliano

    2017-01-01

    The revised second edition of this textbook provides the reader with a solid foundation in probability theory and statistics as applied to the physical sciences, engineering and related fields. It covers a broad range of numerical and analytical methods that are essential for the correct analysis of scientific data, including probability theory, distribution functions of statistics, fits to two-dimensional data and parameter estimation, Monte Carlo methods and Markov chains. Features new to this edition include: • a discussion of statistical techniques employed in business science, such as multiple regression analysis of multivariate datasets. • a new chapter on the various measures of the mean including logarithmic averages. • new chapters on systematic errors and intrinsic scatter, and on the fitting of data with bivariate errors. • a new case study and additional worked examples. • mathematical derivations and theoretical background material have been appropriately marked, to improve the readabili...

  1. Topology for Statistical Modeling of Petascale Data

    Energy Technology Data Exchange (ETDEWEB)

    Pascucci, Valerio [Univ. of Utah, Salt Lake City, UT (United States); Levine, Joshua [Univ. of Utah, Salt Lake City, UT (United States); Gyulassy, Attila [Univ. of Utah, Salt Lake City, UT (United States); Bremer, P. -T. [Univ. of Utah, Salt Lake City, UT (United States)

    2013-10-31

    Many commonly used algorithms for mathematical analysis do not scale well enough to accommodate the size or complexity of petascale data produced by computational simulations. The primary goal of this project is to develop new mathematical tools that address both the petascale size and uncertain nature of current data. At a high level, the approach of the entire team involving all three institutions is based on the complementary techniques of combinatorial topology and statistical modelling. In particular, we use combinatorial topology to filter out spurious data that would otherwise skew statistical modelling techniques, and we employ advanced algorithms from algebraic statistics to efficiently find globally optimal fits to statistical models. The overall technical contributions can be divided loosely into three categories: (1) advances in the field of combinatorial topology, (2) advances in statistical modelling, and (3) new integrated topological and statistical methods. Roughly speaking, the division of labor between our 3 groups (Sandia Labs in Livermore, Texas A&M in College Station, and U Utah in Salt Lake City) is as follows: the Sandia group focuses on statistical methods and their formulation in algebraic terms, and finds the application problems (and data sets) most relevant to this project, the Texas A&M Group develops new algebraic geometry algorithms, in particular with fewnomial theory, and the Utah group develops new algorithms in computational topology via Discrete Morse Theory. However, we hasten to point out that our three groups stay in tight contact via videoconference every 2 weeks, so there is much synergy of ideas between the groups. The remainder of this document is focused on the contributions that had greater direct involvement from the team at the University of Utah in Salt Lake City.

  2. Gas, electricity, coal: 1998 statistical data

    International Nuclear Information System (INIS)

    1999-01-01

    This document brings together the main statistical data from the French direction of gas, electricity and coal and presents a selection of the most significant numbered data: origin of production, share of the consumption, price levels, resources-employment status. These data are presented in a synthetic and accessible way in order to make useful references for the actors of the energy sector. (J.S.)

  3. Register-based statistics statistical methods for administrative data

    CERN Document Server

    Wallgren, Anders

    2014-01-01

    This book provides a comprehensive and up-to-date treatment of theory and practical implementation in Register-based statistics. It begins by defining the area, before explaining how to structure such systems, as well as detailing alternative approaches. It explains how to create statistical registers, how to implement quality assurance, and the use of IT systems for register-based statistics. Further to this, clear details are given about the practicalities of implementing such statistical methods, such as protection of privacy and the coordination and coherence of such an undertaking. Thi

  4. Statistical processing of technological and radiochemical data

    International Nuclear Information System (INIS)

    Lahodova, Zdena; Vonkova, Kateřina

    2011-01-01

    The project described in this article had two goals. The main goal was to compare technological and radiochemical data from two units of nuclear power plant. The other goal was to check the collection, organization and interpretation of routinely measured data. Monitoring of analytical and radiochemical data is a very valuable source of knowledge for some processes in the primary circuit. Exploratory analysis of one-dimensional data was performed to estimate location and variability and to find extreme values, data trends, distribution, autocorrelation etc. This process allowed for the cleaning and completion of raw data. Then multiple analyses such as multiple comparisons, multiple correlation, variance analysis, and so on were performed. Measured data was organized into a data matrix. The results and graphs such as Box plots, Mahalanobis distance, Biplot, Correlation, and Trend graphs are presented in this article as statistical analysis tools. Tables of data were replaced with graphs because graphs condense large amounts of information into easy-to-understand formats. The significant conclusion of this work is that the collection and comprehension of data is a very substantial part of statistical processing. With well-prepared and well-understood data, its accurate evaluation is possible. Cooperation between the technicians who collect data and the statistician who processes it is also very important. (author)

  5. Performing Inferential Statistics Prior to Data Collection

    Science.gov (United States)

    Trafimow, David; MacDonald, Justin A.

    2017-01-01

    Typically, in education and psychology research, the investigator collects data and subsequently performs descriptive and inferential statistics. For example, a researcher might compute group means and use the null hypothesis significance testing procedure to draw conclusions about the populations from which the groups were drawn. We propose an…

  6. Statistical Data Editing in Scientific Articles.

    Science.gov (United States)

    Habibzadeh, Farrokh

    2017-07-01

    Scientific journals are important scholarly forums for sharing research findings. Editors have important roles in safeguarding standards of scientific publication and should be familiar with correct presentation of results, among other core competencies. Editors do not have access to the raw data and should thus rely on clues in the submitted manuscripts. To identify probable errors, they should look for inconsistencies in presented results. Common statistical problems that can be picked up by a knowledgeable manuscript editor are discussed in this article. Manuscripts should contain a detailed section on statistical analyses of the data. Numbers should be reported with appropriate precision. Standard error of the mean (SEM) should not be reported as an index of data dispersion. Mean (standard deviation [SD]) and median (interquartile range [IQR]) should be used for description of normally and non-normally distributed data, respectively. If possible, it is better to report the 95% confidence interval (CI) for statistics, at least for main outcome variables. P values should be presented, and interpreted with caution, if there is a hypothesis. To advance the knowledge and skills of their members, associations of journal editors would do well to develop training courses on basic statistics and research methodology for non-experts. This would in turn improve research reporting and safeguard the body of scientific evidence. © 2017 The Korean Academy of Medical Sciences.
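
    The reporting quantities named in this abstract (mean with SD, median with IQR, SEM, and a 95% CI) can be computed as in the following sketch. The function name and the example values are hypothetical. The interval uses the normal approximation with the factor 1.96, which is an assumption made here for simplicity; for small samples a t quantile would usually be preferred.

```python
import math
from statistics import mean, median, quantiles, stdev


def summarize(values):
    """Return the summary statistics an editor might expect to see reported."""
    n = len(values)
    m = mean(values)
    sd = stdev(values)           # sample SD: describes dispersion of the data
    sem = sd / math.sqrt(n)      # SEM: precision of the mean, not dispersion
    q1, _, q3 = quantiles(values, n=4)  # quartiles (default "exclusive" method)
    half_width = 1.96 * sem      # normal approximation (assumption)
    return {
        "mean": m,
        "sd": sd,
        "sem": sem,
        "median": median(values),
        "iqr": (q1, q3),
        "ci95": (m - half_width, m + half_width),
    }


# Made-up measurements, for illustration only.
summary = summarize([1, 2, 3, 4, 5, 6, 7, 8, 9])
```

    Because the SEM shrinks as n grows while the SD does not, reporting the SEM in place of the SD makes data look less variable than they are, which is the concern the abstract raises.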

  7. Vapor Pressure Data Analysis and Statistics

    Science.gov (United States)

    2016-12-01

    ... near 8, 2000, and 200, respectively. The A (or a) value is directly related to vapor pressure and will be greater for high vapor pressure materials ... (10) where n is the number of data points, Yi is the natural logarithm of the i-th experimental vapor pressure value, and Xi is the ... Vapor Pressure Data Analysis and Statistics, ECBC-TR-1422, Ann Brozena, Research and Technology Directorate

  8. Data Mining and Statistics for Decision Making

    CERN Document Server

    Tufféry, Stéphane

    2011-01-01

    Data mining is the process of automatically searching large volumes of data for models and patterns using computational techniques from statistics, machine learning and information theory; it is the ideal tool for such an extraction of knowledge. Data mining is usually associated with a business or an organization's need to identify trends and profiles, allowing, for example, retailers to discover patterns on which to base marketing objectives. This book looks at both classical and recent techniques of data mining, such as clustering, discriminant analysis, logistic regression, generalized lin

  9. Statistical Analysis of Data for Timber Strengths

    DEFF Research Database (Denmark)

    Sørensen, John Dalsgaard

    2003-01-01

    Statistical analyses are performed for material strength parameters from a large number of specimens of structural timber. Non-parametric statistical analysis and fits have been investigated for the following distribution types: Normal, Lognormal, 2-parameter Weibull and 3-parameter Weibull...... fits to the data available, especially if tail fits are used whereas the Log Normal distribution generally gives a poor fit and larger coefficients of variation, especially if tail fits are used. The implications on the reliability level of typical structural elements and on partial safety factors...... for timber are investigated....

  10. Dimensional enrichment of statistical linked open data

    DEFF Research Database (Denmark)

    Varga, Jovan; Vaisman, Alejandro; Romero, Oscar

    2016-01-01

    On-Line Analytical Processing (OLAP) is a data analysis technique typically used for local and well-prepared data. However, initiatives like Open Data and Open Government bring new and publicly available data on the web that are to be analyzed in the same way. The use of semantic web technologies...... for this context is especially encouraged by the Linked Data initiative. There is already a considerable amount of statistical linked open data sets published using the RDF Data Cube Vocabulary (QB) which is designed for these purposes. However, QB lacks some essential schema constructs (e.g., dimension levels......) to support OLAP. Thus, the QB4OLAP vocabulary has been proposed to extend QB with the necessary constructs and be fully compliant with OLAP. In this paper, we focus on the enrichment of an existing QB data set with QB4OLAP semantics. We first thoroughly compare the two vocabularies and outline the benefits...

  11. Uncertainty analysis with statistically correlated failure data

    International Nuclear Information System (INIS)

    Modarres, M.; Dezfuli, H.; Roush, M.L.

    1987-01-01

    Likelihood of occurrence of the top event of a fault tree or sequences of an event tree is estimated from the failure probability of components that constitute the events of the fault/event tree. Component failure probabilities are subject to statistical uncertainties. In addition, there are cases where the failure data are statistically correlated. At present most fault tree calculations are based on uncorrelated component failure data. This chapter describes a methodology for assessing the probability intervals for the top event failure probability of fault trees or frequency of occurrence of event tree sequences when event failure data are statistically correlated. To estimate mean and variance of the top event, a second-order system moment method is presented through Taylor series expansion, which provides an alternative to the normally used Monte Carlo method. For cases where component failure probabilities are statistically correlated, the Taylor expansion terms are treated properly. A moment-matching technique is used to obtain the probability distribution function of the top event through fitting the Johnson S_B distribution. The computer program, CORRELATE, was developed to perform the calculations necessary for the implementation of the method developed. (author)
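
    The idea of propagating correlated component uncertainties through a Taylor expansion can be sketched for a single two-input OR gate. This is a minimal illustration under stated assumptions, not the chapter's method or the CORRELATE program: the function names and the numbers are hypothetical, the mean keeps terms up to second order, and the variance keeps only the first-order term.

```python
def or_gate(p1, p2):
    """Top-event probability for an OR gate with independent basic events."""
    return p1 + p2 - p1 * p2


def top_event_moments(mu, cov):
    """Approximate mean and variance of or_gate(P1, P2) by Taylor expansion.

    mu  : (mean of P1, mean of P2)
    cov : 2x2 covariance matrix of (P1, P2); the off-diagonal entry carries
          the statistical correlation between the failure data.
    """
    m1, m2 = mu
    grad = (1.0 - m2, 1.0 - m1)          # first derivatives of or_gate at mu
    hess = ((0.0, -1.0), (-1.0, 0.0))    # second derivatives (constant here)
    second_order = 0.5 * sum(hess[i][j] * cov[i][j]
                             for i in range(2) for j in range(2))
    mean = or_gate(m1, m2) + second_order
    variance = sum(grad[i] * cov[i][j] * grad[j]
                   for i in range(2) for j in range(2))
    return mean, variance


# Hypothetical component data: means 0.1 and 0.2, positively correlated.
top_mean, top_var = top_event_moments(
    (0.1, 0.2),
    ((0.0004, 0.0002), (0.0002, 0.0009)),
)
```

    Setting the off-diagonal covariance to zero recovers the uncorrelated case, so the difference between the two runs shows how much the correlation shifts the top-event estimate.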

  12. Robust statistics and geochemical data analysis

    International Nuclear Information System (INIS)

    Di, Z.

    1987-01-01

    Advantages of robust procedures over ordinary least-squares procedures in geochemical data analysis are demonstrated using NURE data from the Hot Springs Quadrangle, South Dakota, USA. Robust principal components analysis with 5% multivariate trimming successfully guarded the analysis against perturbations by outliers and increased the number of interpretable factors. Regression with SINE estimates significantly increased the goodness-of-fit of the regression and improved the correspondence of delineated anomalies with known uranium prospects. Because of the ubiquitous existence of outliers in geochemical data, robust statistical procedures are suggested as routine procedures to replace ordinary least-squares procedures.

  13. Critical analysis of adsorption data statistically

    Science.gov (United States)

    Kaushal, Achla; Singh, S. K.

    2017-10-01

    Experimental data can be presented, computed, and critically analysed in a different way using statistics. A variety of statistical tests are used to make decisions about the significance and validity of the experimental data. In the present study, adsorption was carried out to remove zinc ions from contaminated aqueous solution using mango leaf powder. The experimental data was analysed statistically by hypothesis testing applying t test, paired t test and Chi-square test to (a) test the optimum value of the process pH, (b) verify the success of experiment and (c) study the effect of adsorbent dose in zinc ion removal from aqueous solutions. Comparison of calculated and tabulated values of t and χ² showed the results in favour of the data collected from the experiment and this has been shown on probability charts. K value for Langmuir isotherm was 0.8582 and m value for Freundlich adsorption isotherm obtained was 0.725, both are mango leaf powder.

  14. Statistical analysis of network data with R

    CERN Document Server

    Kolaczyk, Eric D

    2014-01-01

    Networks have permeated everyday life through everyday realities like the Internet, social networks, and viral marketing. As such, network analysis is an important growth area in the quantitative sciences, with roots in social network analysis going back to the 1930s and graph theory going back centuries. Measurement and analysis are integral components of network research. As a result, statistical methods play a critical role in network analysis. This book is the first of its kind in network research. It can be used as a stand-alone resource in which multiple R packages are used to illustrate how to conduct a wide range of network analyses, from basic manipulation and visualization, to summary and characterization, to modeling of network data. The central package is igraph, which provides extensive capabilities for studying network graphs in R. This text builds on Eric D. Kolaczyk’s book Statistical Analysis of Network Data (Springer, 2009).

  15. Statistical analysis of dragline monitoring data

    Energy Technology Data Exchange (ETDEWEB)

    Mirabediny, H.; Baafi, E.Y. [University of Tehran, Tehran (Iran)]

    1998-07-01

    Dragline monitoring systems are normally the best tool used to collect data on the machine performance and operational parameters of a dragline operation. This paper discusses results of a time study using data from a dragline monitoring system captured over a four month period. Statistical summaries of the time study in terms of average values, standard deviation and frequency distributions showed that the mode of operation and the geological conditions have a significant influence on the dragline performance parameters. 6 refs., 14 figs., 3 tabs.

  16. Innovative statistical methods for public health data

    CERN Document Server

    Wilson, Jeffrey

    2015-01-01

    The book brings together experts working in public health and multi-disciplinary areas to present recent issues in statistical methodological development and their applications. This timely book will impact model development and data analyses of public health research across a wide spectrum of analysis. Data and software used in the studies are available for the reader to replicate the models and outcomes. The fifteen chapters range in focus from techniques for dealing with missing data with Bayesian estimation, health surveillance and population definition and implications in applied latent class analysis, to multiple comparison and meta-analysis in public health data. Researchers in biomedical and public health research will find this book to be a useful reference, and it can be used in graduate level classes.

  17. Common misconceptions about data analysis and statistics.

    Science.gov (United States)

    Motulsky, Harvey J

    2014-11-01

    Ideally, any experienced investigator with the right tools should be able to reproduce a finding published in a peer-reviewed biomedical science journal. In fact, the reproducibility of a large percentage of published findings has been questioned. Undoubtedly, there are many reasons for this, but one reason may be that investigators fool themselves due to a poor understanding of statistical concepts. In particular, investigators often make these mistakes: 1. P-Hacking. This is when you reanalyze a data set in many different ways, or perhaps reanalyze with additional replicates, until you get the result you want. 2. Overemphasis on P values rather than on the actual size of the observed effect. 3. Overuse of statistical hypothesis testing, and being seduced by the word "significant". 4. Overreliance on standard errors, which are often misunderstood.

  18. Common misconceptions about data analysis and statistics.

    Science.gov (United States)

    Motulsky, Harvey J

    2015-02-01

    Ideally, any experienced investigator with the right tools should be able to reproduce a finding published in a peer-reviewed biomedical science journal. In fact, the reproducibility of a large percentage of published findings has been questioned. Undoubtedly, there are many reasons for this, but one reason may be that investigators fool themselves due to a poor understanding of statistical concepts. In particular, investigators often make these mistakes: (1) P-Hacking. This is when you reanalyze a data set in many different ways, or perhaps reanalyze with additional replicates, until you get the result you want. (2) Overemphasis on P values rather than on the actual size of the observed effect. (3) Overuse of statistical hypothesis testing, and being seduced by the word "significant". (4) Overreliance on standard errors, which are often misunderstood.

  19. Statistical methods and computing for big data

    Science.gov (United States)

    Wang, Chun; Chen, Ming-Hui; Schifano, Elizabeth; Wu, Jing

    2016-01-01

    Big data are data on a massive scale in terms of volume, intensity, and complexity that exceed the capacity of standard analytic tools. They present opportunities as well as challenges to statisticians. The role of computational statisticians in scientific discovery from big data analyses has been under-recognized even by peer statisticians. This article summarizes recent methodological and software developments in statistics that address the big data challenges. Methodologies are grouped into three classes: subsampling-based, divide and conquer, and online updating for stream data. As a new contribution, the online updating approach is extended to variable selection with commonly used criteria, and their performances are assessed in a simulation study with stream data. Software packages are summarized with focuses on the open source R and R packages, covering recent tools that help break the barriers of computer memory and computing power. Some of the tools are illustrated in a case study with a logistic regression for the chance of airline delay. PMID:27695593

  20. Statistical methods and computing for big data.

    Science.gov (United States)

    Wang, Chun; Chen, Ming-Hui; Schifano, Elizabeth; Wu, Jing; Yan, Jun

    2016-01-01

    Big data are data on a massive scale in terms of volume, intensity, and complexity that exceed the capacity of standard analytic tools. They present opportunities as well as challenges to statisticians. The role of computational statisticians in scientific discovery from big data analyses has been under-recognized even by peer statisticians. This article summarizes recent methodological and software developments in statistics that address the big data challenges. Methodologies are grouped into three classes: subsampling-based, divide and conquer, and online updating for stream data. As a new contribution, the online updating approach is extended to variable selection with commonly used criteria, and their performances are assessed in a simulation study with stream data. Software packages are summarized with focuses on the open source R and R packages, covering recent tools that help break the barriers of computer memory and computing power. Some of the tools are illustrated in a case study with a logistic regression for the chance of airline delay.
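
    The "online updating for stream data" class named in the abstract above can be illustrated with a minimal sketch. This is not the authors' method or software: it is a generic simple linear regression that keeps only running sums, so each chunk of the stream can be discarded after it is processed. The class name `OnlineLinearFit` and its methods are hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass
class OnlineLinearFit:
    """Running sufficient statistics for y = intercept + slope * x.

    Hypothetical illustration only; not taken from the cited article.
    """
    n: int = 0
    sum_x: float = 0.0
    sum_y: float = 0.0
    sum_xx: float = 0.0
    sum_xy: float = 0.0

    def update(self, xs, ys):
        """Fold one chunk of the stream into the running sums."""
        for x, y in zip(xs, ys):
            self.n += 1
            self.sum_x += x
            self.sum_y += y
            self.sum_xx += x * x
            self.sum_xy += x * y

    def coefficients(self):
        """Return (intercept, slope) from the sums accumulated so far."""
        denom = self.n * self.sum_xx - self.sum_x ** 2
        if self.n < 2 or denom == 0:
            raise ValueError("need at least two distinct x values")
        slope = (self.n * self.sum_xy - self.sum_x * self.sum_y) / denom
        intercept = (self.sum_y - slope * self.sum_x) / self.n
        return intercept, slope


# Two chunks drawn from the exact line y = 1 + 2x.
fit = OnlineLinearFit()
fit.update([0, 1, 2], [1, 3, 5])
fit.update([3, 4], [7, 9])
intercept, slope = fit.coefficients()
print(intercept, slope)
```

    Because only five numbers are stored, memory use does not grow with the length of the stream, which is the property that makes this family of methods attractive when the data exceed computer memory.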

  1. Analysis of Preference Data Using Intermediate Test Statistic Abstract

    African Journals Online (AJOL)

    PROF. O. E. OSUAGWU

    2013-06-01

    Jun 1, 2013 ... West African Journal of Industrial and Academic Research Vol.7 No. 1 June ... Keywords:-Preference data, Friedman statistic, multinomial test statistic, intermediate test statistic. ... new method and consequently a new statistic ...

  2. Securing cooperation from persons supplying statistical data.

    Science.gov (United States)

    AUBENQUE, M J; BLAIKLEY, R M; HARRIS, F F; LAL, R B; NEURDENBURG, M G; DE SHELLY HERNANDEZ, R

    1954-01-01

    Securing the co-operation of persons supplying information required for medical statistics is essentially a problem in human relations, and an understanding of the motivations, attitudes, and behaviour of the respondents is necessary. Before any new statistical survey is undertaken, it is suggested by Aubenque and Harris that a preliminary review be made so that the maximum use is made of existing information. Care should also be taken not to burden respondents with an overloaded questionnaire. Aubenque and Harris recommend simplified reporting. Complete population coverage is not necessary. Neurdenburg suggests that the co-operation and support of such organizations as medical associations and social security boards are important and that propaganda should be directed specifically to the groups whose co-operation is sought. Informal personal contacts are valuable and desirable, according to Blaikley, but may have adverse effects if the right kind of approach is not made. Financial payments as an incentive in securing co-operation are opposed by Neurdenburg, who proposes that only postage-free envelopes or similar small favours be granted. Blaikley and Harris, on the other hand, express the view that financial incentives may do much to gain the support of those required to furnish data; there are, however, other incentives, and full use should be made of the natural inclinations of respondents. Compulsion may be necessary in certain instances, but administrative rather than statutory measures should be adopted. Penalties, according to Aubenque, should be inflicted only when justified by imperative health requirements. The results of surveys should be made available as soon as possible to those who co-operated, and Aubenque and Harris point out that they should also be of practical value to the suppliers of the information. Greater co-operation can be secured from medical persons who have an understanding of the statistical principles involved; Aubenque and Neurdenburg

  3. Encoding Dissimilarity Data for Statistical Model Building.

    Science.gov (United States)

    Wahba, Grace

    2010-12-01

    We summarize, review and comment upon three papers which discuss the use of discrete, noisy, incomplete, scattered pairwise dissimilarity data in statistical model building. Convex cone optimization codes are used to embed the objects into a Euclidean space which respects the dissimilarity information while controlling the dimension of the space. A "newbie" algorithm is provided for embedding new objects into this space. This allows the dissimilarity information to be incorporated into a Smoothing Spline ANOVA penalized likelihood model, a Support Vector Machine, or any model that will admit Reproducing Kernel Hilbert Space components, for nonparametric regression, supervised learning, or semi-supervised learning. Future work and open questions are discussed. The papers are: (1) F. Lu, S. Keles, S. Wright and G. Wahba 2005. A framework for kernel regularization with application to protein clustering. Proceedings of the National Academy of Sciences 102, 12332-1233. (2) G. Corrada Bravo, G. Wahba, K. Lee, B. Klein, R. Klein and S. Iyengar 2009. Examining the relative influence of familial, genetic and environmental covariate information in flexible risk models. Proceedings of the National Academy of Sciences 106, 8128-8133. (3) F. Lu, Y. Lin and G. Wahba. Robust manifold unfolding with kernel regularization. TR 1008, Department of Statistics, University of Wisconsin-Madison.

  4. Statistical Analysis of Data for Timber Strengths

    DEFF Research Database (Denmark)

    Sørensen, John Dalsgaard; Hoffmeyer, P.

    Statistical analyses are performed for material strength parameters from approximately 6700 specimens of structural timber. Non-parametric statistical analyses and fits to the following distribution types have been investigated: Normal, Lognormal, 2-parameter Weibull and 3-parameter Weibull...

  5. A Climate Statistics Tool and Data Repository

    Science.gov (United States)

    Wang, J.; Kotamarthi, V. R.; Kuiper, J. A.; Orr, A.

    2017-12-01

    Researchers at Argonne National Laboratory and collaborating organizations have generated regional scale, dynamically downscaled climate model output using Weather Research and Forecasting (WRF) version 3.3.1 at a 12 km horizontal spatial resolution over much of North America. The WRF model is driven by boundary conditions obtained from three independent global scale climate models and two different future greenhouse gas emission scenarios, named representative concentration pathways (RCPs). The repository of results has a temporal resolution of three hours for all the simulations, includes more than 50 variables, is stored in Network Common Data Form (NetCDF) files, and the data volume is nearly 600Tb. A condensed 800Gb set of NetCDF files was made for selected variables most useful for climate-related planning, including daily precipitation, relative humidity, solar radiation, maximum temperature, minimum temperature, and wind. The WRF model simulations are conducted for three 10-year time periods (1995-2004, 2045-2054, and 2085-2094), and two future scenarios (RCP4.5 and RCP8.5). An open-source tool was coded using Python 2.7.8 and ESRI ArcGIS 10.3.1 programming libraries to parse the NetCDF files, compute summary statistics, and output results as GIS layers. Eight sets of summary statistics were generated as examples for the contiguous U.S. states and much of Alaska, including number of days over 90°F, number of days with a heat index over 90°F, heat waves, monthly and annual precipitation, drought, extreme precipitation, multi-model averages, and model bias. This paper will provide an overview of the project to generate the main and condensed data repositories, describe the Python tool and how to use it, present the GIS results of the computed examples, and discuss some of the ways they can be used for planning. The condensed climate data, Python tool, computed GIS results, and documentation of the work are shared on the Internet.

  6. STATISTICS, Program System for Statistical Analysis of Experimental Data

    International Nuclear Information System (INIS)

    Helmreich, F.

    1991-01-01

    1 - Description of problem or function: The package is composed of 83 routines, the most important of which are the following: BINDTR: Binomial distribution; HYPDTR: Hypergeometric distribution; POIDTR: Poisson distribution; GAMDTR: Gamma distribution; BETADTR: Beta-1 and Beta-2 distributions; NORDTR: Normal distribution; CHIDTR: Chi-square distribution; STUDTR: Distribution of 'Student's T'; FISDTR: Distribution of F; EXPDTR: Exponential distribution; WEIDTR: Weibull distribution; FRAKTIL: Calculation of the fractiles of the normal, chi-square, Student's, and F distributions; VARVGL: Test for equality of variance for several sample observations; ANPAST: Kolmogorov-Smirnov test and chi-square test of goodness of fit; MULIRE: Multiple linear regression analysis for a dependent variable and a set of independent variables; STPRG: Performs a stepwise multiple linear regression analysis for a dependent variable and a set of independent variables. At each step, the variable entered into the regression equation is the one which has the greatest amount of variance between it and the dependent variable. Any independent variable can be forced into or deleted from the regression equation, irrespective of its contribution to the equation. LTEST: Tests the hypothesis of linearity of the data. SPRANK: Calculates the Spearman rank correlation coefficient. 2 - Method of solution: VARVGL: Bartlett's test, Cochran's test and Hartley's test are performed in the program. MULIRE: The Gauss-Jordan method is used in the solution of the normal equations. STPRG: The abbreviated Doolittle method is used to (1) determine variables to enter into the regression, and (2) complete regression coefficient calculation. 3 - Restrictions on the complexity of the problem: VARVGL: Hartley's test is only performed if the sample observations are all of the same size

  7. Network Data: Statistical Theory and New Models

    Science.gov (United States)

    2016-02-17

    and with environmental scientists at JPL and Emory University to retrieval from NASA MISR remote sensing images aerosol index AOD for air pollution ...Beijing, May, 2013 Beijing Statistics Forum, Beijing, May, 2013 Statistics Seminar, CREST-ENSAE, Paris , March, 2013 Statistics Seminar, University...to retrieval from NASA MISR remote sensing images aerosol index AOD for air pollution monitoring and management. Satellite- retrieved Aerosol Optical

  8. Proceedings of the Pacific Rim Statistical Conference for Production Engineering : Big Data, Production Engineering and Statistics

    CERN Document Server

    Jang, Daeheung; Lai, Tze; Lee, Youngjo; Lu, Ying; Ni, Jun; Qian, Peter; Qiu, Peihua; Tiao, George

    2018-01-01

    This book presents the proceedings of the 2nd Pacific Rim Statistical Conference for Production Engineering: Production Engineering, Big Data and Statistics, which took place at Seoul National University in Seoul, Korea in December, 2016. The papers included discuss a wide range of statistical challenges, methods and applications for big data in production engineering, and introduce recent advances in relevant statistical methods.

  9. Symbolic Data Analysis Conceptual Statistics and Data Mining

    CERN Document Server

    Billard, Lynne

    2012-01-01

    With the advent of computers, very large datasets have become routine. Standard statistical methods don't have the power or flexibility to analyse these efficiently, and extract the required knowledge. An alternative approach is to summarize a large dataset in such a way that the resulting summary dataset is of a manageable size and yet retains as much of the knowledge in the original dataset as possible. One consequence of this is that the data may no longer be formatted as single values, but be represented by lists, intervals, distributions, etc. The summarized data have their own internal s

  10. Data bases and statistical systems: demography

    NARCIS (Netherlands)

    Kreyenfeld, M.; Willekens, F.J.; Wright, James D.

    2015-01-01

    This article deals with the availability of large-scale data for demographic analysis. The main sources of data that demographers work with are census data, microcensus data, population registers, other administrative data, survey data, and big data. Data of this kind can be used to generate

  11. 47 CFR 1.363 - Introduction of statistical data.

    Science.gov (United States)

    2010-10-01

    ... 47 Telecommunication 1 2010-10-01 2010-10-01 false Introduction of statistical data. 1.363 Section... Proceedings Evidence § 1.363 Introduction of statistical data. (a) All statistical studies, offered in... analyses, and experiments, and those parts of other studies involving statistical methodology shall be...

  12. Statistical process control for serially correlated data

    NARCIS (Netherlands)

    Wieringa, Jakob Edo

    1999-01-01

    Statistical Process Control (SPC) aims at quality improvement through reduction of variation. The best known tool of SPC is the control chart. Over the years, the control chart has proved to be a successful practical technique for monitoring process measurements. However, its usefulness in practice

  13. Statistical data of the uranium industry

    International Nuclear Information System (INIS)

    1976-01-01

    Historical facts and figures of the uranium industry through 1975 are compiled. Areas covered are ore and concentrate purchases; uranium resources; distribution of $10, $15, and $30 reserves; drilling statistics; uranium exploration expenditures; land holdings for uranium mining and exploration; employment; commercial U3O8 sales and requirements; and processing mills

  14. Statistical Analysis Of Reconnaissance Geochemical Data From ...

    African Journals Online (AJOL)

    , Co, Mo, Hg, Sb, Tl, Sc, Cr, Ni, La, W, V, U, Th, Bi, Sr and Ga in 56 stream sediment samples collected from Orle drainage system were subjected to univariate and multivariate statistical analyses. The univariate methods used include ...

  15. Challenges in dental statistics: data and modelling

    OpenAIRE

    Matranga, D.; Castiglia, P.; Solinas, G.

    2013-01-01

    The aim of this work is to present the reflections and proposals derived from the first Workshop of the SISMEC STATDENT working group on statistical methods and applications in dentistry, held in Ancona (Italy) on 28th September 2011. STATDENT began as a forum of comparison and discussion for statisticians working in the field of dental research in order to suggest new and improve existing biostatistical and clinical epidemiological methods. During the meeting, we dealt with very important to...

  16. Statistical analysis of medical data using SAS

    CERN Document Server

    Der, Geoff

    2005-01-01

    An Introduction to SAS; Describing and Summarizing Data; Basic Inference; Scatterplots Correlation: Simple Regression and Smoothing; Analysis of Variance and Covariance; Multiple Regression; Logistic Regression; The Generalized Linear Model; Generalized Additive Models; Nonlinear Regression Models; The Analysis of Longitudinal Data I; The Analysis of Longitudinal Data II: Models for Normal Response Variables; The Analysis of Longitudinal Data III: Non-Normal Response; Survival Analysis; Analysing Multivariate Data: Principal Components and Cluster Analysis; References

  17. Application of descriptive statistics in analysis of experimental data

    OpenAIRE

    Mirilović Milorad; Pejin Ivana

    2008-01-01

    Statistics today represent a group of scientific methods for the quantitative and qualitative investigation of variations in mass appearances. In fact, statistics present a group of methods that are used for the accumulation, analysis, presentation and interpretation of data necessary for reaching certain conclusions. Statistical analysis is divided into descriptive statistical analysis and inferential statistics. The values which represent the results of an experiment, and which are the subj...

  18. The Statistical Analysis of Failure Time Data

    CERN Document Server

    Kalbfleisch, John D

    2011-01-01

    Contains additional discussion and examples on left truncation as well as material on more general censoring and truncation patterns. Introduces the martingale and counting process formulation in a new chapter. Develops multivariate failure time data in a separate chapter and extends the material on Markov and semi Markov formulations. Presents new examples and applications of data analysis.

  19. Statistical Challenges in "Big Data" Human Neuroimaging.

    Science.gov (United States)

    Smith, Stephen M; Nichols, Thomas E

    2018-01-17

    Smith and Nichols discuss "big data" human neuroimaging studies, with very large subject numbers and amounts of data. These studies provide great opportunities for making new discoveries about the brain but raise many new analytical challenges and interpretational risks. Copyright © 2017 Elsevier Inc. All rights reserved.

  20. Statistical Analysis of Research Data | Center for Cancer Research

    Science.gov (United States)

    Recent advances in cancer biology have resulted in the need for increased statistical analysis of research data. The Statistical Analysis of Research Data (SARD) course will be held on April 5-6, 2018 from 9 a.m.-5 p.m. at the National Institutes of Health's Natcher Conference Center, Balcony C on the Bethesda Campus. SARD is designed to provide an overview on the general principles of statistical analysis of research data. The first day will feature univariate data analysis, including descriptive statistics, probability distributions, one- and two-sample inferential statistics.

  1. Fetal Alcohol Spectrum Disorders (FASDs): Data and Statistics

    Science.gov (United States)

    ... alcohol screening and counseling for all women Data & Statistics Recommend on Facebook Tweet Share Compartir Prevalence of ... conducted annually by the National Center for Health Statistics (NCHS), CDC, to produce national estimates for a ...

  2. Statistical methods for categorical data analysis

    CERN Document Server

    Powers, Daniel

    2008-01-01

    This book provides a comprehensive introduction to methods and models for categorical data analysis and their applications in social science research. Companion website also available, at https://webspace.utexas.edu/dpowers/www/

  3. Statistical modeling and extrapolation of carcinogenesis data

    International Nuclear Information System (INIS)

    Krewski, D.; Murdoch, D.; Dewanji, A.

    1986-01-01

    Mathematical models of carcinogenesis are reviewed, including pharmacokinetic models for metabolic activation of carcinogenic substances. Maximum likelihood procedures for fitting these models to epidemiological data are discussed, including situations where the time to tumor occurrence is unobservable. The plausibility of different possible shapes of the dose response curve at low doses is examined, and a robust method for linear extrapolation to low doses is proposed and applied to epidemiological data on radiation carcinogenesis

  4. Statistical multistep direct and statistical multistep compound models for calculations of nuclear data for applications

    International Nuclear Information System (INIS)

    Seeliger, D.

    1993-01-01

    This contribution contains a brief presentation and comparison of the different Statistical Multistep Approaches, presently available for practical nuclear data calculations. (author). 46 refs, 5 figs

  5. Statistical methods for handling incomplete data

    CERN Document Server

    Kim, Jae Kwang

    2013-01-01

    "… this book nicely blends the theoretical material and its application through examples, and will be of interest to students and researchers as a textbook or a reference book. Extensive coverage of recent advances in handling missing data provides resources and guidelines for researchers and practitioners in implementing the methods in new settings. … I plan to use this as a textbook for my teaching and highly recommend it." - Biometrics, September 2014

  6. Statistics

    CERN Document Server

    Hayslett, H T

    1991-01-01

    Statistics covers the basic principles of Statistics. The book starts by tackling the importance and the two kinds of statistics; the presentation of sample data; the definition, illustration and explanation of several measures of location; and the measures of variation. The text then discusses elementary probability, the normal distribution and the normal approximation to the binomial. Testing of statistical hypotheses and tests of hypotheses about the theoretical proportion of successes in a binomial population and about the theoretical mean of a normal population are explained. The text the

  7. Plasma data analysis using statistical analysis system

    International Nuclear Information System (INIS)

    Yoshida, Z.; Iwata, Y.; Fukuda, Y.; Inoue, N.

    1987-01-01

    Multivariate factor analysis has been applied to a plasma data base of REPUTE-1. The characteristics of the reverse field pinch plasma in REPUTE-1 are shown to be explained by four independent parameters which are described in the report. The well known scaling laws $F_\chi \propto I_p$, $T_e \propto I_p$, and $\tau_E \propto N_e$ are also confirmed. 4 refs., 8 figs., 1 tab

  8. Statistics

    Science.gov (United States)

    Links to sources of cancer-related statistics, including the Surveillance, Epidemiology and End Results (SEER) Program, SEER-Medicare datasets, cancer survivor prevalence data, and the Cancer Trends Progress Report.

  9. Using Facebook Data to Turn Introductory Statistics Students into Consultants

    Science.gov (United States)

    Childers, Adam F.

    2017-01-01

    Facebook provides businesses and organizations with copious data that describe how users are interacting with their page. This data affords an excellent opportunity to turn introductory statistics students into consultants to analyze the Facebook data using descriptive and inferential statistics. This paper details a semester-long project that…

  10. Statistical data processing with automatic system for environmental radiation monitoring

    International Nuclear Information System (INIS)

    Zarkh, V.G.; Ostroglyadov, S.V.

    1986-01-01

    The practice of statistical data processing for radiation monitoring is exemplified, and some results obtained are presented. Experience in the practical application of mathematical statistics methods to radiation monitoring data made it possible to develop a concrete statistical processing algorithm, implemented on an M-6000 minicomputer. The suggested algorithm is divided by content into 3 parts: parametric data processing and hypothesis testing, pair and multiple correlation analysis. The statistical processing programs operate in dialogue mode. The algorithm was used to process data observed over a radioactive waste disposal control region. Results of processing surface water monitoring data are presented

  11. A nonparametric spatial scan statistic for continuous data.

    Science.gov (United States)

    Jung, Inkyung; Cho, Ho Jin

    2015-10-20

    Spatial scan statistics are widely used for spatial cluster detection, and several parametric models exist. For continuous data, a normal-based scan statistic can be used. However, the performance of the model has not been fully evaluated for non-normal data. We propose a nonparametric spatial scan statistic based on the Wilcoxon rank-sum test statistic and compared the performance of the method with parametric models via a simulation study under various scenarios. The nonparametric method outperforms the normal-based scan statistic in terms of power and accuracy in almost all cases under consideration in the simulation study. The proposed nonparametric spatial scan statistic is therefore an excellent alternative to the normal model for continuous data and is especially useful for data following skewed or heavy-tailed distributions.

  12. Statistics of meteorological data at Tokai Research Establishment in JAERI

    International Nuclear Information System (INIS)

    Sekita, Tsutomu; Tachibana, Haruo; Matsuura, Kenichi; Yamaguchi, Takenori

    2003-12-01

    The meteorological observation data at Tokai site were analyzed statistically based on a 'Guideline of meteorological statistics for the safety analysis of nuclear power reactor' (Nuclear Safety Commission on January 28, 1982; revised on March 29, 2001). This report shows the meteorological analysis of wind direction, wind velocity and atmospheric stability etc. to assess the public dose around the Tokai site caused by the released gaseous radioactivity. The statistical period of meteorological data is every 5 years from 1981 to 1995. (author)

  13. Using Data from Climate Science to Teach Introductory Statistics

    Science.gov (United States)

    Witt, Gary

    2013-01-01

    This paper shows how the application of simple statistical methods can reveal to students important insights from climate data. While the popular press is filled with contradictory opinions about climate science, teachers can encourage students to use introductory-level statistics to analyze data for themselves on this important issue in public…

  14. The value of statistical tools to detect data fabrication

    NARCIS (Netherlands)

    Hartgerink, C.H.J.; Wicherts, J.M.; van Assen, M.A.L.M.

    2016-01-01

    We aim to investigate how statistical tools can help detect potential data fabrication in the social and medical sciences. In this proposal we outline three projects to assess the value of such statistical tools to detect potential data fabrication and make the first steps in order to apply them

  15. Journal data sharing policies and statistical reporting inconsistencies in psychology.

    NARCIS (Netherlands)

    Nuijten, M.B.; Borghuis, J.; Veldkamp, C.L.S.; Dominguez Alvarez, L.; van Assen, M.A.L.M.; Wicherts, J.M.

    2018-01-01

    In this paper, we present three retrospective observational studies that investigate the relation between data sharing and statistical reporting inconsistencies. Previous research found that reluctance to share data was related to a higher prevalence of statistical errors, often in the direction of

  16. Simple statistical methods for software engineering data and patterns

    CERN Document Server

    Pandian, C Ravindranath

    2015-01-01

    Although there are countless books on statistics, few are dedicated to the application of statistical methods to software engineering. Simple Statistical Methods for Software Engineering: Data and Patterns fills that void. Instead of delving into overly complex statistics, the book details simpler solutions that are just as effective and connect with the intuition of problem solvers. Sharing valuable insights into software engineering problems and solutions, the book not only explains the required statistical methods, but also provides many examples, review questions, and case studies that prov

  17. National Vital Statistics System (NVSS) - National Cardiovascular Disease Surveillance Data

    Data.gov (United States)

    U.S. Department of Health & Human Services — 2000 forward. NVSS is a secure, web-based data management system that collects and disseminates the Nation's official vital statistics. Indicators from this data...

  18. Experimental uncertainty estimation and statistics for data having interval uncertainty.

    Energy Technology Data Exchange (ETDEWEB)

    Kreinovich, Vladik (Applied Biomathematics, Setauket, New York); Oberkampf, William Louis (Applied Biomathematics, Setauket, New York); Ginzburg, Lev (Applied Biomathematics, Setauket, New York); Ferson, Scott (Applied Biomathematics, Setauket, New York); Hajagos, Janos (Applied Biomathematics, Setauket, New York)

    2007-05-01

    This report addresses the characterization of measurements that include epistemic uncertainties in the form of intervals. It reviews the application of basic descriptive statistics to data sets which contain intervals rather than exclusively point estimates. It describes algorithms to compute various means, the median and other percentiles, variance, interquartile range, moments, confidence limits, and other important statistics and summarizes the computability of these statistics as a function of sample size and characteristics of the intervals in the data (degree of overlap, size and regularity of widths, etc.). It also reviews the prospects for analyzing such data sets with the methods of inferential statistics such as outlier detection and regressions. The report explores the tradeoff between measurement precision and sample size in statistical results that are sensitive to both. It also argues that an approach based on interval statistics could be a reasonable alternative to current standard methods for evaluating, expressing and propagating measurement uncertainties.
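
    As a small worked illustration of the interval descriptive statistics mentioned in the abstract above, the sketch below computes bounds on the mean and the median of a data set whose entries are intervals. It is not taken from the report: the function names and the sample data are hypothetical. It relies only on the fact that the mean and the median never decrease when any single data value increases, so their bounds are attained at the interval endpoints. Variance is deliberately left out, because its bounds do not follow from the endpoints in this simple way.

```python
from statistics import median


def _split(intervals):
    """Validate the intervals and return separate lists of endpoints."""
    if not intervals:
        raise ValueError("need at least one interval")
    for lo, hi in intervals:
        if lo > hi:
            raise ValueError(f"invalid interval: ({lo}, {hi})")
    lows = [lo for lo, _ in intervals]
    highs = [hi for _, hi in intervals]
    return lows, highs


def interval_mean(intervals):
    """Bounds on the sample mean: (mean of lower ends, mean of upper ends)."""
    lows, highs = _split(intervals)
    n = len(intervals)
    return sum(lows) / n, sum(highs) / n


def interval_median(intervals):
    """Bounds on the sample median: (median of lower ends, median of upper ends)."""
    lows, highs = _split(intervals)
    return median(lows), median(highs)


# Hypothetical measurements, each known only to within an interval.
data = [(1.0, 2.0), (2.5, 3.5), (3.0, 6.0)]
print(interval_mean(data))
print(interval_median(data))
```

    A point estimate is the special case where the lower and upper ends coincide, in which case both functions reduce to the ordinary mean and median.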

  19. Numeric computation and statistical data analysis on the Java platform

    CERN Document Server

    Chekanov, Sergei V

    2016-01-01

    Numerical computation, knowledge discovery and statistical data analysis integrated with powerful 2D and 3D graphics for visualization are the key topics of this book. The Python code examples powered by the Java platform can easily be transformed to other programming languages, such as Java, Groovy, Ruby and BeanShell. This book equips the reader with a computational platform which, unlike other statistical programs, is not limited by a single programming language. The author focuses on practical programming aspects and covers a broad range of topics, from basic introduction to the Python language on the Java platform (Jython), to descriptive statistics, symbolic calculations, neural networks, non-linear regression analysis and many other data-mining topics. He discusses how to find regularities in real-world data, how to classify data, and how to process data for knowledge discoveries. The code snippets are so short that they easily fit into single pages. Numeric Computation and Statistical Data Analysis ...

  20. Application of Ontology Technology in Health Statistic Data Analysis.

    Science.gov (United States)

    Guo, Minjiang; Hu, Hongpu; Lei, Xingyun

    2017-01-01

    Research Purpose: establish health management ontology for analysis of health statistic data. Proposed Methods: this paper established health management ontology based on the analysis of the concepts in China Health Statistics Yearbook, and used Protégé to define the syntactic and semantic structure of health statistical data. Six classes of top-level ontology concepts and their subclasses were extracted, and the object properties and data properties were defined to establish the construction of these classes. By ontology instantiation, we can integrate multi-source heterogeneous data and enable administrators to have an overall understanding and analysis of the health statistic data. Ontology technology provides a comprehensive and unified information integration structure of the health management domain and lays a foundation for the efficient analysis of multi-source and heterogeneous health system management data and enhancement of the management efficiency.

  1. Longitudinal data analysis: a handbook of modern statistical methods

    CERN Document Server

    Fitzmaurice, Garrett; Verbeke, Geert; Molenberghs, Geert

    2008-01-01

    Although many books currently available describe statistical models and methods for analyzing longitudinal data, they do not highlight connections between various research threads in the statistical literature. Responding to this void, Longitudinal Data Analysis provides a clear, comprehensive, and unified overview of state-of-the-art theory and applications. It also focuses on the assorted challenges that arise in analyzing longitudinal data. After discussing historical aspects, leading researchers explore four broad themes: parametric modeling, nonparametric and semiparametric methods, joint

  2. Complex Data Modeling and Computationally Intensive Statistical Methods

    CERN Document Server

    Mantovan, Pietro

    2010-01-01

    The last years have seen the advent and development of many devices able to record and store an always increasing amount of complex and high dimensional data; 3D images generated by medical scanners or satellite remote sensing, DNA microarrays, real time financial data, system control datasets. The analysis of this data poses new challenging problems and requires the development of novel statistical models and computational methods, fueling many fascinating and fast growing research areas of modern statistics. The book offers a wide variety of statistical methods and is addressed to statistici

  3. Methods for statistical data analysis of multivariate observations

    CERN Document Server

    Gnanadesikan, R

    1997-01-01

    A practical guide for multivariate statistical techniques, now updated and revised. In recent years, innovations in computer technology and statistical methodologies have dramatically altered the landscape of multivariate data analysis. This new edition of Methods for Statistical Data Analysis of Multivariate Observations explores current multivariate concepts and techniques while retaining the same practical focus of its predecessor. It integrates methods and data-based interpretations relevant to multivariate analysis in a way that addresses real-world problems arising in many areas of inte

  4. Using Data Mining to Teach Applied Statistics and Correlation

    Science.gov (United States)

    Hartnett, Jessica L.

    2016-01-01

    This article describes two class activities that introduce the concept of data mining and very basic data mining analyses. Assessment data suggest that students learned some of the conceptual basics of data mining, understood some of the ethical concerns related to the practice, and were able to perform correlations via the Statistical Package for…

  5. Insights in Experimental Data : Interactive Statistics with the ILLMO Program

    NARCIS (Netherlands)

    Martens, J.B.O.S.

    2017-01-01

    Empirical researchers turn to statistics to assist them in drawing conclusions, also called inferences, from their collected data. Often, this data is experimental data, i.e., it consists of (repeated) measurements collected in one or more distinct conditions. The observed data can hence be

  6. Explorations in Statistics: The Analysis of Ratios and Normalized Data

    Science.gov (United States)

    Curran-Everett, Douglas

    2013-01-01

    Learning about statistics is a lot like learning about science: the learning is more meaningful if you can actively explore. This ninth installment of "Explorations in Statistics" explores the analysis of ratios and normalized--or standardized--data. As researchers, we compute a ratio--a numerator divided by a denominator--to compute a…

  7. STATCAT, Statistical Analysis of Parametric and Non-Parametric Data

    International Nuclear Information System (INIS)

    David, Hugh

    1990-01-01

    1 - Description of program or function: A suite of 26 programs designed to facilitate the appropriate statistical analysis and data handling of parametric and non-parametric data, using classical and modern univariate and multivariate methods. 2 - Method of solution: Data is read entry by entry, using a choice of input formats, and the resultant data bank is checked for out-of-range, rare, extreme or missing data. The completed STATCAT data bank can be treated by a variety of descriptive and inferential statistical methods, and modified, using other standard programs as required.

  8. Enterprise Human Resources Integration-Statistical Data Mart (EHRI-SDM) Status Data

    Data.gov (United States)

    Office of Personnel Management — The Enterprise Human Resources Integration-Statistical Data Mart (EHRI-SDM) is a statistically cleansed sub-set of the data contained in the EHRI data warehouse. It...

  9. Enterprise Human Resources Integration-Statistical Data Mart (EHRI-SDM) Dynamics Data

    Data.gov (United States)

    Office of Personnel Management — The Enterprise Human Resources Integration-Statistical Data Mart (EHRI-SDM) is a statistically cleansed sub-set of the data contained in the EHRI data warehouse. It...

  10. Maximum entropy prior uncertainty and correlation of statistical economic data

    NARCIS (Netherlands)

    Dias, Rodriques J.F.

    2016-01-01

    Empirical estimates of source statistical economic data such as trade flows, greenhouse gas emissions or employment figures are always subject to uncertainty (stemming from measurement errors or confidentiality) but information concerning that uncertainty is often missing. This paper uses concepts

  11. Improved custom statistics visualization for CA Performance Center data

    CERN Document Server

    Talevi, Iacopo

    2017-01-01

    The main goal of my project is to understand and experiment with the possibilities that CA Performance Center (CA PC) offers for creating custom applications to display stored information through interesting visual means, such as maps. In particular, I have re-written some of the network statistics web pages in order to fetch data from new statistics modules in CA PC, which has its own API, and stop using the RRD data.

  12. Statistical methods of combining information: Applications to sensor data fusion

    Energy Technology Data Exchange (ETDEWEB)

    Burr, T.

    1996-12-31

    This paper reviews some statistical approaches to combining information from multiple sources. Promising new approaches will be described, and potential applications to combining not-so-different data sources such as sensor data will be discussed. Experiences with one real data set are described.

  13. LSD Dimensions: Use and Reuse of Linked Statistical Data

    NARCIS (Netherlands)

    Meroño-Peñuela, Albert

    2014-01-01

    RDF Data Cube (QB) has boosted the publication of Linked Statistical Data (LSD) on the Web, making them linkable to other related datasets and concepts following the Linked Data paradigm. In this demo we present LSD Dimensions, a web based application that monitors the usage of dimensions and codes

  14. Big Data as a Source for Official Statistics

    Directory of Open Access Journals (Sweden)

    Daas Piet J.H.

    2015-06-01

    More and more data are being produced by an increasing number of electronic devices physically surrounding us and on the internet. The large amount of data and the high frequency at which they are produced have resulted in the introduction of the term ‘Big Data’. Because these data reflect many different aspects of our daily lives and because of their abundance and availability, Big Data sources are very interesting from an official statistics point of view. This article discusses the exploration of both opportunities and challenges for official statistics associated with the application of Big Data. Experiences gained with analyses of large amounts of Dutch traffic loop detection records and Dutch social media messages are described to illustrate the topics characteristic of the statistical analysis and use of Big Data.

  15. Testing the statistical compatibility of independent data sets

    International Nuclear Information System (INIS)

    Maltoni, M.; Schwetz, T.

    2003-01-01

    We discuss a goodness-of-fit method which tests the compatibility between statistically independent data sets. The method gives sensible results even in cases where the χ² minima of the individual data sets are very low or when several parameters are fitted to a large number of data points. In particular, it avoids the problem that a possible disagreement between data sets becomes diluted by data points which are insensitive to the crucial parameters. A formal derivation of the probability distribution function for the proposed test statistics is given, based on standard theorems of statistics. The application of the method is illustrated on data from neutrino oscillation experiments, and its complementarity to the standard goodness-of-fit is discussed.
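    The abstract above does not spell out the test statistic. A minimal sketch, assuming the statistic is the minimum of the combined χ² minus the sum of the individual χ² minima, and assuming the simplest possible case (several independent Gaussian measurements of one shared parameter), might look as follows. The function names and the toy numbers are hypothetical and are not taken from the paper.

```python
def chi2_single(theta, value, sigma):
    """Chi-square of one Gaussian measurement (value +/- sigma) for parameter theta."""
    return ((value - theta) / sigma) ** 2


def compatibility_statistic(measurements):
    """Compatibility of several measurements of ONE shared parameter.

    measurements: list of (value, sigma) pairs, one per data set.
    Returns (best_fit, statistic, dof).
    Assumption: statistic = chi2_total_min - sum(chi2_individual_min).
    """
    weights = [1.0 / sigma ** 2 for _, sigma in measurements]
    # The combined chi-square is minimised by the inverse-variance weighted mean.
    best_fit = sum(w * v for w, (v, _) in zip(weights, measurements)) / sum(weights)
    chi2_total_min = sum(chi2_single(best_fit, v, s) for v, s in measurements)
    # Each data set alone is fitted perfectly by theta = value, so its minimum is 0.
    chi2_individual_min = 0.0
    statistic = chi2_total_min - chi2_individual_min
    # Parameters counted once per data set, minus the parameters of the joint fit.
    dof = len(measurements) - 1
    return best_fit, statistic, dof


best_fit, statistic, dof = compatibility_statistic([(1.0, 0.1), (1.3, 0.2)])
print(best_fit, statistic, dof)
```

    For two measurements the statistic reduces to the squared difference divided by the sum of the variances, which is a quick sanity check on the sketch.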

  16. Statistical summaries of selected Iowa streamflow data through September 2013

    Science.gov (United States)

    Eash, David A.; O'Shea, Padraic S.; Weber, Jared R.; Nguyen, Kevin T.; Montgomery, Nicholas L.; Simonson, Adrian J.

    2016-01-04

    Statistical summaries of streamflow data collected at 184 streamgages in Iowa are presented in this report. All streamgages included for analysis have at least 10 years of continuous record collected before or through September 2013. This report is an update to two previously published reports that presented statistical summaries of selected Iowa streamflow data through September 1988 and September 1996. The statistical summaries include (1) monthly and annual flow durations, (2) annual exceedance probabilities of instantaneous peak discharges (flood frequencies), (3) annual exceedance probabilities of high discharges, and (4) annual nonexceedance probabilities of low discharges and seasonal low discharges. Also presented for each streamgage are graphs of the annual mean discharges, mean annual mean discharges, 50-percent annual flow-duration discharges (median flows), harmonic mean flows, mean daily mean discharges, and flow-duration curves. Two sets of statistical summaries are presented for each streamgage, which include (1) long-term statistics for the entire period of streamflow record and (2) recent-term statistics for or during the 30-year period of record from 1984 to 2013. The recent-term statistics are only calculated for streamgages with streamflow records pre-dating the 1984 water year and with at least 10 years of record during 1984–2013. The streamflow statistics in this report are not adjusted for the effects of water use; although some of this water is used consumptively, most of it is returned to the streams.

  17. Basic statistical tools in research and data analysis

    Directory of Open Access Journals (Sweden)

    Zulfiqar Ali

    2016-01-01

    Statistical methods involved in carrying out a study include planning, designing, collecting data, analysing, drawing meaningful interpretation and reporting of the research findings. The statistical analysis gives meaning to the meaningless numbers, thereby breathing life into lifeless data. The results and inferences are precise only if proper statistical tests are used. This article will try to acquaint the reader with the basic research tools that are utilised while conducting various studies. The article covers a brief outline of the variables, an understanding of quantitative and qualitative variables and the measures of central tendency. An idea of the sample size estimation, power analysis and the statistical errors is given. Finally, there is a summary of parametric and non-parametric tests used for data analysis.

  18. HistFitter software framework for statistical data analysis

    CERN Document Server

    Baak, M.; Côte, D.; Koutsman, A.; Lorenz, J.; Short, D.

    2015-01-01

    We present a software framework for statistical data analysis, called HistFitter, that has been used extensively by the ATLAS Collaboration to analyze big datasets originating from proton-proton collisions at the Large Hadron Collider at CERN. Since 2012 HistFitter has been the standard statistical tool in searches for supersymmetric particles performed by ATLAS. HistFitter is a programmable and flexible framework to build, book-keep, fit, interpret and present results of data models of nearly arbitrary complexity. Starting from an object-oriented configuration, defined by users, the framework builds probability density functions that are automatically fitted to data and interpreted with statistical tests. A key innovation of HistFitter is its design, which is rooted in core analysis strategies of particle physics. The concepts of control, signal and validation regions are woven into its very fabric. These are progressively treated with statistically rigorous built-in methods. Being capable of working with mu...

  19. Statistical Estimators Using Jointly Administrative and Survey Data to Produce French Structural Business Statistics

    Directory of Open Access Journals (Sweden)

    Brion Philippe

    2015-12-01

    Using as much administrative data as possible is a general trend among most national statistical institutes. Different kinds of administrative sources, from tax authorities or other administrative bodies, are very helpful material in the production of business statistics. However, these sources often have to be completed by information collected through statistical surveys. This article describes the way Insee has implemented such a strategy in order to produce French structural business statistics. The originality of the French procedure is that administrative and survey variables are used jointly for the same enterprises, unlike the majority of multisource systems, in which the two kinds of sources generally complement each other for different categories of units. The idea is to use, as much as possible, the richness of the administrative sources combined with the timeliness of a survey, even if the latter is conducted only on a sample of enterprises. One main issue is the classification of enterprises within the NACE nomenclature, which is a cornerstone variable in producing the breakdown of the results by industry. At a given date, two values of the corresponding code may coexist: the value of the register, not necessarily up to date, and the value resulting from the data collected via the survey, but only from a sample of enterprises. Using all this information together requires the implementation of specific statistical estimators combining some properties of the difference estimators with calibration techniques. This article presents these estimators, as well as their statistical properties, and compares them with those of other methods.

  20. General statistical data structure for epidemiologic studies of DOE workers

    International Nuclear Information System (INIS)

    Frome, E.L.; Hudson, D.R.

    1981-01-01

    Epidemiologic studies to evaluate the occupational risks associated with employment in the nuclear industry are currently being conducted by the Department of Energy. Data that have potential value in evaluating any long-term health effects of occupational exposure to low levels of radiation are obtained for each individual at a given facility. We propose a general data structure for statistical analysis that is used to define transformations from the data management system into the data analysis system. Statistical methods of interest in epidemiologic studies include contingency table analysis and survival analysis procedures that can be used to evaluate potential associations between occupational radiation exposure and mortality. The purposes of this paper are to discuss (1) the adequacy of this data structure for single- and multiple-facility analysis and (2) the statistical computing problems encountered in dealing with large populations over extended periods of time

  1. Estimation of global network statistics from incomplete data.

    Directory of Open Access Journals (Sweden)

    Catherine A Bliss

    Complex networks underlie an enormous variety of social, biological, physical, and virtual systems. A profound complication for the science of complex networks is that in most cases, observing all nodes and all network interactions is impossible. Previous work addressing the impacts of partial network data is surprisingly limited, focuses primarily on missing nodes, and suggests that network statistics derived from subsampled data are not suitable estimators for the same network statistics describing the overall network topology. We generate scaling methods to predict true network statistics, including the degree distribution, from only partial knowledge of nodes, links, or weights. Our methods are transparent and do not assume a known generating process for the network, thus enabling prediction of network statistics for a wide variety of applications. We validate analytical results on four simulated network classes and empirical data sets of various sizes. We perform subsampling experiments by varying proportions of sampled data and demonstrate that our scaling methods can provide very good estimates of true network statistics while acknowledging limits. Lastly, we apply our techniques to a set of rich and evolving large-scale social networks, Twitter reply networks. Based on 100 million tweets, we use our scaling techniques to propose a statistical characterization of the Twitter Interactome from September 2008 to November 2008. Our treatment allows us to find support for Dunbar's hypothesis in detecting an upper threshold for the number of active social contacts that individuals maintain over the course of one week.
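    As an illustration of what a scaling estimate from subsampled network data can look like, the sketch below rescales the edge count of a node-induced subgraph. It is not the authors' estimator: it assumes n of N nodes are drawn uniformly without replacement, so that an edge survives with probability n(n-1)/(N(N-1)). The helper names and the toy complete graph are hypothetical.

```python
import itertools
import random


def induced_edges(edges, sampled_nodes):
    """Edges whose two endpoints were both sampled (node-induced subgraph)."""
    keep = set(sampled_nodes)
    return [(u, v) for u, v in edges if u in keep and v in keep]


def scaled_edge_count(observed_edges, n_sampled, n_total):
    """Rescale an observed edge count to an estimate for the full network.

    Assumption: n_sampled of n_total nodes are drawn uniformly without
    replacement, so an edge is retained with probability
    n(n-1) / (N(N-1)).
    """
    retention = n_sampled * (n_sampled - 1) / (n_total * (n_total - 1))
    return observed_edges / retention


rng = random.Random(0)
N = 20
edges = list(itertools.combinations(range(N), 2))  # toy network: complete graph
sample = rng.sample(range(N), 8)                   # observe 8 of the 20 nodes
observed = len(induced_edges(edges, sample))
estimate = scaled_edge_count(observed, len(sample), N)
print(observed, estimate, len(edges))
```

    On a complete graph the rescaled count recovers the true number of edges exactly, whichever nodes are drawn; on real networks the estimate is only correct on average.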

  2. Statistical Data Processing with R – Metadata Driven Approach

    Directory of Open Access Journals (Sweden)

    Rudi SELJAK

    2016-06-01

    In recent years the Statistical Office of the Republic of Slovenia has put a lot of effort into re-designing its statistical process. We replaced the classical stove-pipe oriented production system with general software solutions, based on the metadata driven approach. This means that one general program code, which is parametrized with process metadata, is used for data processing for a particular survey. Currently, the general program code is entirely based on SAS macros, but in the future we would like to explore how successfully the statistical software R can be used for this approach. The paper describes the metadata driven principle for data validation, the generic software solution and the main issues connected with the use of the statistical software R for this approach.
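    The idea of one general program parametrized with process metadata can be sketched as a generic validator whose rules live in data rather than in code. The example is written in Python purely for illustration (the record above concerns SAS macros and R); the rule format, field names and limits are hypothetical.

```python
# Hypothetical process metadata: one entry per variable of a survey.
RULES = [
    {"field": "employees", "type": int, "min": 0, "max": 100000},
    {"field": "turnover", "type": float, "min": 0.0, "max": None},
]


def validate(record, rules):
    """Check one record against metadata rules and return a list of error messages.

    The function contains no survey-specific logic: a different survey is
    handled by passing a different rule list.
    """
    errors = []
    for rule in rules:
        name = rule["field"]
        value = record.get(name)
        if value is None:
            errors.append(f"{name}: missing")
            continue
        if not isinstance(value, rule["type"]):
            errors.append(f"{name}: expected {rule['type'].__name__}")
            continue
        if rule["min"] is not None and value < rule["min"]:
            errors.append(f"{name}: below minimum {rule['min']}")
        if rule["max"] is not None and value > rule["max"]:
            errors.append(f"{name}: above maximum {rule['max']}")
    return errors


print(validate({"employees": 12, "turnover": 3.5}, RULES))
print(validate({"employees": -1}, RULES))
```

    Keeping the rules in a table means that adding a survey or changing a limit is a metadata change, not a code change, which is the design choice the abstract describes.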

  3. Statistical distributions as applied to environmental surveillance data

    International Nuclear Information System (INIS)

    Speer, D.R.; Waite, D.A.

    1975-09-01

    Application of normal, log normal, and Weibull distributions to environmental surveillance data was investigated for approximately 300 nuclide-medium-year-location combinations. Corresponding W test calculations were made to determine the probability of a particular data set falling within the distribution of interest. Conclusions are drawn as to the fit of any data group to the various distributions. The significance of fitting statistical distributions to the data is discussed

  4. The application of Bayesian statistic in data fit processing

    International Nuclear Information System (INIS)

    Guan Xingyin; Li Zhenfu; Song Zhaohui

    2010-01-01

    The rationale for and the disadvantages of least squares fitting, which is commonly used in data processing, are analyzed, and the theory and common methods of applying Bayesian statistics in data processing are presented in detail. The analysis shows that the Bayesian approach avoids the restrictive hypotheses that least squares fitting imposes in data processing, that its results are more scientific and more easily understood, and that it may replace least squares fitting in data processing. (authors)

  5. Statistics and data analysis for financial engineering with R examples

    CERN Document Server

    Ruppert, David

    2015-01-01

    The new edition of this influential textbook, geared towards graduate or advanced undergraduate students, teaches the statistics necessary for financial engineering. In doing so, it illustrates concepts using financial markets and economic data, R Labs with real-data exercises, and graphical and analytic methods for modeling and diagnosing modeling errors. Financial engineers now have access to enormous quantities of data. To make use of these data, the powerful methods in this book, particularly about volatility and risks, are essential. Strengths of this fully-revised edition include major additions to the R code and the advanced topics covered. Individual chapters cover, among other topics, multivariate distributions, copulas, Bayesian computations, risk management, multivariate volatility and cointegration. Suggested prerequisites are basic knowledge of statistics and probability, matrices and linear algebra, and calculus. There is an appendix on probability, statistics and linear algebra. Practicing fina...

  6. Data Warehousing: How To Make Your Statistics Meaningful.

    Science.gov (United States)

    Flaherty, William

    2001-01-01

    Examines how one school district found a way to turn data collection from a disparate mountain of statistics into more useful information by using their Instructional Decision Support System. System software is explained as is how the district solved some data management challenges. (GR)

  7. Using Carbon Emissions Data to "Heat Up" Descriptive Statistics

    Science.gov (United States)

    Brooks, Robert

    2012-01-01

    This article illustrates using carbon emissions data in an introductory statistics assignment. The carbon emissions data has desirable characteristics including: choice of measure; skewness; and outliers. These complexities allow research and public policy debate to be introduced. (Contains 4 figures and 2 tables.)

  8. Statistical mechanics of learning: A variational approach for real data

    International Nuclear Information System (INIS)

    Malzahn, Doerthe; Opper, Manfred

    2002-01-01

    Using a variational technique, we generalize the statistical physics approach of learning from random examples to make it applicable to real data. We demonstrate the validity and relevance of our method by computing approximate estimators for generalization errors that are based on training data alone

  9. Data management and statistical analysis for environmental assessment

    International Nuclear Information System (INIS)

    Wendelberger, J.R.; McVittie, T.I.

    1995-01-01

    Data management and statistical analysis for environmental assessment are important issues on the interface of computer science and statistics. Data collection for environmental decision making can generate large quantities of various types of data. A database/GIS system developed is described which provides efficient data storage as well as visualization tools which may be integrated into the data analysis process. FIMAD is a living database and GIS system. The system has changed and developed over time to meet the needs of the Los Alamos National Laboratory Restoration Program. The system provides a repository for data which may be accessed by different individuals for different purposes. The database structure is driven by the large amount and varied types of data required for environmental assessment. The integration of the database with the GIS system provides the foundation for powerful visualization and analysis capabilities

  10. Journal Data Sharing Policies and Statistical Reporting Inconsistencies in Psychology

    Directory of Open Access Journals (Sweden)

    Michèle B. Nuijten

    2017-12-01

    In this paper, we present three retrospective observational studies that investigate the relation between data sharing and statistical reporting inconsistencies. Previous research found that reluctance to share data was related to a higher prevalence of statistical errors, often in the direction of statistical significance (Wicherts, Bakker, & Molenaar, 2011). We therefore hypothesized that journal policies about data sharing and data sharing itself would reduce these inconsistencies. In Study 1, we compared the prevalence of reporting inconsistencies in two similar journals on decision making with different data sharing policies. In Study 2, we compared reporting inconsistencies in psychology articles published in PLOS journals (with a data sharing policy) and Frontiers in Psychology (without a stipulated data sharing policy). In Study 3, we looked at papers published in the journal Psychological Science to check whether papers with or without an Open Practice Badge differed in the prevalence of reporting errors. Overall, we found no relationship between data sharing and reporting inconsistencies. We did find that journal policies on data sharing seem extremely effective in promoting data sharing. We argue that open data is essential in improving the quality of psychological science, and we discuss ways to detect and reduce reporting inconsistencies in the literature.

  11. Statistical and Visualization Data Mining Tools for Foundry Production

    Directory of Open Access Journals (Sweden)

    M. Perzyk

    2007-07-01

    In recent years, a new interdisciplinary knowledge area called data mining has developed rapidly. Its main task is to extract useful information from large amounts of previously collected data. The main possibilities and potential applications of data mining in the manufacturing industry are characterized. The main types of data mining techniques are briefly discussed, including statistical, artificial intelligence, database and visualization tools. The statistical and visualization methods are presented in more detail, showing their general possibilities and advantages as well as characteristic examples of applications in foundry production. Results of the author's research are presented, aimed at validating selected statistical tools which can be easily and effectively used in the manufacturing industry. A performance analysis has been made of methods based on ANOVA and contingency tables, dedicated to determining the most significant process parameters and to detecting possible interactions among them. Several numerical tests have been performed using simulated data sets with assumed hidden relationships, as well as some real data related to the strength of ductile cast iron, collected in a foundry. It is concluded that the statistical methods offer relatively easy and fairly reliable tools for extracting that type of knowledge about foundry manufacturing processes. However, further research is needed, aimed at explaining some imperfections of the investigated tools as well as assessing their validity for more complex tasks.

  12. HistFitter software framework for statistical data analysis

    Energy Technology Data Exchange (ETDEWEB)

    Baak, M. [CERN, Geneva (Switzerland); Besjes, G.J. [Radboud University Nijmegen, Nijmegen (Netherlands); Nikhef, Amsterdam (Netherlands); Cote, D. [University of Texas, Arlington (United States); Koutsman, A. [TRIUMF, Vancouver (Canada); Lorenz, J. [Ludwig-Maximilians-Universitaet Muenchen, Munich (Germany); Excellence Cluster Universe, Garching (Germany); Short, D. [University of Oxford, Oxford (United Kingdom)

    2015-04-15

    We present a software framework for statistical data analysis, called HistFitter, that has been used extensively by the ATLAS Collaboration to analyze big datasets originating from proton-proton collisions at the Large Hadron Collider at CERN. Since 2012 HistFitter has been the standard statistical tool in searches for supersymmetric particles performed by ATLAS. HistFitter is a programmable and flexible framework to build, book-keep, fit, interpret and present results of data models of nearly arbitrary complexity. Starting from an object-oriented configuration, defined by users, the framework builds probability density functions that are automatically fit to data and interpreted with statistical tests. Internally HistFitter uses the statistics packages RooStats and HistFactory. A key innovation of HistFitter is its design, which is rooted in analysis strategies of particle physics. The concepts of control, signal and validation regions are woven into its fabric. These are progressively treated with statistically rigorous built-in methods. Being capable of working with multiple models at once that describe the data, HistFitter introduces an additional level of abstraction that allows for easy bookkeeping, manipulation and testing of large collections of signal hypotheses. Finally, HistFitter provides a collection of tools to present results with publication quality style through a simple command-line interface. (orig.)

  13. HistFitter software framework for statistical data analysis

    International Nuclear Information System (INIS)

    Baak, M.; Besjes, G.J.; Cote, D.; Koutsman, A.; Lorenz, J.; Short, D.

    2015-01-01

    We present a software framework for statistical data analysis, called HistFitter, that has been used extensively by the ATLAS Collaboration to analyze big datasets originating from proton-proton collisions at the Large Hadron Collider at CERN. Since 2012 HistFitter has been the standard statistical tool in searches for supersymmetric particles performed by ATLAS. HistFitter is a programmable and flexible framework to build, book-keep, fit, interpret and present results of data models of nearly arbitrary complexity. Starting from an object-oriented configuration, defined by users, the framework builds probability density functions that are automatically fit to data and interpreted with statistical tests. Internally HistFitter uses the statistics packages RooStats and HistFactory. A key innovation of HistFitter is its design, which is rooted in analysis strategies of particle physics. The concepts of control, signal and validation regions are woven into its fabric. These are progressively treated with statistically rigorous built-in methods. Being capable of working with multiple models at once that describe the data, HistFitter introduces an additional level of abstraction that allows for easy bookkeeping, manipulation and testing of large collections of signal hypotheses. Finally, HistFitter provides a collection of tools to present results with publication quality style through a simple command-line interface. (orig.)

  14. Data analysis using the GNU R system for statistical computation

    Energy Technology Data Exchange (ETDEWEB)

    Simone, James; /Fermilab

    2011-07-01

    R is a language system for statistical computation. It is widely used in statistics, bioinformatics, machine learning, data mining, quantitative finance, and the analysis of clinical drug trials. Among the advantages of R are: it has become the standard language for developing statistical techniques, it is being actively developed by a large and growing global user community, it is open source software, it is highly portable (Linux, OS-X and Windows), it has a built-in documentation system, it produces high quality graphics and it is easily extensible with over four thousand extension library packages available covering statistics and applications. This report gives a very brief introduction to R with some examples using lattice QCD simulation results. It then discusses the development of R packages designed for chi-square minimization fits for lattice n-pt correlation functions.

  15. Statistical methods for longitudinal data with agricultural applications

    DEFF Research Database (Denmark)

    Anantharama Ankinakatte, Smitha

    The PhD study focuses on modeling two kinds of longitudinal data arising in agricultural applications: continuous time series data and discrete longitudinal data. Firstly, two statistical methods, neural networks and generalized additive models, are applied to predict mastitis using multivariate...... algorithm. This was found to compare favourably with the algorithm implemented in the well-known Beagle software. Finally, an R package to apply APFA models developed as part of the PhD project is described...

  16. Reducing bias in the analysis of counting statistics data

    International Nuclear Information System (INIS)

    Hammersley, A.P.; Antoniadis, A.

    1997-01-01

    In the analysis of counting statistics data it is common practice to estimate the variance of the measured data points as the data points themselves. This practice introduces a bias into the results of further analysis which may be significant, and under certain circumstances lead to false conclusions. In the case of normal weighted least squares fitting this bias is quantified and methods to avoid it are proposed. (orig.)
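
The bias described in this abstract can be illustrated with a small simulation. The sketch below is not taken from the paper: it assumes the simplest possible model (a constant fitted to Poisson counts), and the helper names `poisson_sample` and `weighted_constant_fit` are hypothetical. With each variance set equal to the data point itself, the weighted least squares estimate of a constant reduces to the harmonic mean of the counts, which cannot exceed the arithmetic mean.

```python
import math
import random


def poisson_sample(rng, mean):
    """Draw one Poisson variate using Knuth's multiplication method."""
    limit = math.exp(-mean)
    k, product = 0, rng.random()
    while product > limit:
        k += 1
        product *= rng.random()
    return k


def weighted_constant_fit(counts, variances):
    """Weighted least squares estimate of a constant, weights = 1/variance."""
    weights = [1.0 / v for v in variances]
    return sum(w * y for w, y in zip(weights, counts)) / sum(weights)


rng = random.Random(0)          # fixed seed so the run is repeatable
true_mean = 20.0                # assumed true count rate (illustrative)
counts = [poisson_sample(rng, true_mean) for _ in range(5000)]
counts = [c for c in counts if c > 0]   # a zero count would give infinite weight

# Common practice: take the variance of each point to be the point itself.
biased = weighted_constant_fit(counts, counts)
# For a constant model all points share one variance, so equal weights apply.
unweighted = sum(counts) / len(counts)

print(f"true mean          : {true_mean}")
print(f"variance = data    : {biased:.3f}")
print(f"equal variances    : {unweighted:.3f}")
```

Low counts receive the largest weights, which pulls the first estimate below the second. The abstract does not say which corrections the authors propose, so none is shown here.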

  17. QB2OLAP : enabling OLAP on statistical linked open data

    OpenAIRE

    Varga, Jovan; Etcheverry, Lorena; Vaisman, Alejandro; Romero Moral, Óscar; Bach Pedersen, Torben; Thomsen, Christian

    2016-01-01

    Publication and sharing of multidimensional (MD) data on the Semantic Web (SW) opens new opportunities for the use of On-Line Analytical Processing (OLAP). The RDF Data Cube (QB) vocabulary, the current standard for statistical data publishing, however, lacks key MD concepts such as dimension hierarchies and aggregate functions. QB4OLAP was proposed to remedy this. However, QB4OLAP requires extensive manual annotation and users must still write queries in SPARQL, the standard query language f...

  18. Some statistical properties of gene expression clustering for array data

    DEFF Research Database (Denmark)

    Abreu, G C G; Pinheiro, A; Drummond, R D

    2010-01-01

    DNA array data without a corresponding statistical error measure. We propose an easy-to-implement and simple-to-use technique that uses bootstrap re-sampling to evaluate the statistical error of the nodes provided by SOM-based clustering. Comparisons between SOM and parametric clustering are presented...... for simulated as well as for two real data sets. We also implement a bootstrap-based pre-processing procedure for SOM, that improves the false discovery ratio of differentially expressed genes. Code in Matlab is freely available, as well as some supplementary material, at the following address: https...

  19. Statistical application of groundwater monitoring data at the Hanford Site

    International Nuclear Information System (INIS)

    Chou, C.J.; Johnson, V.G.; Hodges, F.N.

    1993-09-01

    Effective use of groundwater monitoring data requires both statistical and geohydrologic interpretations. At the Hanford Site in south-central Washington state such interpretations are used for (1) detection monitoring, assessment monitoring, and/or corrective action at Resource Conservation and Recovery Act sites; (2) compliance testing for operational groundwater surveillance; (3) impact assessments at active liquid-waste disposal sites; and (4) cleanup decisions at Comprehensive Environmental Response, Compensation, and Liability Act sites. Statistical tests such as the Kolmogorov-Smirnov two-sample test are used to test the hypothesis that chemical concentrations from spatially distinct subsets or populations are identical within the uppermost unconfined aquifer. Experience at the Hanford Site in applying groundwater background data indicates that background must be considered as a statistical distribution of concentrations, rather than a single value or threshold. The use of a single numerical value as a background-based standard ignores important information and may result in excessive or unnecessary remediation. Appropriate statistical evaluation techniques include the Wilcoxon rank sum test, the quantile test, "hot spot" comparisons, and Kolmogorov-Smirnov types of tests. Application of such tests is illustrated with several case studies derived from Hanford groundwater monitoring programs. To avoid possible misuse of such data, an understanding of the limitations is needed. In addition to statistical test procedures, geochemical and hydrologic considerations are integral parts of the decision process. For this purpose a phased approach is recommended that proceeds from simple to the more complex, and from an overview to detailed analysis.
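
As a rough illustration of the two-sample Kolmogorov-Smirnov comparison mentioned above, the following sketch computes the test statistic (the largest gap between two empirical distribution functions) and the usual large-sample critical value. The concentration values are invented for the example and are not Hanford data; the function names are hypothetical.

```python
import bisect
import math


def ks_two_sample_statistic(sample_a, sample_b):
    """Largest absolute difference between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    largest_gap = 0.0
    for x in a + b:
        cdf_a = bisect.bisect_right(a, x) / len(a)
        cdf_b = bisect.bisect_right(b, x) / len(b)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap


def ks_critical_value(n, m, alpha=0.05):
    """Large-sample critical value; only approximate for small samples."""
    c_alpha = math.sqrt(-0.5 * math.log(alpha / 2.0))
    return c_alpha * math.sqrt((n + m) / (n * m))


# Made-up concentrations (arbitrary units) from two well groups.
upgradient = [1.2, 1.5, 1.1, 1.8, 1.4, 1.3]
downgradient = [2.9, 3.4, 2.2, 3.1, 2.7, 3.8]

d = ks_two_sample_statistic(upgradient, downgradient)
threshold = ks_critical_value(len(upgradient), len(downgradient))
print(f"D = {d:.3f}, critical value at 5% = {threshold:.3f}")
print("distributions differ" if d > threshold else "no difference detected")
```

Because the test compares whole distributions, it fits the abstract's point that background is a distribution of concentrations rather than a single threshold.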

  20. Analyzing sickness absence with statistical models for survival data

    DEFF Research Database (Denmark)

    Christensen, Karl Bang; Andersen, Per Kragh; Smith-Hansen, Lars

    2007-01-01

    OBJECTIVES: Sickness absence is the outcome in many epidemiologic studies and is often based on summary measures such as the number of sickness absences per year. In this study the use of modern statistical methods was examined by making better use of the available information. Since sickness...... absence data deal with events occurring over time, the use of statistical models for survival data has been reviewed, and the use of frailty models has been proposed for the analysis of such data. METHODS: Three methods for analyzing data on sickness absences were compared using a simulation study...... involving the following: (i) Poisson regression using a single outcome variable (number of sickness absences), (ii) analysis of time to first event using the Cox proportional hazards model, and (iii) frailty models, which are random effects proportional hazards models. Data from a study of the relation...

  1. A statistical study on fracture toughness data of Japanese RPVS

    International Nuclear Information System (INIS)

    Sakai, Y.; Ogura, N.

    1987-01-01

    A cooperative study of the fracture toughness of pressure vessel steels produced in Japan examined a number of heats of ASTM A533B cl.1 and A508 cl.3 steels. Approximately 3000 fracture toughness data points and 8000 mechanical property data points were obtained and filed in a computer data bank, which was then used for statistical characterization of toughness data in the transition region. A curve-fitting technique for toughness data was examined, using a function to model the transition behaviour of each type of toughness. The aims of the curve fitting were: (1) to summarize an enormous toughness database to permit comparison of heats, materials and testing methods; (2) to investigate the relationships among static, dynamic and arrest toughness; and (3) to examine the ASME K(IR) curve statistically. The methodology used in this study for analyzing a large quantity of fracture toughness data was found to be useful for formulating a statistically based K(IR) curve. (orig./HP)

  2. Statistical methods to evaluate thermoluminescence ionizing radiation dosimetry data

    International Nuclear Information System (INIS)

    Segre, Nadia; Matoso, Erika; Fagundes, Rosane Correa

    2011-01-01

    Ionizing radiation levels, evaluated through the exposure of CaF2:Dy thermoluminescence dosimeters (TLD-200), have been monitored at Centro Experimental Aramar (CEA), located at Ipero in Sao Paulo state, Brazil, since 1991, resulting in a large number of measurements (more than 2,000) up to 2009. The volume of data, together with the dispersion inherent in any measurement process, calls for statistical tools to evaluate the results, a procedure also required by the Brazilian Standard CNEN-NN-3.01/PR-3.01-008, which regulates radiometric environmental monitoring. Thermoluminescence ionizing radiation dosimetry data are statistically compared in order to evaluate the potential environmental impact of CEA's activities. The statistical tools discussed in this work are box plots, control charts and analysis of variance. (author)

  3. Statistical data for the tensile properties of natural fibre composites

    Directory of Open Access Journals (Sweden)

    J.P. Torres

    2017-06-01

    This article features a large statistical database on the tensile properties of natural fibre reinforced composite laminates. The data presented here correspond to a comprehensive experimental testing program of several composite systems including: different material constituents (epoxy and vinyl ester resins; flax, jute and carbon fibres), different fibre configurations (short-fibre mats, unidirectional, and plain, twill and satin woven fabrics) and different fibre orientations (0°, 90°, and [0,90] angle plies). For each material, ~50 specimens were tested under uniaxial tensile loading. Here, we provide the complete set of stress–strain curves together with the statistical distributions of their calculated elastic modulus, strength and failure strain. The data is also provided as support material for the research article: “The mechanical properties of natural fibre composite laminates: A statistical study” [1].

  4. Statistical analysis and interpolation of compositional data in materials science.

    Science.gov (United States)

    Pesenson, Misha Z; Suram, Santosh K; Gregoire, John M

    2015-02-09

    Compositional data are ubiquitous in chemistry and materials science: analysis of elements in multicomponent systems, combinatorial problems, etc., lead to data that are non-negative and sum to a constant (for example, atomic concentrations). The constant sum constraint restricts the sampling space to a simplex instead of the usual Euclidean space. Since statistical measures such as mean and standard deviation are defined for the Euclidean space, traditional correlation studies, multivariate analysis, and hypothesis testing may lead to erroneous dependencies and incorrect inferences when applied to compositional data. Furthermore, composition measurements that are used for data analytics may not include all of the elements contained in the material; that is, the measurements may be subcompositions of a higher-dimensional parent composition. Physically meaningful statistical analysis must yield results that are invariant under the number of composition elements, requiring the application of specialized statistical tools. We present specifics and subtleties of compositional data processing through discussion of illustrative examples. We introduce basic concepts, terminology, and methods required for the analysis of compositional data and utilize them for the spatial interpolation of composition in a sputtered thin film. The results demonstrate the importance of this mathematical framework for compositional data analysis (CDA) in the fields of materials science and chemistry.
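
The constant-sum constraint described above is commonly handled with log-ratio transforms. The sketch below shows closure (rescaling to a fixed total) and the centered log-ratio (CLR) transform, which are standard tools of compositional data analysis; the abstract does not state which transforms the authors use, so treating CLR as representative is an assumption, and the sample compositions are invented.

```python
import math


def closure(parts, total=1.0):
    """Rescale positive parts so that they sum to `total`."""
    s = sum(parts)
    return [total * p / s for p in parts]


def clr(composition):
    """Centered log-ratio: log of each part minus the mean log."""
    logs = [math.log(p) for p in composition]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]


def aitchison_distance(x, y):
    """Euclidean distance between CLR coordinates of two compositions."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(clr(x), clr(y))))


# Invented three-element compositions (atomic fractions).
film_a = closure([30.0, 50.0, 20.0])
film_b = closure([10.0, 60.0, 30.0])

print("closed  :", [round(v, 3) for v in film_a])
print("clr     :", [round(v, 3) for v in clr(film_a)])
print("distance:", round(aitchison_distance(film_a, film_b), 3))
```

CLR coordinates sum to zero and do not change when a composition is rescaled, so ordinary Euclidean statistics can be applied to them. The transform requires strictly positive parts; zeros need separate treatment.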

  5. Multivariate statistical analysis of major and trace element data for ...

    African Journals Online (AJOL)

    Multivariate statistical analysis of major and trace element data for niobium exploration in the peralkaline granites of the anorogenic ring-complex province of Nigeria. PO Ogunleye, EC Ike, I Garba. No abstract available. Journal of Mining and Geology Vol. 40(2) 2004: 107-117.

  6. Exploring Foundation Concepts in Introductory Statistics Using Dynamic Data Points

    Science.gov (United States)

    Ekol, George

    2015-01-01

    This paper analyses introductory statistics students' verbal and gestural expressions as they interacted with a dynamic sketch (DS) designed using "Sketchpad" software. The DS involved numeric data points built on the number line whose values changed as the points were dragged along the number line. The study is framed on aggregate…

  7. Quick Access: Find Statistical Data on the Internet.

    Science.gov (United States)

    Su, Di

    1999-01-01

    Provides an annotated list of Internet sources (World Wide Web, ftp, and gopher sites) for current and historical statistical business data, including selected interest rates, the Consumer Price Index, the Producer Price Index, foreign currency exchange rates, noon buying rates, per diem rates, the special drawing right, stock quotes, and mutual…

  8. Data on education: from population statistics to epidemiological research

    DEFF Research Database (Denmark)

    Pallesen, Palle Bo; Tverborgvik, Torill; Rasmussen, Hanna Barbara

    2010-01-01

    BACKGROUND: Level of education is in many fields of research used as an indicator of social status. METHODS: Using Statistics Denmark's register for education and employment of the population, we examined highest completed education with a birth-cohort perspective focusing on people born between...... of population trends by use of extrapolated values, solutions are less obvious in epidemiological research using individual level data....

  9. The Use of Advanced Transportation Monitoring Data for Official Statistics

    NARCIS (Netherlands)

    Y. Ma (Yinyi)

    2016-01-01

    Traffic and transportation statistics are mainly published as aggregated information, and are traditionally based on surveys or secondary data sources, like public registers and companies’ administrations. Nowadays, advanced monitoring systems are installed in the road network, offering

  10. Applications of spatial statistical network models to stream data

    Science.gov (United States)

    Daniel J. Isaak; Erin E. Peterson; Jay M. Ver Hoef; Seth J. Wenger; Jeffrey A. Falke; Christian E. Torgersen; Colin Sowder; E. Ashley Steel; Marie-Josee Fortin; Chris E. Jordan; Aaron S. Ruesch; Nicholas Som; Pascal. Monestiez

    2014-01-01

    Streams and rivers host a significant portion of Earth's biodiversity and provide important ecosystem services for human populations. Accurate information regarding the status and trends of stream resources is vital for their effective conservation and management. Most statistical techniques applied to data measured on stream networks were developed for...

  11. Software for statistical data analysis used in Higgs searches

    International Nuclear Information System (INIS)

    Gumpert, Christian; Moneta, Lorenzo; Cranmer, Kyle; Kreiss, Sven; Verkerke, Wouter

    2014-01-01

    The analysis and interpretation of data collected by the Large Hadron Collider (LHC) requires advanced statistical tools in order to quantify the agreement between observation and theoretical models. RooStats is a project providing a statistical framework for data analysis with the focus on discoveries, confidence intervals and combination of different measurements in both Bayesian and frequentist approaches. It employs the RooFit data modelling language where mathematical concepts such as variables, (probability density) functions and integrals are represented as C++ objects. RooStats and RooFit rely on the persistency technology of the ROOT framework. The usage of a common data format enables the concept of digital publishing of complicated likelihood functions. The statistical tools have been developed in close collaboration with the LHC experiments to ensure their applicability to real-life use cases. Numerous physics results have been produced using the RooStats tools, with the discovery of the Higgs boson by the ATLAS and CMS experiments being certainly the most popular among them. We will discuss tools currently used by LHC experiments to set exclusion limits, to derive confidence intervals and to estimate discovery significances based on frequentist statistics and the asymptotic behaviour of likelihood functions. Furthermore, new developments in RooStats and performance optimisation necessary to cope with complex models depending on more than 1000 variables will be reviewed
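
To give a feel for what a discovery significance based on the asymptotic behaviour of a likelihood function looks like, here is a minimal sketch for the simplest case: a single counting experiment with an exactly known expected background. It is plain Python and does not use or reproduce the RooStats API; the function name and the example numbers are assumptions made for illustration.

```python
import math


def discovery_significance(n_observed, background):
    """Asymptotic significance Z = sqrt(q0) for one Poisson counting channel.

    q0 is the likelihood-ratio statistic for the background-only hypothesis,
    with the background expectation assumed known exactly. A deficit
    (n_observed <= background) is not counted as evidence for a signal.
    """
    if n_observed <= background:
        return 0.0
    q0 = 2.0 * (n_observed * math.log(n_observed / background)
                - (n_observed - background))
    return math.sqrt(q0)


# Illustrative numbers only: 20 events observed, 10 expected from background.
z = discovery_significance(20, 10.0)
print(f"Z = {z:.2f} standard deviations")
```

Real analyses of the kind described above include many channels and nuisance parameters for systematic uncertainties, which this one-line formula ignores.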

  12. Statistical Physics in the Era of Big Data

    Science.gov (United States)

    Wang, Dashun

    2013-01-01

    With the wealth of data provided by a wide range of high-throughput measurement tools and technologies, statistical physics of complex systems is entering a new phase, impacting in a meaningful fashion a wide range of fields, from cell biology to computer science to economics. In this dissertation, by applying tools and techniques developed in…

  13. A statistical test for outlier identification in data envelopment analysis

    Directory of Open Access Journals (Sweden)

    Morteza Khodabin

    2010-09-01

    In the use of peer group data to assess individual, typical or best practice performance, the effective detection of outliers is critical for achieving useful results. In these “deterministic” frontier models, statistical theory is now mostly available. This paper deals with the statistical pared sample method and its capability of detecting outliers in data envelopment analysis. In the presented method, each observation is deleted from the sample once and the resulting linear program is solved, leading to a distribution of efficiency estimates. Based on the achieved distribution, a pared test is designed to identify the potential outlier(s). We illustrate the method through a real data set. The method could be used in a first step, as an exploratory data analysis, before using any frontier estimation.

  14. Feature-Based Statistical Analysis of Combustion Simulation Data

    Energy Technology Data Exchange (ETDEWEB)

    Bennett, J; Krishnamoorthy, V; Liu, S; Grout, R; Hawkes, E; Chen, J; Pascucci, V; Bremer, P T

    2011-11-18

    We present a new framework for feature-based statistical analysis of large-scale scientific data and demonstrate its effectiveness by analyzing features from Direct Numerical Simulations (DNS) of turbulent combustion. Turbulent flows are ubiquitous and account for transport and mixing processes in combustion, astrophysics, fusion, and climate modeling among other disciplines. They are also characterized by coherent structure or organized motion, i.e. nonlocal entities whose geometrical features can directly impact molecular mixing and reactive processes. While traditional multi-point statistics provide correlative information, they lack nonlocal structural information, and hence, fail to provide mechanistic causality information between organized fluid motion and mixing and reactive processes. Hence, it is of great interest to capture and track flow features and their statistics together with their correlation with relevant scalar quantities, e.g. temperature or species concentrations. In our approach we encode the set of all possible flow features by pre-computing merge trees augmented with attributes, such as statistical moments of various scalar fields, e.g. temperature, as well as length-scales computed via spectral analysis. The computation is performed in an efficient streaming manner in a pre-processing step and results in a collection of meta-data that is orders of magnitude smaller than the original simulation data. This meta-data is sufficient to support a fully flexible and interactive analysis of the features, allowing for arbitrary thresholds, providing per-feature statistics, and creating various global diagnostics such as Cumulative Density Functions (CDFs), histograms, or time-series. We combine the analysis with a rendering of the features in a linked-view browser that enables scientists to interactively explore, visualize, and analyze the equivalent of one terabyte of simulation data. We highlight the utility of this new framework for combustion

  15. Statistical data on butane and kerosene in West Africa

    International Nuclear Information System (INIS)

    Masse, R.

    1990-01-01

    This book gives statistical, technical and economic information on butane and kerosene used in West Africa in 1990. The first part covers gas and its use: market, energy efficiency, performance, safety, distribution, storage, transport and commercialization. Statistical data on petroleum and natural gas production and consumption are also described, and natural gas and petroleum reserves in Africa are studied. In the second part, thirty country entries give an economic analysis of each African country. 21 figs., 19 tabs., 5 maps

  16. Data and statistical methods for analysis of trends and patterns

    International Nuclear Information System (INIS)

    Atwood, C.L.; Gentillon, C.D.; Wilson, G.E.

    1992-11-01

    This report summarizes topics considered at a working meeting on data and statistical methods for analysis of trends and patterns in US commercial nuclear power plants. This meeting was sponsored by the Office of Analysis and Evaluation of Operational Data (AEOD) of the Nuclear Regulatory Commission (NRC). Three data sets are briefly described: Nuclear Plant Reliability Data System (NPRDS), Licensee Event Report (LER) data, and Performance Indicator data. Two types of study are emphasized: screening studies, to see if any trends or patterns appear to be present; and detailed studies, which are more concerned with checking the analysis assumptions, modeling any patterns that are present, and searching for causes. A prescription is given for a screening study, and ideas are suggested for a detailed study, when the data take any of three forms: counts of events per time, counts of events per demand, and non-event data.

  17. STATISTICS. The reusable holdout: Preserving validity in adaptive data analysis.

    Science.gov (United States)

    Dwork, Cynthia; Feldman, Vitaly; Hardt, Moritz; Pitassi, Toniann; Reingold, Omer; Roth, Aaron

    2015-08-07

    Misapplication of statistical data analysis is a common cause of spurious discoveries in scientific research. Existing approaches to ensuring the validity of inferences drawn from data assume a fixed procedure to be performed, selected before the data are examined. In common practice, however, data analysis is an intrinsically adaptive process, with new analyses generated on the basis of data exploration, as well as the results of previous analyses on the same data. We demonstrate a new approach for addressing the challenges of adaptivity based on insights from privacy-preserving data analysis. As an application, we show how to safely reuse a holdout data set many times to validate the results of adaptively chosen analyses. Copyright © 2015, American Association for the Advancement of Science.

  18. Measuring the data universe data integration using statistical data and metadata exchange

    CERN Document Server

    Stahl, Reinhold

    2018-01-01

    This richly illustrated book provides an easy-to-read introduction to the challenges of organizing and integrating modern data worlds, explaining the contribution of public statistics and the ISO standard SDMX (Statistical Data and Metadata Exchange). As such, it is a must for data experts as well as those aspiring to become one. Today, exponentially growing data worlds are increasingly determining our professional and private lives. The rapid increase in the amount of globally available data, fueled by search engines and social networks but also by new technical possibilities such as Big Data, offers great opportunities. But whatever the undertaking – driving the blockchain revolution or making smartphones even smarter – success will be determined by how well it is possible to integrate, i.e. to collect, link and evaluate, the required data. One crucial factor in this is the introduction of a cross-domain order system in combination with a standardization of the data structure. Using everyday examples, th...

  19. Statistical transformation and the interpretation of inpatient glucose control data.

    Science.gov (United States)

    Saulnier, George E; Castro, Janna C; Cook, Curtiss B

    2014-03-01

    To introduce a statistical method of assessing hospital-based non-intensive care unit (non-ICU) inpatient glucose control. Point-of-care blood glucose (POC-BG) data from hospital non-ICUs were extracted for January 1 through December 31, 2011. Glucose data distribution was examined before and after Box-Cox transformations and compared to normality. Different subsets of data were used to establish upper and lower control limits, and exponentially weighted moving average (EWMA) control charts were constructed from June, July, and October data as examples to determine if out-of-control events were identified differently in nontransformed versus transformed data. A total of 36,381 POC-BG values were analyzed. In all 3 monthly test samples, glucose distributions in nontransformed data were skewed but approached a normal distribution once transformed. Interpretation of out-of-control events from EWMA control chart analyses also revealed differences. In the June test data, an out-of-control process was identified at sample 53 with nontransformed data, whereas the transformed data remained in control for the duration of the observed period. Analysis of July data demonstrated an out-of-control process sooner in the transformed (sample 55) than nontransformed (sample 111) data, whereas for October, transformed data remained in control longer than nontransformed data. Statistical transformations increase the normal behavior of inpatient non-ICU glycemic data sets. The decision to transform glucose data could influence the interpretation and conclusions about the status of inpatient glycemic control. Further study is required to determine whether transformed versus nontransformed data influence clinical decisions or evaluation of interventions.
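
The two steps named in this abstract, a Box-Cox transformation followed by an exponentially weighted moving average (EWMA) control chart, can be sketched as follows. The transformation parameter, smoothing weight, limit width and glucose-like values are all assumptions chosen for illustration; the abstract does not report the settings the authors used.

```python
import math


def box_cox(values, lam):
    """Box-Cox transform of positive values; lam == 0 gives the natural log."""
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1.0) / lam for v in values]


def ewma_chart(values, target, sigma, weight=0.2, width=3.0):
    """Return (index, ewma, lower, upper, out_of_control) for each sample."""
    z = target
    rows = []
    for t, x in enumerate(values, start=1):
        z = weight * x + (1.0 - weight) * z
        spread = math.sqrt(weight / (2.0 - weight)
                           * (1.0 - (1.0 - weight) ** (2 * t)))
        half_width = width * sigma * spread
        lower, upper = target - half_width, target + half_width
        rows.append((t, z, lower, upper, not (lower <= z <= upper)))
    return rows


# Invented process: five in-control samples, then an upward shift.
samples = [10.0] * 5 + [14.0] * 10
chart = ewma_chart(samples, target=10.0, sigma=1.0)
first_alarm = next(t for t, _, _, _, flagged in chart if flagged)
print("first out-of-control sample:", first_alarm)

# Transforming skewed values before charting changes their scale and shape.
print([round(v, 3) for v in box_cox([80.0, 120.0, 300.0], 0)])
```

Because the chart limits depend on the mean and standard deviation of the charted values, applying the chart to transformed rather than raw data can move the point at which an alarm is raised, which is the effect the abstract reports.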

  20. Kappa statistic for clustered matched-pair data.

    Science.gov (United States)

    Yang, Zhao; Zhou, Ming

    2014-07-10

    The kappa statistic is widely used to assess the agreement between two procedures in independent matched-pair data. For matched-pair data collected in clusters, on the basis of the delta method and sampling techniques, we propose a nonparametric variance estimator for the kappa statistic without within-cluster correlation structure or distributional assumptions. The results of an extensive Monte Carlo simulation study demonstrate that the proposed kappa statistic provides consistent estimation and the proposed variance estimator behaves reasonably well for at least a moderately large number of clusters (e.g., K ≥ 50). Compared with the variance estimator ignoring dependence within a cluster, the proposed variance estimator performs better in maintaining the nominal coverage probability when the intra-cluster correlation is fair (ρ ≥ 0.3), with more pronounced improvement when ρ is further increased. To illustrate the practical application of the proposed estimator, we analyze two real data examples of clustered matched-pair data. Copyright © 2014 John Wiley & Sons, Ltd.
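
For readers unfamiliar with the statistic itself, the sketch below computes the ordinary kappa point estimate from a square agreement table. It covers only the independent-data point estimate; the clustered variance estimator proposed in the paper is not reproduced, and the table counts are invented.

```python
def cohen_kappa(table):
    """Kappa from a square table; table[i][j] counts pairs rated (i, j)."""
    n = sum(sum(row) for row in table)
    observed = sum(table[i][i] for i in range(len(table))) / n
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    # Agreement expected by chance if the two ratings were independent.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / n ** 2
    return (observed - expected) / (1.0 - expected)


# Invented counts: rows are procedure A (+, -), columns are procedure B (+, -).
agreement = [[20, 5],
             [10, 15]]
print(f"kappa = {cohen_kappa(agreement):.2f}")
```

Here the observed agreement is 0.70 and the chance agreement is 0.50, so kappa is 0.40.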

  1. A note on the kappa statistic for clustered dichotomous data.

    Science.gov (United States)

    Zhou, Ming; Yang, Zhao

    2014-06-30

    The kappa statistic is widely used to assess the agreement between two raters. Motivated by a simulation-based cluster bootstrap method to calculate the variance of the kappa statistic for clustered physician-patients dichotomous data, we investigate its special correlation structure and develop a new simple and efficient data generation algorithm. For the clustered physician-patients dichotomous data, based on the delta method and its special covariance structure, we propose a semi-parametric variance estimator for the kappa statistic. An extensive Monte Carlo simulation study is performed to evaluate the performance of the new proposal and five existing methods with respect to the empirical coverage probability, root-mean-square error, and average width of the 95% confidence interval for the kappa statistic. The variance estimator ignoring the dependence within a cluster is generally inappropriate, and the variance estimators from the new proposal, bootstrap-based methods, and the sampling-based delta method perform reasonably well for at least a moderately large number of clusters (e.g., the number of clusters K ≥ 50). The new proposal and sampling-based delta method provide convenient tools for efficient computations and non-simulation-based alternatives to the existing bootstrap-based methods. Moreover, the new proposal has acceptable performance even when the number of clusters is as small as K = 25. To illustrate the practical application of all the methods, one psychiatric research data set and two simulated clustered physician-patients dichotomous data sets are analyzed. Copyright © 2014 John Wiley & Sons, Ltd.

  2. On the statistical assessment of classifiers using DNA microarray data

    Directory of Open Access Journals (Sweden)

    Carella M

    2006-08-01

    Background: In this paper we present a method for the statistical assessment of cancer predictors which make use of gene expression profiles. The methodology is applied to a new data set of microarray gene expression data collected in Casa Sollievo della Sofferenza Hospital, Foggia, Italy. The data set is made up of normal (22) and tumor (25) specimens extracted from 25 patients affected by colon cancer. We propose to give answers to some questions which are relevant for the automatic diagnosis of cancer such as: Is the size of the available data set sufficient to build accurate classifiers? What is the statistical significance of the associated error rates? In what ways can accuracy be considered dependent on the adopted classification scheme? How many genes are correlated with the pathology and how many are sufficient for an accurate colon cancer classification? The method we propose answers these questions whilst avoiding the potential pitfalls hidden in the analysis and interpretation of microarray data. Results: We estimate the generalization error, evaluated through the Leave-K-Out Cross Validation error, for three different classification schemes by varying the number of training examples and the number of the genes used. The statistical significance of the error rate is measured by using a permutation test. We provide a statistical analysis in terms of the frequencies of the genes involved in the classification. Using the whole set of genes, we found that the Weighted Voting Algorithm (WVA) classifier learns the distinction between normal and tumor specimens with 25 training examples, providing e = 21% (p = 0.045) as an error rate. This remains constant even when the number of examples increases. Moreover, Regularized Least Squares (RLS) and Support Vector Machines (SVM) classifiers can learn with only 15 training examples, with an error rate of e = 19% (p = 0.035) and e = 18% (p = 0.037) respectively. Moreover, the error rate
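
The idea of attaching a p-value to a cross-validated error rate with a permutation test can be sketched in a few lines. This is a simplified stand-in rather than the paper's procedure: it swaps in a nearest-centroid classifier on a single feature and leave-one-out cross validation in place of the WVA, RLS and SVM classifiers and the Leave-K-Out scheme, and the expression values are invented.

```python
import random


def loo_error(features, labels):
    """Leave-one-out error of a nearest-centroid classifier (1-D features)."""
    errors = 0
    for i in range(len(features)):
        sums = {}
        for j, (x, y) in enumerate(zip(features, labels)):
            if j != i:
                total, count = sums.get(y, (0.0, 0))
                sums[y] = (total + x, count + 1)
        centroids = {y: total / count for y, (total, count) in sums.items()}
        predicted = min(centroids, key=lambda y: abs(features[i] - centroids[y]))
        errors += predicted != labels[i]
    return errors / len(features)


def permutation_p_value(features, labels, n_permutations=500, seed=0):
    """Share of label shufflings whose error is no worse than the observed one."""
    rng = random.Random(seed)
    observed = loo_error(features, labels)
    shuffled = list(labels)
    as_good = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if loo_error(features, shuffled) <= observed:
            as_good += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return observed, (as_good + 1) / (n_permutations + 1)


# Invented expression levels of one gene in normal and tumor specimens.
expression = [0.1, 0.3, 0.2, 0.4, 0.0, 2.1, 2.3, 1.9, 2.2, 2.0]
specimen = ["normal"] * 5 + ["tumor"] * 5

error_rate, p_value = permutation_p_value(expression, specimen)
print(f"error rate = {error_rate:.2f}, permutation p-value = {p_value:.3f}")
```

Shuffling the labels destroys any real link between expression and class, so the shuffled error rates show what accuracy arises by chance on a data set of this size.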

  3. Using statistical correlation to compare geomagnetic data sets

    Science.gov (United States)

    Stanton, T.

    2009-04-01

    The major features of data curves are often matched, to a first order, by bump and wiggle matching to arrive at an offset between data sets. This poster describes a simple statistical correlation program that has proved useful during this stage by determining the optimal correlation between geomagnetic curves using a variety of fixed and floating windows. Its utility is suggested by the fact that it is simple to run, yet generates meaningful data comparisons, often when data noise precludes the obvious matching of curve features. Data sets can be scaled, smoothed, normalised and standardised, before all possible correlations are carried out between selected overlapping portions of each curve. Best-fit offset curves can then be displayed graphically. The program was used to cross-correlate directional and palaeointensity data from Holocene lake sediments (Stanton et al., submitted) and Holocene lava flows. Some example curve matches are shown, including some that illustrate the potential of this technique when examining particularly sparse data sets. Stanton, T., Snowball, I., Zillén, L. and Wastegård, S., submitted. Detecting potential errors in varve chronology and 14C ages using palaeosecular variation curves, lead pollution history and statistical correlation. Quaternary Geochronology.
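
The core operation described in this abstract, correlating overlapping portions of two curves over a range of offsets and keeping the best fit, might look like the sketch below. It is not the author's program: the window handling is reduced to a single sliding overlap, the function names are hypothetical, and the test curve is synthetic.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def best_offset(reference, other, max_shift, min_overlap=5):
    """Shift (in samples) that maximizes correlation of the overlapping parts.

    A shift s pairs other[i] with reference[i + s].
    """
    best = None
    for shift in range(-max_shift, max_shift + 1):
        pairs = [(reference[i + shift], other[i])
                 for i in range(len(other))
                 if 0 <= i + shift < len(reference)]
        if len(pairs) < min_overlap:
            continue
        r = pearson([p[0] for p in pairs], [p[1] for p in pairs])
        if best is None or r > best[1]:
            best = (shift, r)
    return best


# Synthetic curve; the second record is the first one starting 3 samples late.
reference_curve = [math.sin(0.4 * i) for i in range(40)]
second_curve = reference_curve[3:33]

shift, r = best_offset(reference_curve, second_curve, max_shift=10)
print(f"best shift = {shift} samples, r = {r:.3f}")
```

The scaling, smoothing, normalising and standardising steps mentioned in the abstract would be applied to both curves before this search.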

  4. Applied systems ecology: models, data, and statistical methods

    Energy Technology Data Exchange (ETDEWEB)

    Eberhardt, L L

    1976-01-01

    In this report, systems ecology is largely equated to mathematical or computer simulation modelling. The need for models in ecology stems from the necessity to have an integrative device for the diversity of ecological data, much of which is observational, rather than experimental, as well as from the present lack of a theoretical structure for ecology. Different objectives in applied studies require specialized methods. The best predictive devices may be regression equations, often non-linear in form, extracted from much more detailed models. A variety of statistical aspects of modelling, including sampling, are discussed. Several aspects of population dynamics and food-chain kinetics are described, and it is suggested that the two presently separated approaches should be combined into a single theoretical framework. It is concluded that future efforts in systems ecology should emphasize actual data and statistical methods, as well as modelling.

  5. Statistics in experimental design, preprocessing, and analysis of proteomics data.

    Science.gov (United States)

    Jung, Klaus

    2011-01-01

    High-throughput experiments in proteomics, such as 2-dimensional gel electrophoresis (2-DE) and mass spectrometry (MS), usually yield high-dimensional data sets of expression values for hundreds or thousands of proteins which are, however, observed on only a relatively small number of biological samples. Statistical methods for the planning and analysis of experiments are important to avoid false conclusions and to obtain tenable results. In this chapter, the most frequent experimental designs for proteomics experiments are illustrated. In particular, focus is put on studies for the detection of differentially regulated proteins. Furthermore, issues of sample size planning, statistical analysis of expression levels as well as methods for data preprocessing are covered.

  6. Statistical methods for data analysis in particle physics

    CERN Document Server

    AUTHOR|(CDS)2070643

    2015-01-01

    This concise set of course-based notes provides the reader with the main concepts and tools to perform statistical analysis of experimental data, in particular in the field of high-energy physics (HEP). First, an introduction to probability theory and basic statistics is given, mainly as a reminder from advanced undergraduate studies, yet also with a view to clearly distinguishing the Frequentist and Bayesian approaches and interpretations in subsequent applications. More advanced concepts and applications are gradually introduced, culminating in the chapter on upper limits, as many applications in HEP concern hypothesis testing, where often the main goal is to provide better and better limits so as to be able to distinguish eventually between competing hypotheses or to rule out some of them altogether. Many worked examples will help newcomers to the field and graduate students to understand the pitfalls in applying theoretical concepts to actual data

  7. Patterns of ureteral motion: Data compression and statistics

    International Nuclear Information System (INIS)

    Mueller-Schauenburg, W.

    1981-01-01

    Images of ureteral peristaltics (ureteral kinetography) have been recorded at Tuebingen University Hospital since 1978. These images give a synoptical picture of ureteral motion in highly compressed form. Possibilities of data compression are discussed on the basis of functional path-time images, the ROI series, the in the path-time matrix, and the background subtraction. Particular attention is paid to problems of urethral activity statistics. (WU) [de

  8. Statistical Approaches to Assess Biosimilarity from Analytical Data.

    Science.gov (United States)

    Burdick, Richard; Coffey, Todd; Gutka, Hiten; Gratzl, Gyöngyi; Conlon, Hugh D; Huang, Chi-Ting; Boyne, Michael; Kuehne, Henriette

    2017-01-01

    Protein therapeutics have unique critical quality attributes (CQAs) that define their purity, potency, and safety. The analytical methods used to assess CQAs must be able to distinguish clinically meaningful differences in comparator products, and the most important CQAs should be evaluated with the most statistical rigor. High-risk CQA measurements assess the most important attributes that directly impact the clinical mechanism of action or have known implications for safety, while the moderate- to low-risk characteristics may have a lower direct impact and thereby may have a broader range to establish similarity. Statistical equivalence testing is applied for high-risk CQA measurements to establish the degree of similarity (e.g., highly similar fingerprint, highly similar, or similar) of selected attributes. Notably, some high-risk CQAs (e.g., primary sequence or disulfide bonding) are qualitative (e.g., the same as the originator or not the same) and therefore not amenable to equivalence testing. For biosimilars, an important step is the acquisition of a sufficient number of unique originator drug product lots to measure the variability in the originator drug manufacturing process and provide sufficient statistical power for the analytical data comparisons. Together, these analytical evaluations, along with PK/PD and safety data (immunogenicity), provide the data necessary to determine if the totality of the evidence warrants a designation of biosimilarity and subsequent licensure for marketing in the USA. In this paper, a case study approach is used to provide examples of analytical similarity exercises and the appropriateness of statistical approaches for the example data.

  9. Statistical Challenges of Big Data Analysis in Medicine

    Czech Academy of Sciences Publication Activity Database

    Kalina, Jan

    2015-01-01

    Vol. 3, No. 1 (2015), pp. 24-27 ISSN 1805-8698 R&D Projects: GA ČR GA13-23940S Grant - others:CESNET Development Fund(CZ) 494/2013 Institutional support: RVO:67985807 Keywords : big data * variable selection * classification * cluster analysis Subject RIV: BB - Applied Statistics, Operational Research http://www.ijbh.org/ijbh2015-1.pdf

  10. Maximum Likelihood, Consistency and Data Envelopment Analysis: A Statistical Foundation

    OpenAIRE

    Rajiv D. Banker

    1993-01-01

    This paper provides a formal statistical basis for the efficiency evaluation techniques of data envelopment analysis (DEA). DEA estimators of the best practice monotone increasing and concave production function are shown to be also maximum likelihood estimators if the deviation of actual output from the efficient output is regarded as a stochastic variable with a monotone decreasing probability density function. While the best practice frontier estimator is biased below the theoretical front...

  11. Analysis of spectral data with rare events statistics

    International Nuclear Information System (INIS)

    Ilyushchenko, V.I.; Chernov, N.I.

    1990-01-01

    The case is considered of analyzing experimental data when the results of individual experimental runs cannot be summed due to large systematic errors. A statistical analysis of the hypothesis about the persistent peaks in the spectra has been performed by means of the Neyman-Pearson test. The computations demonstrate that the confidence level for the hypothesis about the presence of a persistent peak in the spectrum is proportional to the square root of the number of independent experimental runs, K. 5 refs

  12. SAS and R data management, statistical analysis, and graphics

    CERN Document Server

    Kleinman, Ken

    2009-01-01

    An All-in-One Resource for Using SAS and R to Carry out Common Tasks. Provides a path between languages that is easier than reading complete documentation. SAS and R: Data Management, Statistical Analysis, and Graphics presents an easy way to learn how to perform an analytical task in both SAS and R, without having to navigate through the extensive, idiosyncratic, and sometimes unwieldy software documentation. The book covers many common tasks, such as data management, descriptive summaries, inferential procedures, regression analysis, and the creation of graphics, along with more complex applicat

  13. Using R for Data Management, Statistical Analysis, and Graphics

    CERN Document Server

    Horton, Nicholas J

    2010-01-01

    This title offers quick and easy access to key elements of documentation. It includes worked examples across a wide variety of applications, tasks, and graphics. "Using R for Data Management, Statistical Analysis, and Graphics" presents an easy way to learn how to perform an analytical task in R, without having to navigate through the extensive, idiosyncratic, and sometimes unwieldy software documentation and vast number of add-on packages. Organized by short, clear descriptive entries, the book covers many common tasks, such as data management, descriptive summaries, inferential proc

  14. Theoretical, analytical, and statistical interpretation of environmental data

    International Nuclear Information System (INIS)

    Lombard, S.M.

    1974-01-01

    The reliability of data from radiochemical analyses of environmental samples cannot be determined from nuclear counting statistics alone. The rigorous application of the principles of propagation of errors, an understanding of the physics and chemistry of the species of interest in the environment, and the application of information from research on the analytical procedure are all necessary for a valid estimation of the errors associated with analytical results. The specific case of the determination of plutonium in soil is considered in terms of analytical problems and data reliability. (U.S.)

  15. Accidents in Malaysian construction industry: statistical data and court cases.

    Science.gov (United States)

    Chong, Heap Yih; Low, Thuan Siang

    2014-01-01

    Safety and health issues remain critical to the construction industry due to its working environment and the complexity of working practises. This research adopts two approaches, using statistical data and court cases, to identify the causes and behavior underlying construction safety and health issues in Malaysia. Factual data for the period 2000-2009 were retrieved to identify the causes and agents that contributed to health issues. Moreover, court cases were tabulated and analyzed to identify legal patterns of parties involved in construction site accidents. Both approaches produced consistent results and highlighted a significant reduction in the rate of accidents per construction project in Malaysia.

  16. A Statistical Toolbox For Mining And Modeling Spatial Data

    Directory of Open Access Journals (Sweden)

    D’Aubigny Gérard

    2016-12-01

    Most data mining projects in spatial economics start with an evaluation of a set of attribute variables on a sample of spatial entities, looking for the existence and strength of spatial autocorrelation, based on Moran’s and Geary’s coefficients, the adequacy of which is rarely challenged, despite the fact that when reporting on their properties, many users seem likely to make mistakes and to foster confusion. My paper begins with a critical appraisal of the classical definition and rationale of these indices. I argue that while intuitively founded, they are plagued by an inconsistency in their conception. Then, I propose a principled small change leading to corrected spatial autocorrelation coefficients, which strongly simplifies their relationship, and opens the way to an augmented toolbox of statistical methods of dimension reduction and data visualization, also useful for modeling purposes. A second section presents a formal framework, adapted from recent work in statistical learning, which gives theoretical support to our definition of corrected spatial autocorrelation coefficients. More specifically, the multivariate data mining methods presented here are easily implementable in existing (free) software and yield methods useful for exploiting the proposed corrections in spatial data analysis practice. From a mathematical point of view, their asymptotic behavior, already studied in a series of papers by Belkin & Niyogi, suggests that they possess qualities of robustness and a limited sensitivity to the Modifiable Areal Unit Problem (MAUP), both valuable in exploratory spatial data analysis.
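
    For readers unfamiliar with the two indices, the sketch below computes the classical (uncorrected) Moran's I and Geary's C from their standard textbook definitions. It does not implement the corrected coefficients the paper proposes. The function names, the four-unit chain of spatial entities, and the binary weight matrix are assumptions made for illustration.

```python
def morans_i(x, w):
    """Classical Moran's I for values x and spatial weight matrix w."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    s0 = sum(sum(row) for row in w)  # total weight
    num = sum(w[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n))
    den = sum(d * d for d in dev)
    return (n / s0) * num / den

def gearys_c(x, w):
    """Classical Geary's C for values x and spatial weight matrix w."""
    n = len(x)
    mean = sum(x) / n
    s0 = sum(sum(row) for row in w)
    num = sum(w[i][j] * (x[i] - x[j]) ** 2 for i in range(n) for j in range(n))
    den = sum((v - mean) ** 2 for v in x)
    return ((n - 1) / (2 * s0)) * num / den

# Four spatial units on a line; neighbours share a border (binary weights).
values = [1.0, 2.0, 3.0, 4.0]
weights = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
i_stat = morans_i(values, weights)  # positive: similar values are adjacent
c_stat = gearys_c(values, weights)  # below 1: similar values are adjacent
```

    For this smooth gradient the two indices agree in sign of evidence (I above its null expectation, C below 1), but they are computed from different cross-products, which is the relationship the paper revisits.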

  17. Solar radiation data - statistical analysis and simulation models

    Energy Technology Data Exchange (ETDEWEB)

    Mustacchi, C; Cena, V; Rocchi, M; Haghigat, F

    1984-01-01

    The activities consisted of collecting meteorological data on magnetic tape for ten European locations (with latitudes ranging from 42° to 56° N), analysing the multi-year sequences, developing mathematical models to generate synthetic sequences having the same statistical properties as the original data sets, and producing one or more Short Reference Years (SRYs) for each location. The meteorological parameters examined were (for all the locations) global and diffuse radiation on a horizontal surface, dry bulb temperature, and sunshine duration. For some of the locations additional parameters were available, namely, global, beam and diffuse radiation on surfaces other than horizontal, wet bulb temperature, wind velocity, cloud type, and cloud cover. The statistical properties investigated were mean, variance, autocorrelation, crosscorrelation with selected parameters, and probability density function. For all the meteorological parameters, various mathematical models were built: linear regression and stochastic models of the AR and the DAR type. In each case, the model with the best statistical behaviour was selected for the production of an SRY for the relevant parameter/location.
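
    As an illustration of generating a synthetic sequence with prescribed statistical properties, the sketch below simulates a first-order autoregressive, AR(1), series with a chosen mean, variance, and lag-1 autocorrelation, then re-estimates those quantities. It is an assumption-laden toy: the report's actual model orders, the DAR variant, and the parameter values are not reproduced, and all names and numbers are hypothetical.

```python
import random

def fit_ar1(series):
    """Estimate mean, variance and lag-1 autocorrelation of a series."""
    n = len(series)
    mean = sum(series) / n
    dev = [v - mean for v in series]
    var = sum(d * d for d in dev) / n
    phi = sum(dev[t] * dev[t - 1] for t in range(1, n)) / (n * var)
    return mean, var, phi

def simulate_ar1(mean, var, phi, n, seed=0):
    """Generate n values of a stationary Gaussian AR(1) process."""
    rng = random.Random(seed)
    # Innovation variance chosen so the process variance equals `var`.
    innov_sd = (var * (1.0 - phi * phi)) ** 0.5
    x = mean + rng.gauss(0.0, var ** 0.5)  # start in the stationary distribution
    out = []
    for _ in range(n):
        x = mean + phi * (x - mean) + rng.gauss(0.0, innov_sd)
        out.append(x)
    return out

# Invented target statistics, e.g. for a daily temperature-like variable.
synthetic = simulate_ar1(mean=12.0, var=16.0, phi=0.6, n=5000, seed=42)
est_mean, est_var, est_phi = fit_ar1(synthetic)
```

    Re-estimating the statistics from the synthetic sequence is a simple way to check that the generator reproduces the targets to within sampling error.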

  18. Statistical distributions as applied to environmental surveillance data

    International Nuclear Information System (INIS)

    Speer, D.R.; Waite, D.A.

    1976-01-01

    Application of normal, lognormal, and Weibull distributions to radiological environmental surveillance data was investigated for approximately 300 nuclide-medium-year-location combinations. The fit of data to distributions was compared through probability plotting (special graph paper provides a visual check) and W test calculations. Results show that 25% of the data fit the normal distribution, 50% fit the lognormal, and 90% fit the Weibull. Demonstration of how to plot each distribution shows that normal and lognormal distributions are comparatively easy to use while the Weibull distribution is complicated and difficult to use. Although current practice is to use normal distribution statistics, the normal distribution fit the fewest data groups considered in this study

  19. Outpatient health care statistics data warehouse--implementation.

    Science.gov (United States)

    Zilli, D

    1999-01-01

    Data warehouse implementation is assumed to be a very knowledge-demanding, expensive and long-lasting process. As such it requires senior management sponsorship, involvement of experts, a big budget and probably years of development time. The presented Outpatient Health Care Statistics Data Warehouse implementation research provides ample evidence against the infallibility of the above statements. New, inexpensive, but powerful technology, which provides an outstanding platform for On-Line Analytical Processing (OLAP), has emerged recently. Presumably, it will be the basis for the estimated future growth of the data warehouse market, both in the medical and in other business fields. Methods and tools for building, maintaining and exploiting data warehouses are also briefly discussed in the paper.

  20. Statistical methods for data analysis in particle physics

    CERN Document Server

    Lista, Luca

    2017-01-01

    This concise set of course-based notes provides the reader with the main concepts and tools needed to perform statistical analyses of experimental data, in particular in the field of high-energy physics (HEP). First, the book provides an introduction to probability theory and basic statistics, mainly intended as a refresher from readers’ advanced undergraduate studies, but also to help them clearly distinguish between the Frequentist and Bayesian approaches and interpretations in subsequent applications. More advanced concepts and applications are gradually introduced, culminating in the chapter on both discoveries and upper limits, as many applications in HEP concern hypothesis testing, where the main goal is often to provide better and better limits so as to eventually be able to distinguish between competing hypotheses, or to rule out some of them altogether. Many worked-out examples will help newcomers to the field and graduate students alike understand the pitfalls involved in applying theoretical co...

  1. Explorations in statistics: the analysis of ratios and normalized data.

    Science.gov (United States)

    Curran-Everett, Douglas

    2013-09-01

    Learning about statistics is a lot like learning about science: the learning is more meaningful if you can actively explore. This ninth installment of Explorations in Statistics explores the analysis of ratios and normalized (or standardized) data. As researchers, we compute a ratio (a numerator divided by a denominator) to compute a proportion for some biological response or to derive some standardized variable. In each situation, we want to control for differences in the denominator when the thing we really care about is the numerator. But there is peril lurking in a ratio: only if the relationship between numerator and denominator is a straight line through the origin will the ratio be meaningful. If not, the ratio will misrepresent the true relationship between numerator and denominator. In contrast, regression techniques, which include analysis of covariance, are versatile: they can accommodate an analysis of the relationship between numerator and denominator when a ratio is useless.
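
    The peril of a ratio can be seen in a small numerical sketch. The data below are invented: the numerator is an exact straight line in the denominator with a nonzero intercept, so the ratio drifts with the denominator even though the underlying relationship never changes, while an ordinary least-squares line recovers it. The helper name is hypothetical.

```python
def ols_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

# Invented data: numerator = 3 + 2 * denominator (line NOT through the origin).
denominator = [1.0, 2.0, 4.0, 8.0]
numerator = [3.0 + 2.0 * d for d in denominator]

# The ratio changes with the denominator although the relationship is fixed.
ratios = [n / d for n, d in zip(numerator, denominator)]

# Regression recovers the constant slope and the nonzero intercept.
intercept, slope = ols_line(denominator, numerator)
```

    Here the ratios fall from 5.0 to 2.375 as the denominator grows, which could be misread as a biological effect; the fitted slope of 2 and intercept of 3 describe the data without that artifact.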

  2. Common misconceptions about data analysis and statistics

    Science.gov (United States)

    Motulsky, Harvey J

    2015-01-01

    Ideally, any experienced investigator with the right tools should be able to reproduce a finding published in a peer-reviewed biomedical science journal. In fact, the reproducibility of a large percentage of published findings has been questioned. Undoubtedly, there are many reasons for this, but one reason may be that investigators fool themselves due to a poor understanding of statistical concepts. In particular, investigators often make these mistakes: (1) P-Hacking. This is when you reanalyze a data set in many different ways, or perhaps reanalyze with additional replicates, until you get the result you want. (2) Overemphasis on P values rather than on the actual size of the observed effect. (3) Overuse of statistical hypothesis testing, and being seduced by the word “significant”. (4) Overreliance on standard errors, which are often misunderstood. PMID:25692012

  3. Statistical analysis of field data for aircraft warranties

    Science.gov (United States)

    Lakey, Mary J.

    Air Force and Navy maintenance data collection systems were researched to determine their scientific applicability to the warranty process. New and unique algorithms were developed to extract failure distributions, which were then used to characterize how selected families of equipment typically fail. Families of similar equipment were identified in terms of function, technology and failure patterns. Statistical analyses and applications such as goodness-of-fit tests, maximum likelihood estimation and derivation of confidence intervals for the probability density function parameters were applied to characterize the distributions and their failure patterns. Statistical and reliability theory, with relevance to equipment design and operational failures, were also determining factors in characterizing the failure patterns of the equipment families. Inferences about the families with relevance to warranty needs were then made.

  4. Statistical mechanics of complex neural systems and high dimensional data

    International Nuclear Information System (INIS)

    Advani, Madhu; Lahiri, Subhaneil; Ganguli, Surya

    2013-01-01

    Recent experimental advances in neuroscience have opened new vistas into the immense complexity of neuronal networks. This proliferation of data challenges us on two parallel fronts. First, how can we form adequate theoretical frameworks for understanding how dynamical network processes cooperate across widely disparate spatiotemporal scales to solve important computational problems? Second, how can we extract meaningful models of neuronal systems from high dimensional datasets? To aid in these challenges, we give a pedagogical review of a collection of ideas and theoretical methods arising at the intersection of statistical physics, computer science and neurobiology. We introduce the interrelated replica and cavity methods, which originated in statistical physics as powerful ways to quantitatively analyze large highly heterogeneous systems of many interacting degrees of freedom. We also introduce the closely related notion of message passing in graphical models, which originated in computer science as a distributed algorithm capable of solving large inference and optimization problems involving many coupled variables. We then show how both the statistical physics and computer science perspectives can be applied in a wide diversity of contexts to problems arising in theoretical neuroscience and data analysis. Along the way we discuss spin glasses, learning theory, illusions of structure in noise, random matrices, dimensionality reduction and compressed sensing, all within the unified formalism of the replica method. Moreover, we review recent conceptual connections between message passing in graphical models, and neural computation and learning. Overall, these ideas illustrate how statistical physics and computer science might provide a lens through which we can uncover emergent computational functions buried deep within the dynamical complexities of neuronal networks. (paper)

  5. Statistical analysis of hydrologic data for Yucca Mountain

    International Nuclear Information System (INIS)

    Rutherford, B.M.; Hall, I.J.; Peters, R.R.; Easterling, R.G.; Klavetter, E.A.

    1992-02-01

    The geologic formations in the unsaturated zone at Yucca Mountain are currently being studied as the host rock for a potential radioactive waste repository. Data from several drill holes have been collected to provide the preliminary information needed for planning site characterization for the Yucca Mountain Project. Hydrologic properties have been measured on the core samples and the variables analyzed here are thought to be important in the determination of groundwater travel times. This report presents a statistical analysis of four hydrologic variables: saturated-matrix hydraulic conductivity, maximum moisture content, suction head, and calculated groundwater travel time. It is important to modelers to have as much information about the distribution of values of these variables as can be obtained from the data. The approach taken in this investigation is to (1) identify regions at the Yucca Mountain site that, according to the data, are distinctly different; (2) estimate the means and variances within these regions; (3) examine the relationships among the variables; and (4) investigate alternative statistical methods that might be applicable when more data become available. The five different functional stratigraphic units at three different locations are compared and grouped into relatively homogeneous regions. Within these regions, the expected values and variances associated with core samples of different sizes are estimated. The results provide a rough estimate of the distribution of hydrologic variables for small core sections within each region

  6. Metaviz: interactive statistical and visual analysis of metagenomic data.

    Science.gov (United States)

    Wagner, Justin; Chelaru, Florin; Kancherla, Jayaram; Paulson, Joseph N; Zhang, Alexander; Felix, Victor; Mahurkar, Anup; Elmqvist, Niklas; Corrada Bravo, Héctor

    2018-04-06

    Large studies profiling microbial communities and their association with healthy or disease phenotypes are now commonplace. Processed data from many of these studies are publicly available but significant effort is required for users to effectively organize, explore and integrate it, limiting the utility of these rich data resources. Effective integrative and interactive visual and statistical tools to analyze many metagenomic samples can greatly increase the value of these data for researchers. We present Metaviz, a tool for interactive exploratory data analysis of annotated microbiome taxonomic community profiles derived from marker gene or whole metagenome shotgun sequencing. Metaviz is uniquely designed to address the challenge of browsing the hierarchical structure of metagenomic data features while rendering visualizations of data values that are dynamically updated in response to user navigation. We use Metaviz to provide the UMD Metagenome Browser web service, allowing users to browse and explore data for more than 7000 microbiomes from published studies. Users can also deploy Metaviz as a web service, or use it to analyze data through the metavizr package to interoperate with state-of-the-art analysis tools available through Bioconductor. Metaviz is free and open source with the code, documentation and tutorials publicly accessible.

  7. Statistical Approaches Accommodating Uncertainty in Modern Genomic Data

    DEFF Research Database (Denmark)

    Skotte, Line

    the contributed method applicable to case-control studies as well as mapping of quantitative traits. The contributed method provides a needed association test for quantitative traits in the presence of uncertain genotypes and it further allows correction for population structure in association tests for disease...... the potential of the technological advances. The first of the four papers included in this thesis describes a new method for association mapping that accommodates uncertain genotypes from low-coverage re-sequencing data. The method allows uncertain genotypes using a score statistic based on the joint likelihood...... of the observed phenotypes and the observed sequencing data. This joint likelihood accounts for the genotype uncertainties via the posterior probabilities of each genotype given the observed sequencing data and the phenotype distributions are modelled using a generalised linear model framework which makes...

  8. Statistical analysis of environmental dose data for Trombay environment

    International Nuclear Information System (INIS)

    Kale, M.S.; Padmanabhan, N.; Rekha Kutty, R.; Sharma, D.N.; Iyengar, T.S.; Iyer, M.R.

    1993-01-01

    The microprocessor-based environmental dose logging system has been functioning at six stations at Trombay for the past couple of years. The site emergency control centre (SECC) at the modular laboratory receives telemetered data every five minutes from the main guard house (South Site), Bhabha point (top of the hill), Cirus reactor, Mod Lab terrace, Hall No. 7 and Training School Hostel. The data collected are being stored in dBase III+ format for easy processing on a PC. Various statistical parameters and distributions of environmental gamma dose are determined from the hourly dose data. On the basis of the reactor operation status, an attempt has been made to separate the natural background and the gamma dose contribution due to the operating research reactors at each of these monitoring stations. Similar investigations are being carried out for the Tarapur environment. (author). 2 refs., 3 tabs., 2 figs

  9. Summary Statistics for Homemade "Play Dough" -- Data Acquired at LLNL

    Energy Technology Data Exchange (ETDEWEB)

    Kallman, J S; Morales, K E; Whipple, R E; Huber, R D; Martz, A; Brown, W D; Smith, J A; Schneberk, D J; Martz, Jr., H E; White, III, W T

    2010-03-11

    Using x-ray computerized tomography (CT), we have characterized the x-ray linear attenuation coefficients (LAC) of a homemade Play Dough™-like material, designated as PDA. Table 1 gives the first-order statistics for each of four CT measurements, estimated with a Gaussian kernel density estimator (KDE) analysis. The mean values of the LAC range from a high of about 2700 LMHU_D at 100 kVp to a low of about 1200 LMHU_D at 300 kVp. The standard deviation of each measurement is around 10% to 15% of the mean. The entropy covers the range from 6.0 to 7.4. Ordinarily, we would model the LAC of the material and compare the modeled values to the measured values. In this case, however, we did not have the detailed chemical composition of the material and therefore did not model the LAC. Using a method recently proposed by Lawrence Livermore National Laboratory (LLNL), we estimate the value of the effective atomic number, Z_eff, to be near 10. LLNL prepared about 50 mL of the homemade 'Play Dough' in a polypropylene vial and firmly compressed it immediately prior to the x-ray measurements. We used the computer program IMGREC to reconstruct the CT images. The values of the key parameters used in the data capture and image reconstruction are given in this report. Additional details may be found in the experimental SOP and a separate document. To characterize the statistical distribution of LAC values in each CT image, we first isolated an 80% central-core segment of volume elements ('voxels') lying completely within the specimen, away from the walls of the polypropylene vial. All of the voxels within this central core, including those comprised of voids and inclusions, are included in the statistics. We then calculated the mean value, standard deviation and entropy for (a) the four image segments and for (b) their digital gradient images. (A digital gradient image of a given image was obtained by taking the absolute value of the difference

  10. Statistical significance of epidemiological data. Seminar: Evaluation of epidemiological studies

    International Nuclear Information System (INIS)

    Weber, K.H.

    1993-01-01

    In stochastic damages, the numbers of events, e.g. the persons who are affected by or have died of cancer, and thus the relative frequencies (incidence or mortality), are binomially distributed random variables. Their statistical fluctuations can be characterized by confidence intervals. For epidemiologic questions, especially for the analysis of stochastic damages in the low dose range, the following issues are of interest: (1) Is a sample (a group of persons) with a definite observed damage frequency part of the whole population? (2) Is an observed frequency difference between two groups of persons random or statistically significant? (3) Is an observed increase or decrease of the frequencies with increasing dose random or statistically significant, and how large is the regression coefficient (= risk coefficient) in this case? These problems can be solved by statistical tests. So-called distribution-free tests and tests which are not bound to the supposition of normal distribution are of particular interest, such as: the χ² independence test (test in contingency tables); the Fisher-Yates test; the trend test according to Cochran; and the rank correlation test given by Spearman. These tests are explained in terms of selected epidemiologic data, e.g. of leukaemia clusters, of the cancer mortality of the Japanese A-bomb survivors especially in the low dose range, as well as on the sample of the cancer mortality in the high background area in Yangjiang (China). (orig.) [de
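
    Of the tests listed, the Fisher-Yates test (commonly called Fisher's exact test) is the easiest to sketch, since it needs only hypergeometric counts. The version below is a minimal two-sided variant for a 2x2 table that sums the probabilities of all tables no more likely than the observed one. The function name and the example counts are invented and are not taken from the seminar.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    r1, r2, c1 = a + b, c + d, a + c  # row totals and first column total
    total = comb(r1 + r2, c1)         # number of tables with these margins

    def weight(k):
        # Unnormalized hypergeometric probability of k cases in row 1.
        return comb(r1, k) * comb(r2, c1 - k)

    observed = weight(a)
    lo, hi = max(0, c1 - r2), min(r1, c1)
    # Integer comparison avoids floating-point ties between tables.
    extreme = sum(weight(k) for k in range(lo, hi + 1) if weight(k) <= observed)
    return extreme / total

# Invented example: 3 of 4 exposed persons affected versus 1 of 4 unexposed.
p_value = fisher_exact_two_sided(3, 1, 1, 3)
```

    With groups this small the frequency difference (75% versus 25%) is far from significant, which is exactly the kind of question (2) above that the test is meant to settle.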

  11. Inferential Statistics from Black Hispanic Breast Cancer Survival Data

    Directory of Open Access Journals (Sweden)

    Hafiz M. R. Khan

    2014-01-01

    In this paper we test the statistical probability models for breast cancer survival data for race and ethnicity. Data was collected from breast cancer patients diagnosed in the United States during the years 1973–2009. We selected a stratified random sample of Black Hispanic female patients from the Surveillance Epidemiology and End Results (SEER) database to derive the statistical probability models. We used three common model building criteria, the Akaike Information Criterion (AIC), the Bayesian Information Criterion (BIC), and the Deviance Information Criterion (DIC), to measure goodness of fit, and it was found that the survival data of Black Hispanic female patients better fit the exponentiated exponential probability model. A novel Bayesian method was used to derive the posterior density function for the model parameters as well as to derive the predictive inference for future response. We specifically focused on Black Hispanic race. The Markov Chain Monte Carlo (MCMC) method was used for obtaining the summary results of posterior parameters. Additionally, we reported predictive intervals for future survival times. These findings would be of great significance in treatment planning and healthcare resource allocation.
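
    The model comparison described above can be sketched with the standard definitions AIC = 2k - 2 ln L and BIC = k ln n - 2 ln L, where k is the number of parameters, n the sample size and L the maximized likelihood. The log-likelihood values, the sample size and the model list below are hypothetical placeholders, not results from the paper; DIC is omitted because it requires posterior (MCMC) output.

```python
from math import log

def aic(log_likelihood, n_params):
    """Akaike Information Criterion: lower is better."""
    return 2 * n_params - 2 * log_likelihood

def bic(log_likelihood, n_params, n_obs):
    """Bayesian Information Criterion: lower is better."""
    return n_params * log(n_obs) - 2 * log_likelihood

# Hypothetical fits: model name -> (maximized log-likelihood, parameter count)
candidates = {
    "exponential": (-512.4, 1),
    "exponentiated exponential": (-498.7, 2),
}
n_obs = 300  # hypothetical sample size

for name, (ll, k) in candidates.items():
    print(f"{name}: AIC={aic(ll, k):.1f}  BIC={bic(ll, k, n_obs):.1f}")

# The preferred model is the one with the smallest criterion value.
best = min(candidates, key=lambda m: aic(*candidates[m]))
print("preferred by AIC:", best)
```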

  12. Application of statistical dynamical turbulence closures to data assimilation

    International Nuclear Information System (INIS)

    O'Kane, Terence J; Frederiksen, Jorgen S

    2010-01-01

    We describe the development of an accurate yet computationally tractable statistical dynamical closure theory for general inhomogeneous turbulent flows, coined the quasi-diagonal direct interaction approximation closure (QDIA), and its application to problems in data assimilation. The QDIA provides prognostic equations for evolving mean fields, covariances and higher-order non-Gaussian terms, all of which are also required in the formulation of data assimilation schemes for nonlinear geophysical flows. The QDIA is a generalization of the class of direct interaction approximation theories, initially developed by Kraichnan (1959 J. Fluid Mech. 5 497) for isotropic turbulence, to fully inhomogeneous flows and has been further generalized to allow for both inhomogeneous and non-Gaussian initial conditions and long integrations. A regularization procedure or empirical vertex renormalization that ensures correct inertial range spectra is also described. The aim of this paper is to provide a coherent mathematical description of the QDIA turbulence closure and closure-based data assimilation scheme we have labeled the statistical dynamical Kalman filter. The mathematical formalism presented has been synthesized from recent works of the authors with some additional material and is presented in sufficient detail that the paper is of a pedagogical nature.

  13. A statistically self-consistent type Ia supernova data analysis

    International Nuclear Information System (INIS)

    Lago, B.L.; Calvao, M.O.; Joras, S.E.; Reis, R.R.R.; Waga, I.; Giostri, R.

    2011-01-01

    The type Ia supernovae are one of the main cosmological probes nowadays and are used as standardized candles in distance measurements. The standardization processes, among which SALT2 and MLCS2k2 are the most used ones, are based on empirical relations and leave room for a residual dispersion in the light curves of the supernovae. This dispersion is introduced in the chi squared used to fit the parameters of the model in the expression for the variance of the data, as an attempt to quantify our ignorance in modeling the supernovae properly. The procedure used to assign a value to this dispersion is statistically inconsistent and excludes the possibility of comparing different cosmological models. In addition, the SALT2 light curve fitter introduces parameters on the model for the variance that are also used in the model for the data. In the chi squared statistics context the minimization of such a quantity yields, in the best case scenario, a bias. An iterative method has been developed in order to perform the minimization of this chi squared but it is not well grounded, although it is used by several groups. We propose an analysis of the type Ia supernovae data that is based on the likelihood itself and makes it possible to address both inconsistencies mentioned above in a straightforward way. (author)

  14. The United Nations recommendations and data efforts: international migration statistics.

    Science.gov (United States)

    Simmons, A B

    1987-01-01

    This article reviews the UN's efforts to improve international migration statistics. The review addresses the challenges faced by the UN, the direction in which this effort is going, gaps in the current approach, and priorities for future action. The content of the UN recommendations has changed in the past and seems to be moving toward further changes. At each stage, the direction of change corresponds broadly to earlier shifts in the overall context of world social-economic affairs and related transformations in international travel and migration patterns. Early (1953) objectives were vaguely stated in terms of social, economic, and demographic impacts of long term settlement. 1976 recommendations continued the focus on long term resettlement and, at the same time, gave more attention to at least 1 kind of short term (work-related) movement. Most recent recommendations have given more attention to other classes of short term travellers, such as refugees and contract workers. Recommendations on the measures and data sources have changed over time, also. The 1953 recommendations were limited to flow data from international border statistics. 1976 recommendations drew attention to stock data and the use of civil registration data to supplement border crossing data. Recent UN reflections recognize that the volume of border crossings has now reached the point where many countries simply refuse to gather data on all travellers, choosing instead to make estimates. It is implied that either sample surveys at border points and/or visas and entry permits may be the best way of counting various specific kinds of migrants. Future recommendations corresponding to contemporary and emerging concerns will require that the guidelines be restructured: 1) to give more explicit attention in international migration statistics to citizenship and access to political and welfare benefits; 2) to distinguish more carefully various sub-classes of movers; 3) to expand objectives of data

  15. Information gathering for the Transportation Statistics Data Bank

    International Nuclear Information System (INIS)

    Shappert, L.B.; Mason, P.J.

    1981-10-01

    The Transportation Statistics Data Bank (TSDB) was developed in 1974 to collect information on the transport of Department of Energy (DOE) materials. This computer program may be used to provide the framework for collecting more detailed information on DOE shipments of radioactive materials. This report describes the type of information that is needed in this area and concludes that the existing system could be readily modified to collect and process it. The additional needed information, available from bills of lading and similar documents, could be gathered from DOE field offices and transferred in a standard format to the TSDB system. Costs of the system are also discussed briefly.

  16. Evaluation of the Wishart test statistics for polarimetric SAR data

    DEFF Research Database (Denmark)

    Skriver, Henning; Nielsen, Allan Aasbjerg; Conradsen, Knut

    2003-01-01

    A test statistic for equality of two covariance matrices following the complex Wishart distribution has previously been used in new algorithms for change detection, edge detection and segmentation in polarimetric SAR images. Previously, the results for change detection and edge detection have been...... quantitatively evaluated. This paper deals with the evaluation of segmentation. A segmentation performance measure originally developed for single-channel SAR images has been extended to polarimetric SAR images, and used to evaluate segmentation for a merge-using-moment algorithm for polarimetric SAR data....

  17. JAWS data collection, analysis highlights, and microburst statistics

    Science.gov (United States)

    Mccarthy, J.; Roberts, R.; Schreiber, W.

    1983-01-01

    Organization, equipment, and the current status of the Joint Airport Weather Studies project initiated in relation to the microburst phenomenon are summarized. Some data collection techniques and preliminary statistics on microburst events recorded by Doppler radar are discussed as well. Radar studies show that microbursts occur much more often than expected, with the majority of the events being potentially dangerous to landing or departing aircraft. Seventy events were registered, with the differential velocities ranging from 10 to 48 m/s; headwind/tailwind velocity differentials over 20 m/s are considered seriously hazardous. It is noted that a correlation is yet to be established between the velocity differential and incoherent radar reflectivity.

  18. Isocount scintillation scanner with preset statistical data reliability

    International Nuclear Information System (INIS)

    Ikebe, J.; Yamaguchi, H.; Nawa, O.A.

    1975-01-01

    A scintillation detector scans an object such as a live body along horizontal straight scanning lines in such a manner that the scintillation detector is stopped at a scanning point during the time interval T required for counting a predetermined number N of pulses. The rate R_N = N/T is then calculated, and output signal pulses whose number represents the rate R_N, or a corresponding output signal, are used as the recording signal for forming the scintigram. In contrast to the usual scanner, the isocount scanner scans an object stepwise in order to gather data with statistically uniform reliability.
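
    The isocount idea can be sketched numerically. Assuming Poisson counting statistics (an assumption of this sketch, not stated in the record), a count of N pulses has a relative standard deviation of 1/sqrt(N), so fixing N at every scanning point fixes the relative uncertainty of each rate R_N = N/T regardless of how long the detector dwells. The function names and dwell times below are hypothetical.

```python
from math import sqrt

def isocount_rates(preset_counts, dwell_times):
    """Count rate R_N = N / T at each scanning point (counts per second)."""
    return [preset_counts / t for t in dwell_times]

def relative_error(preset_counts):
    """Relative standard deviation of a Poisson count of N pulses."""
    return 1 / sqrt(preset_counts)

# Hypothetical scan: the detector waits for N = 10000 pulses at each point.
N = 10_000
dwell_times = [2.0, 5.0, 12.5]  # seconds needed at three points

for t, rate in zip(dwell_times, isocount_rates(N, dwell_times)):
    print(f"T = {t:5.1f} s  rate = {rate:7.1f} counts/s  "
          f"relative error = {relative_error(N):.1%}")
```

    A conventional fixed-dwell scanner would instead collect fewer counts, and so carry a larger relative error, at low-activity points.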

  19. Data analysis for radiological characterisation: Geostatistical and statistical complementarity

    International Nuclear Information System (INIS)

    Desnoyers, Yvon; Dubot, Didier

    2012-01-01

    Radiological characterisation may cover a large range of evaluation objectives during a decommissioning and dismantling (D and D) project: removal of doubt, delineation of contaminated materials, monitoring of the decontamination work and final survey. At each stage, collecting relevant data to be able to draw the conclusions needed is quite a big challenge. In particular two radiological characterisation stages require an advanced sampling process and data analysis, namely the initial categorization and optimisation of the materials to be removed and the final survey to demonstrate compliance with clearance levels. On the one hand the latter is widely used and well developed in national guides and norms, using random sampling designs and statistical data analysis. On the other hand a more complex evaluation methodology has to be implemented for the initial radiological characterisation, both for sampling design and for data analysis. The geostatistical framework is an efficient way to satisfy the radiological characterisation requirements providing a sound decision-making approach for the decommissioning and dismantling of nuclear premises. The relevance of the geostatistical methodology relies on the presence of a spatial continuity for radiological contamination. Thus geo-statistics provides reliable methods for activity estimation, uncertainty quantification and risk analysis, leading to a sound classification of radiological waste (surfaces and volumes). This way, the radiological characterization of contaminated premises can be divided into three steps. First, the most exhaustive facility analysis provides historical and qualitative information. Then, a systematic (exhaustive or not) surface survey of the contamination is implemented on a regular grid. Finally, in order to assess activity levels and contamination depths, destructive samples are collected at several locations within the premises (based on the surface survey results) and analysed. 
Combined with

  20. Bayesian Sensitivity Analysis of Statistical Models with Missing Data.

    Science.gov (United States)

    Zhu, Hongtu; Ibrahim, Joseph G; Tang, Niansheng

    2014-04-01

    Methods for handling missing data depend strongly on the mechanism that generated the missing values, such as missing completely at random (MCAR) or missing at random (MAR), as well as other distributional and modeling assumptions at various stages. It is well known that the resulting estimates and tests may be sensitive to these assumptions as well as to outlying observations. In this paper, we introduce various perturbations to modeling assumptions and individual observations, and then develop a formal sensitivity analysis to assess these perturbations in the Bayesian analysis of statistical models with missing data. We develop a geometric framework, called the Bayesian perturbation manifold, to characterize the intrinsic structure of these perturbations. We propose several intrinsic influence measures to perform sensitivity analysis and quantify the effect of various perturbations to statistical models. We use the proposed sensitivity analysis procedure to systematically investigate the tenability of the non-ignorable missing at random (NMAR) assumption. Simulation studies are conducted to evaluate our methods, and a dataset is analyzed to illustrate the use of our diagnostic measures.

  1. Right-sizing statistical models for longitudinal data.

    Science.gov (United States)

    Wood, Phillip K; Steinley, Douglas; Jackson, Kristina M

    2015-12-01

    Arguments are proposed that researchers using longitudinal data should consider more and less complex statistical model alternatives to their initially chosen techniques in an effort to "right-size" the model to the data at hand. Such model comparisons may alert researchers who use poorly fitting, overly parsimonious models to more complex, better-fitting alternatives and, alternatively, may identify more parsimonious alternatives to overly complex (and perhaps empirically underidentified and/or less powerful) statistical models. A general framework is proposed for considering (often nested) relationships between a variety of psychometric and growth curve models. A 3-step approach is proposed in which models are evaluated based on the number and patterning of variance components prior to selection of better-fitting growth models that explain both mean and variation-covariation patterns. The orthogonal free curve slope intercept (FCSI) growth model is considered a general model that includes, as special cases, many models, including the factor mean (FM) model (McArdle & Epstein, 1987), McDonald's (1967) linearly constrained factor model, hierarchical linear models (HLMs), repeated-measures multivariate analysis of variance (MANOVA), and the linear slope intercept (linearSI) growth model. The FCSI model, in turn, is nested within the Tuckerized factor model. The approach is illustrated by comparing alternative models in a longitudinal study of children's vocabulary and by comparing several candidate parametric growth and chronometric models in a Monte Carlo study. (c) 2015 APA, all rights reserved.

  2. Assessing Research Data Deposits and Usage Statistics within IDEALS

    Directory of Open Access Journals (Sweden)

    Christie A. Wiley

    2017-12-01

    Objectives: This study follows up on previous work that began examining data deposited in an institutional repository. The work here extends the earlier study by answering the following lines of research questions: (1) What is the file composition of datasets ingested into the University of Illinois at Urbana-Champaign (UIUC) campus repository? Are datasets more likely to be single-file or multiple-file items? (2) What is the usage data associated with these datasets? Which items are most popular? Methods: The dataset records collected in this study were identified by filtering item types categorized as “data” or “dataset” using the advanced search function in IDEALS. Returned search results were collected in an Excel spreadsheet to include data such as the Handle identifier, date ingested, file formats, composition code, and the download count from the item’s statistics report. The Handle identifier represents the dataset record’s persistent identifier. Composition represents codes that categorize items as single or multiple file deposits. Date available represents the date the dataset record was published in the campus repository. Download statistics were collected via a website link for each dataset record and indicate the number of times the dataset record has been downloaded. Once the data was collected, it was used to evaluate datasets deposited into IDEALS. Results: A total of 522 datasets were identified for analysis covering the period between January 2007 and August 2016. This study revealed two influxes occurring during the period of 2008-2009 and in 2014. During the first timeframe a large number of PDFs were deposited by the Illinois Department of Agriculture, whereas Microsoft Excel files were deposited in 2014 by the Rare Books and Manuscript Library. Single-file datasets clearly dominate the deposits in the campus repository. The total download count for all datasets was 139,663 and the average downloads per month per

  3. Securing co-operation from persons supplying statistical data

    Science.gov (United States)

    Aubenque, M. J.; Blaikley, R. M.; Harris, F. Fraser; Lal, R. B.; Neurdenburg, M. G.; Hernández, R. de Shelly

    1954-01-01

    Securing the co-operation of persons supplying information required for medical statistics is essentially a problem in human relations, and an understanding of the motivations, attitudes, and behaviour of the respondents is necessary. Before any new statistical survey is undertaken, it is suggested by Aubenque and Harris that a preliminary review be made so that the maximum use is made of existing information. Care should also be taken not to burden respondents with an overloaded questionnaire. Aubenque and Harris recommend simplified reporting. Complete population coverage is not necessary. Neurdenburg suggests that the co-operation and support of such organizations as medical associations and social security boards are important and that propaganda should be directed specifically to the groups whose co-operation is sought. Informal personal contacts are valuable and desirable, according to Blaikley, but may have adverse effects if the right kind of approach is not made. Financial payments as an incentive in securing co-operation are opposed by Neurdenburg, who proposes that only postage-free envelopes or similar small favours be granted. Blaikley and Harris, on the other hand, express the view that financial incentives may do much to gain the support of those required to furnish data; there are, however, other incentives, and full use should be made of the natural inclinations of respondents. Compulsion may be necessary in certain instances, but administrative rather than statutory measures should be adopted. Penalties, according to Aubenque, should be inflicted only when justified by imperative health requirements. The results of surveys should be made available as soon as possible to those who co-operated, and Aubenque and Harris point out that they should also be of practical value to the suppliers of the information. Greater co-operation can be secured from medical persons who have an understanding of the statistical principles involved; Aubenque and

  4. Multivariate statistical analysis of atom probe tomography data

    International Nuclear Information System (INIS)

    Parish, Chad M.; Miller, Michael K.

    2010-01-01

    The application of spectrum imaging multivariate statistical analysis methods, specifically principal component analysis (PCA), to atom probe tomography (APT) data has been investigated. The mathematical method of analysis is described and the results for two example datasets are analyzed and presented. The first dataset is from the analysis of a PM 2000 Fe-Cr-Al-Ti steel containing two different ultrafine precipitate populations. PCA properly describes the matrix and precipitate phases in a simple and intuitive manner. A second APT example is from the analysis of an irradiated reactor pressure vessel steel. Fine, nm-scale Cu-enriched precipitates having a core-shell structure were identified and qualitatively described by PCA. Advantages, disadvantages, and future prospects for implementing these data analysis methodologies for APT datasets, particularly with regard to quantitative analysis, are also discussed.

  5. Multiple point statistical simulation using uncertain (soft) conditional data

    Science.gov (United States)

    Hansen, Thomas Mejer; Vu, Le Thanh; Mosegaard, Klaus; Cordua, Knud Skou

    2018-05-01

    Geostatistical simulation methods have been used to quantify spatial variability of reservoir models since the 80s. In the last two decades, state of the art simulation methods have changed from being based on covariance-based 2-point statistics to multiple-point statistics (MPS), which allow simulation of more realistic Earth-structures. In addition, increasing amounts of geo-information (geophysical, geological, etc.) from multiple sources are being collected. This poses the problem of integrating these different sources of information, such that decisions related to reservoir models can be taken on as informed a basis as possible. In principle, though difficult in practice, this can be achieved using computationally expensive Monte Carlo methods. Here we investigate the use of sequential simulation based MPS simulation methods conditional to uncertain (soft) data, as a computationally efficient alternative. First, it is demonstrated that current implementations of sequential simulation based on MPS (e.g. SNESIM, ENESIM and Direct Sampling) do not account properly for uncertain conditional information, due to a combination of using only co-located information and a random simulation path. Then, we suggest two approaches that better account for the available uncertain information. The first makes use of a preferential simulation path, where more informed model parameters are visited preferentially to less informed ones. The second approach involves using non co-located uncertain information. For different types of available data, these approaches are demonstrated to produce simulation results similar to those obtained by the general Monte Carlo based approach. These methods allow MPS simulation to condition properly to uncertain (soft) data, and hence provide a computationally attractive approach for integration of information about a reservoir model.

  6. Statistical Analysis of 30 Years Rainfall Data: A Case Study

    Science.gov (United States)

    Arvind, G.; Ashok Kumar, P.; Girish Karthi, S.; Suribabu, C. R.

    2017-07-01

    Rainfall is a prime input for various engineering designs such as hydraulic structures, bridges and culverts, canals, storm water sewers and road drainage systems. A detailed statistical analysis of each region is essential to estimate the relevant input values for the design and analysis of engineering structures and also for crop planning. A rain gauge station located in Trichy district, where agriculture is the prime occupation, is selected for statistical analysis. The daily rainfall data for a period of 30 years are used to understand normal rainfall, deficit rainfall, excess rainfall and seasonal rainfall of the selected circle headquarters. Further, various available plotting position formulae are used to evaluate the return period of monthly, seasonal and annual rainfall. This analysis will provide useful information for water resources planners, farmers and urban engineers to assess the availability of water and create storage accordingly. The mean, standard deviation and coefficient of variation of monthly and annual rainfall were calculated to check the rainfall variability. From the calculated results, the rainfall pattern is found to be erratic. The best-fit probability distribution was identified based on the minimum deviation between actual and estimated values. The scientific results and the analysis paved the way to determine the proper onset and withdrawal of the monsoon, results which were used for land preparation and sowing.
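
    The variability measures and return-period estimate mentioned above can be sketched as follows. The abstract does not say which plotting position formulae were used; this sketch assumes the Weibull formula, in which the value ranked m-th largest out of n has exceedance probability m/(n + 1) and return period (n + 1)/m. The annual totals are hypothetical values, not data from the study.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """Sample standard deviation as a percentage of the mean."""
    return stdev(values) / mean(values) * 100

def weibull_return_periods(values):
    """Pair each value with its Weibull return period T = (n + 1) / rank.

    Rank 1 is the largest value, so large totals get long return periods.
    """
    n = len(values)
    ranked = sorted(values, reverse=True)
    return [(v, (n + 1) / rank) for rank, v in enumerate(ranked, start=1)]

# Hypothetical annual rainfall totals in mm.
annual_rainfall = [812, 1045, 640, 978, 1190, 723, 905]

print(f"mean = {mean(annual_rainfall):.1f} mm")
print(f"standard deviation = {stdev(annual_rainfall):.1f} mm")
print(f"CV = {coefficient_of_variation(annual_rainfall):.1f} %")
for total, period in weibull_return_periods(annual_rainfall):
    print(f"{total:5d} mm  return period = {period:.2f} years")
```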

  7. A statistical model for interpreting computerized dynamic posturography data

    Science.gov (United States)

    Feiveson, Alan H.; Metter, E. Jeffrey; Paloski, William H.

    2002-01-01

    Computerized dynamic posturography (CDP) is widely used for assessment of altered balance control. CDP trials are quantified using the equilibrium score (ES), which ranges from zero to 100, as a decreasing function of peak sway angle. The problem of how best to model and analyze ESs from a controlled study is considered. The ES often exhibits a skewed distribution in repeated trials, which can lead to incorrect inference when applying standard regression or analysis of variance models. Furthermore, CDP trials are terminated when a patient loses balance. In these situations, the ES is not observable, but is assigned the lowest possible score--zero. As a result, the response variable has a mixed discrete-continuous distribution, further compromising inference obtained by standard statistical methods. Here, we develop alternative methodology for analyzing ESs under a stochastic model extending the ES to a continuous latent random variable that always exists, but is unobserved in the event of a fall. Loss of balance occurs conditionally, with probability depending on the realized latent ES. After fitting the model by a form of quasi-maximum-likelihood, one may perform statistical inference to assess the effects of explanatory variables. An example is provided, using data from the NIH/NIA Baltimore Longitudinal Study on Aging.

  8. Analysis of filament statistics in fast camera data on MAST

    Science.gov (United States)

    Farley, Tom; Militello, Fulvio; Walkden, Nick; Harrison, James; Silburn, Scott; Bradley, James

    2017-10-01

    Coherent filamentary structures have been shown to play a dominant role in turbulent cross-field particle transport [D'Ippolito 2011]. An improved understanding of filaments is vital in order to control scrape off layer (SOL) density profiles and thus control first wall erosion, impurity flushing and coupling of radio frequency heating in future devices. The Elzar code [T. Farley, 2017 in prep.] is applied to MAST data. The code uses information about the magnetic equilibrium to calculate the intensity of light emission along field lines as seen in the camera images, as a function of the field lines' radial and toroidal locations at the mid-plane. In this way a `pseudo-inversion' of the intensity profiles in the camera images is achieved from which filaments can be identified and measured. In this work, a statistical analysis of the intensity fluctuations along field lines in the camera field of view is performed using techniques similar to those typically applied in standard Langmuir probe analyses. These filament statistics are interpreted in terms of the theoretical ergodic framework presented by F. Militello & J.T. Omotani, 2016, in order to better understand how time averaged filament dynamics produce the more familiar SOL density profiles. This work has received funding from the RCUK Energy programme (Grant Number EP/P012450/1), from Euratom (Grant Agreement No. 633053) and from the EUROfusion consortium.

  9. Statistics

    International Nuclear Information System (INIS)

    2005-01-01

    For the years 2004 and 2005 the figures shown in the tables of Energy Review are partly preliminary. The annual statistics published in Energy Review are presented in more detail in a publication called Energy Statistics that comes out yearly. Energy Statistics also includes historical time-series over a longer period of time (see e.g. Energy Statistics, Statistics Finland, Helsinki 2004.) The applied energy units and conversion coefficients are shown in the back cover of the Review. Explanatory notes to the statistical tables can be found after tables and figures. The figures present: Changes in GDP, energy consumption and electricity consumption, Carbon dioxide emissions from fossil fuel use, Coal consumption, Consumption of natural gas, Peat consumption, Domestic oil deliveries, Import prices of oil, Consumer prices of principal oil products, Fuel prices in heat production, Fuel prices in electricity production, Price of electricity by type of consumer, Average monthly spot prices at the Nord pool power exchange, Total energy consumption by source and CO2 emissions, Supplies and total consumption of electricity GWh, Energy imports by country of origin in January-June 2003, Energy exports by recipient country in January-June 2003, Consumer prices of liquid fuels, Consumer prices of hard coal, natural gas and indigenous fuels, Price of natural gas by type of consumer, Price of electricity by type of consumer, Price of district heating by type of consumer, Excise taxes, value added taxes and fiscal charges and fees included in consumer prices of some energy sources and Energy taxes, precautionary stock fees and oil pollution fees.

  10. Topics in statistical data analysis for high-energy physics

    International Nuclear Information System (INIS)

    Cowan, G.

    2011-01-01

    These lectures concern two topics that are becoming increasingly important in the analysis of high-energy physics data: Bayesian statistics and multivariate methods. In the Bayesian approach, we extend the interpretation of probability not only to cover the frequency of repeatable outcomes but also to include a degree of belief. In this way we are able to associate probability with a hypothesis and thus to answer directly questions that cannot be addressed easily with traditional frequentist methods. In multivariate analysis, we try to exploit as much information as possible from the characteristics that we measure for each event to distinguish between event types. In particular we will look at a method that has gained popularity in high-energy physics in recent years: the boosted decision tree. Finally, we give a brief sketch of how multivariate methods may be applied in a search for a new signal process. (author)

  11. A spatial scan statistic for compound Poisson data.

    Science.gov (United States)

    Rosychuk, Rhonda J; Chang, Hsing-Ming

    2013-12-20

    The topic of spatial cluster detection gained attention in statistics during the late 1980s and early 1990s. Effort has been devoted to the development of methods for detecting spatial clustering of cases and events in the biological sciences, astronomy and epidemiology. More recently, research has examined detecting clusters of correlated count data associated with health conditions of individuals. Such a method allows researchers to examine spatial relationships of disease-related events rather than just incident or prevalent cases. We introduce a spatial scan test that identifies clusters of events in a study region. Because an individual case may have multiple (repeated) events, we base the test on a compound Poisson model. We illustrate our method for cluster detection on emergency department visits, where individuals may make multiple disease-related visits. Copyright © 2013 John Wiley & Sons, Ltd.

  12. The Need for the Dissemination of Statistical Data and Information

    Directory of Open Access Journals (Sweden)

    Anna-Alexandra Frunza

    2016-01-01

    There is an emphasis nowadays on knowledge, so access to information has increased in relevance in modern economies, which have developed their competitive advantage through their dynamic response to market changes. The effort for transparency has increased tremendously within the last decades, influenced also by the weight that digital support has provided. The need for the dissemination of statistical data and information has met new challenges in terms of aggregating the practices that both private and public organizations use in order to ensure optimum access for end users. The article stresses some key questions that can be introduced to ease the process of collection and presentation of the results subject to dissemination.

  13. Statistical Analysis of Data with Non-Detectable Values

    Energy Technology Data Exchange (ETDEWEB)

    Frome, E.L.

    2004-08-26

    Environmental exposure measurements are, in general, positive and may be subject to left censoring, i.e. the measured value is less than a "limit of detection". In occupational monitoring, strategies for assessing workplace exposures typically focus on the mean exposure level or the probability that any measurement exceeds a limit. A basic problem of interest in environmental risk assessment is to determine if the mean concentration of an analyte is less than a prescribed action level. Parametric methods, used to determine acceptable levels of exposure, are often based on a two-parameter lognormal distribution. The mean exposure level and/or an upper percentile (e.g. the 95th percentile) are used to characterize exposure levels, and upper confidence limits are needed to describe the uncertainty in these estimates. In certain situations it is of interest to estimate the probability of observing a future (or "missed") value of a lognormal variable. Statistical methods for random samples (without non-detects) from the lognormal distribution are well known for each of these situations. In this report, methods for estimating these quantities based on the maximum likelihood method for randomly left censored lognormal data are described and graphical methods are used to evaluate the lognormal assumption. If the lognormal model is in doubt and an alternative distribution for the exposure profile of a similar exposure group is not available, then nonparametric methods for left censored data are used. The mean exposure level, along with the upper confidence limit, is obtained using the product limit estimate, and the upper confidence limit on the 95th percentile (i.e. the upper tolerance limit) is obtained using a nonparametric approach. All of these methods are well known but computational complexity has limited their use in routine data analysis with left censored data. The recent development of the R environment for statistical
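
The maximum likelihood idea for left-censored lognormal data can be sketched in a few lines. This is a minimal illustration, not the report's R code: each detected value contributes the lognormal density, and each non-detect contributes the probability of falling below its limit of detection. The function names and the simple coordinate search used as optimizer are assumptions made for the example.

```python
import math
import statistics
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def censored_lognormal_loglik(mu, sigma, detects, lods):
    """Log-likelihood of (mu, sigma) on the log scale.

    detects -- measured values above their limit of detection
    lods    -- limits of detection, one entry per non-detect
    """
    if sigma <= 0:
        return -math.inf
    ll = 0.0
    for x in detects:  # density term for each detected value
        z = (math.log(x) - mu) / sigma
        ll += -0.5 * z * z - 0.5 * math.log(2 * math.pi) - math.log(sigma) - math.log(x)
    for lod in lods:  # P(X < LOD) for each non-detect
        z = (math.log(lod) - mu) / sigma
        ll += math.log(max(_STD_NORMAL.cdf(z), 1e-300))
    return ll


def fit_censored_lognormal(detects, lods, max_iter=500):
    """Maximize the log-likelihood with a basic coordinate search (illustrative only)."""
    logs = [math.log(x) for x in detects]
    mu = statistics.fmean(logs)
    sigma = max(statistics.pstdev(logs), 1e-3)
    best = censored_lognormal_loglik(mu, sigma, detects, lods)
    step = 0.5
    for _ in range(max_iter):
        improved = False
        for d_mu, d_sigma in ((step, 0), (-step, 0), (0, step), (0, -step)):
            cand = censored_lognormal_loglik(mu + d_mu, sigma + d_sigma, detects, lods)
            if cand > best:
                mu, sigma, best = mu + d_mu, sigma + d_sigma, cand
                improved = True
        if not improved:
            step /= 2  # shrink the search step once no axis move helps
            if step < 1e-8:
                break
    return mu, sigma
```

Given the fitted pair, the lognormal mean exposure is exp(mu + sigma**2 / 2). With no non-detects the fit reduces to the mean and standard deviation of the logged data.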

  14. Statistics

    International Nuclear Information System (INIS)

    2001-01-01

    For the year 2000, some of the figures shown in the tables of the Energy Review are preliminary or estimated. The annual statistics of the Energy Review appear in more detail in the publication Energiatilastot - Energy Statistics issued annually, which also includes historical time series over a longer period (see e.g. Energiatilastot 1999, Statistics Finland, Helsinki 2000, ISSN 0785-3165). The inside of the Review's back cover shows the energy units and the conversion coefficients used for them. Explanatory notes to the statistical tables can be found after tables and figures. The figures present: Changes in the volume of GNP and energy consumption, Changes in the volume of GNP and electricity, Coal consumption, Natural gas consumption, Peat consumption, Domestic oil deliveries, Import prices of oil, Consumer prices of principal oil products, Fuel prices for heat production, Fuel prices for electricity production, Carbon dioxide emissions from the use of fossil fuels, Total energy consumption by source and CO2 emissions, Electricity supply, Energy imports by country of origin in 2000, Energy exports by recipient country in 2000, Consumer prices of liquid fuels, Consumer prices of hard coal, natural gas and indigenous fuels, Average electricity price by type of consumer, Price of district heating by type of consumer, Excise taxes, value added taxes and fiscal charges and fees included in consumer prices of some energy sources and Energy taxes and precautionary stock fees on oil products

  15. Statistics

    International Nuclear Information System (INIS)

    2000-01-01

    For the years 1999 and 2000, some of the figures shown in the tables of the Energy Review are preliminary or estimated. The annual statistics of the Energy Review appear in more detail in the publication Energiatilastot - Energy Statistics issued annually, which also includes historical time series over a longer period (see e.g., Energiatilastot 1998, Statistics Finland, Helsinki 1999, ISSN 0785-3165). The inside of the Review's back cover shows the energy units and the conversion coefficients used for them. Explanatory notes to the statistical tables can be found after tables and figures. The figures present: Changes in the volume of GNP and energy consumption, Changes in the volume of GNP and electricity, Coal consumption, Natural gas consumption, Peat consumption, Domestic oil deliveries, Import prices of oil, Consumer prices of principal oil products, Fuel prices for heat production, Fuel prices for electricity production, Carbon dioxide emissions, Total energy consumption by source and CO2 emissions, Electricity supply, Energy imports by country of origin in January-March 2000, Energy exports by recipient country in January-March 2000, Consumer prices of liquid fuels, Consumer prices of hard coal, natural gas and indigenous fuels, Average electricity price by type of consumer, Price of district heating by type of consumer, Excise taxes, value added taxes and fiscal charges and fees included in consumer prices of some energy sources and Energy taxes and precautionary stock fees on oil products

  16. Statistics

    International Nuclear Information System (INIS)

    1999-01-01

    For the years 1998 and 1999, some of the figures shown in the tables of the Energy Review are preliminary or estimated. The annual statistics of the Energy Review appear in more detail in the publication Energiatilastot - Energy Statistics issued annually, which also includes historical time series over a longer period (see e.g. Energiatilastot 1998, Statistics Finland, Helsinki 1999, ISSN 0785-3165). The inside of the Review's back cover shows the energy units and the conversion coefficients used for them. Explanatory notes to the statistical tables can be found after tables and figures. The figures present: Changes in the volume of GNP and energy consumption, Changes in the volume of GNP and electricity, Coal consumption, Natural gas consumption, Peat consumption, Domestic oil deliveries, Import prices of oil, Consumer prices of principal oil products, Fuel prices for heat production, Fuel prices for electricity production, Carbon dioxide emissions, Total energy consumption by source and CO2 emissions, Electricity supply, Energy imports by country of origin in January-June 1999, Energy exports by recipient country in January-June 1999, Consumer prices of liquid fuels, Consumer prices of hard coal, natural gas and indigenous fuels, Average electricity price by type of consumer, Price of district heating by type of consumer, Excise taxes, value added taxes and fiscal charges and fees included in consumer prices of some energy sources and Energy taxes and precautionary stock fees on oil products

  17. Implementation of statistical analysis methods for medical physics data

    International Nuclear Information System (INIS)

    Teixeira, Marilia S.; Pinto, Nivia G.P.; Barroso, Regina C.; Oliveira, Luis F.

    2009-01-01

    The objective of biomedical research with different types of radiation is to contribute to the understanding of the basic physics and biochemistry of biological systems, to disease diagnosis and to the development of therapeutic techniques. The main benefits are: the cure of tumors through therapy, the early detection of diseases through diagnosis, the use as a prophylactic means for blood transfusion, etc. Therefore, a better understanding of the biological interactions occurring after exposure to radiation is necessary for the optimization of therapeutic procedures and of strategies for the reduction of radioinduced effects. The applied physics group of the Physics Institute of UERJ has been working on the characterization of biological samples (human tissues, teeth, saliva, soil, plants, sediments, air, water, organic matrixes, ceramics, fossil material, among others) using X-ray diffraction and X-ray fluorescence. The application of these techniques to the measurement, analysis and interpretation of the characteristics of biological tissues is attracting considerable interest in Medical and Environmental Physics. All quantitative data analysis must begin with the calculation of descriptive statistics (means and standard deviations) in order to obtain a preliminary notion of what the analysis will reveal. It is well known that the high values of standard deviation found in experimental measurements of biological samples can be attributed to biological factors, due to the specific characteristics of each individual (age, gender, environment, dietary habits, etc.). The main objective of this work is the development of a program that applies specific statistical methods to optimize the analysis of experimental data. Since the specialized programs for this analysis are proprietary, another objective of this work is the implementation of a code which is free and can be shared with other research groups. As the program developed since the

  18. Statistical Clustering and Compositional Modeling of Iapetus VIMS Spectral Data

    Science.gov (United States)

    Pinilla-Alonso, N.; Roush, T. L.; Marzo, G.; Dalle Ore, C. M.; Cruikshank, D. P.

    2009-12-01

    It has long been known that the surfaces of Saturn's major satellites are predominantly icy [e.g. 1 and references therein]. Since 2004, these bodies have been the subject of observations by the Cassini-VIMS (Visual and Infrared Mapping Spectrometer) experiment [2]. Iapetus has the unique property that the hemisphere centered on the apex of its locked synchronous orbital motion around Saturn has a very low geometrical albedo of 2-6%, while the opposite hemisphere is about 10 times more reflective. The nature and origin of the dark material of Iapetus has remained a question since its discovery [3 and references therein]. The nature of this material, and how it is distributed on the surface of this body, can shed new light on the Saturnian system. We apply statistical clustering [4] and theoretical modeling [5,6] to address the surface composition of Iapetus. The VIMS data evaluated were obtained during the second flyby of Iapetus, in September 2007. This close approach allowed VIMS to obtain spectra at relatively high spatial resolution, ~1-22 km/pixel. The data we study sampled the trailing hemisphere and part of the dark leading one. The statistical clustering [4] is used to identify statistically distinct spectra on Iapetus. The composition of these distinct spectra is evaluated using theoretical models [5,6]. We thank Allan Meyer for his help. This research was supported by an appointment to the NASA Postdoctoral Program at the Ames Research Center, administered by Oak Ridge Associated Universities through a contract with NASA. [1] Coradini, A. et al., 2009, Earth, Moon & Planets, 105, 289-310. [2] Brown et al., 2004, Space Science Reviews, 115, 111-168. [3] Cruikshank, D. et al., 2008, Icarus, 193, 334-343. [4] Marzo, G. et al., 2008, Journal of Geophysical Research, 113, E12, CiteID E12009. [5] Hapke, B., 1993, Theory of reflectance and emittance spectroscopy, Cambridge University Press. [6] Shkuratov, Y. et al., 1999, Icarus, 137, 235-246.

  19. The International Coal Statistics Data Base program maintenance guide

    International Nuclear Information System (INIS)

    1991-06-01

    The International Coal Statistics Data Base (ICSD) is a microcomputer-based system which contains information related to international coal trade. This includes coal production, consumption, imports and exports information. The ICSD is a secondary data base, meaning that information contained therein is derived entirely from other primary sources. It uses dBase III+ and Lotus 1-2-3 to locate, report and display data. The system is used for analysis in preparing the Annual Prospects for World Coal Trade (DOE/EIA-0363) publication. The ICSD system is menu driven and also permits the user who is familiar with dBase and Lotus operations to leave the menu structure to perform independent queries. Documentation for the ICSD consists of three manuals -- the User's Guide, the Operations Manual, and the Program Maintenance Manual. This Program Maintenance Manual provides the information necessary to maintain and update the ICSD system. Two major types of program maintenance documentation are presented in this manual. The first is the source code for the dBase III+ routines and related non-dBase programs used in operating the ICSD. The second is listings of the major component database field structures. A third important consideration for dBase programming, the structure of index files, is presented in the listing of source code for the index maintenance program. 1 fig

  20. Data Analysis & Statistical Methods for Command File Errors

    Science.gov (United States)

    Meshkat, Leila; Waggoner, Bruce; Bryant, Larry

    2014-01-01

    This paper explains current work on modeling for managing the risk of command file errors. It is focused on analyzing actual data from a JPL spaceflight mission to build models for evaluating and predicting error rates as a function of several key variables. We constructed a rich dataset by considering the number of errors, the number of files radiated, including the number of commands and blocks in each file, as well as subjective estimates of workload and operational novelty. We have assessed these data using different curve fitting and distribution fitting techniques, such as multiple regression analysis and maximum likelihood estimation, to see how much of the variability in the error rates can be explained with these. We have also used goodness of fit testing strategies and principal component analysis to further assess our data. Finally, we constructed a model of expected error rates based on what these statistics bore out as critical drivers of the error rate. This model allows project management to evaluate the error rate against a theoretically expected rate as well as anticipate future error rates.

  1. Statistical inference for imperfect maintenance models with missing data

    International Nuclear Information System (INIS)

    Dijoux, Yann; Fouladirad, Mitra; Nguyen, Dinh Tuan

    2016-01-01

    The paper considers complex industrial systems with incomplete maintenance history. A corrective maintenance is performed after the occurrence of a failure and its efficiency is assumed to be imperfect. In maintenance analysis, the databases are not necessarily complete. Specifically, the observations are assumed to be window-censored. This situation arises relatively frequently after the purchase of a second-hand unit or in the absence of maintenance record during the burn-in phase. The joint assessment of the wear-out of the system and the maintenance efficiency is investigated under missing data. A review along with extensions of statistical inference procedures from an observation window are proposed in the case of perfect and minimal repair using the renewal and Poisson theories, respectively. Virtual age models are employed to model imperfect repair. In this framework, new estimation procedures are developed. In particular, maximum likelihood estimation methods are derived for the most classical virtual age models. The benefits of the new estimation procedures are highlighted by numerical simulations and an application to a real data set. - Highlights: • New estimation procedures for window-censored observations and imperfect repair. • Extensions of inference methods for perfect and minimal repair with missing data. • Overview of maximum likelihood method with complete and incomplete observations. • Benefits of the new procedures highlighted by simulation studies and real application.

  2. Statistics

    International Nuclear Information System (INIS)

    2003-01-01

    For the year 2002, some of the figures shown in the tables of the Energy Review are preliminary. The annual statistics of the Energy Review also include historical time series over a longer period (see e.g. Energiatilastot 2001, Statistics Finland, Helsinki 2002). The applied energy units and conversion coefficients are shown on the inside back cover of the Review. Explanatory notes to the statistical tables can be found after tables and figures. The figures present: Changes in GDP, energy consumption and electricity consumption, Carbon dioxide emissions from fossil fuel use, Coal consumption, Consumption of natural gas, Peat consumption, Domestic oil deliveries, Import prices of oil, Consumer prices of principal oil products, Fuel prices in heat production, Fuel prices in electricity production, Price of electricity by type of consumer, Average monthly spot prices at the Nord pool power exchange, Total energy consumption by source and CO2 emissions, Supply and total consumption of electricity GWh, Energy imports by country of origin in January-June 2003, Energy exports by recipient country in January-June 2003, Consumer prices of liquid fuels, Consumer prices of hard coal, natural gas and indigenous fuels, Price of natural gas by type of consumer, Price of electricity by type of consumer, Price of district heating by type of consumer, Excise taxes, value added taxes and fiscal charges and fees included in consumer prices of some energy sources and Excise taxes, precautionary stock fees on oil pollution fees on energy products

  3. Statistics

    International Nuclear Information System (INIS)

    2004-01-01

    For the years 2003 and 2004, the figures shown in the tables of the Energy Review are partly preliminary. The annual statistics of the Energy Review also include historical time series over a longer period (see e.g. Energiatilastot, Statistics Finland, Helsinki 2003, ISSN 0785-3165). The applied energy units and conversion coefficients are shown on the inside back cover of the Review. Explanatory notes to the statistical tables can be found after tables and figures. The figures present: Changes in GDP, energy consumption and electricity consumption, Carbon dioxide emissions from fossil fuel use, Coal consumption, Consumption of natural gas, Peat consumption, Domestic oil deliveries, Import prices of oil, Consumer prices of principal oil products, Fuel prices in heat production, Fuel prices in electricity production, Price of electricity by type of consumer, Average monthly spot prices at the Nord pool power exchange, Total energy consumption by source and CO2 emissions, Supplies and total consumption of electricity GWh, Energy imports by country of origin in January-March 2004, Energy exports by recipient country in January-March 2004, Consumer prices of liquid fuels, Consumer prices of hard coal, natural gas and indigenous fuels, Price of natural gas by type of consumer, Price of electricity by type of consumer, Price of district heating by type of consumer, Excise taxes, value added taxes and fiscal charges and fees included in consumer prices of some energy sources and Excise taxes, precautionary stock fees on oil pollution fees

  4. Statistics

    International Nuclear Information System (INIS)

    2000-01-01

    For the years 1999 and 2000, some of the figures shown in the tables of the Energy Review are preliminary or estimated. The annual statistics of the Energy Review also include historical time series over a longer period (see e.g., Energiatilastot 1999, Statistics Finland, Helsinki 2000, ISSN 0785-3165). The inside of the Review's back cover shows the energy units and the conversion coefficients used for them. Explanatory notes to the statistical tables can be found after tables and figures. The figures present: Changes in the volume of GNP and energy consumption, Changes in the volume of GNP and electricity, Coal consumption, Natural gas consumption, Peat consumption, Domestic oil deliveries, Import prices of oil, Consumer prices of principal oil products, Fuel prices for heat production, Fuel prices for electricity production, Carbon dioxide emissions, Total energy consumption by source and CO2 emissions, Electricity supply, Energy imports by country of origin in January-June 2000, Energy exports by recipient country in January-June 2000, Consumer prices of liquid fuels, Consumer prices of hard coal, natural gas and indigenous fuels, Average electricity price by type of consumer, Price of district heating by type of consumer, Excise taxes, value added taxes and fiscal charges and fees included in consumer prices of some energy sources and Energy taxes and precautionary stock fees on oil products

  5. SEDA: A software package for the Statistical Earthquake Data Analysis

    Science.gov (United States)

    Lombardi, A. M.

    2017-03-01

    In this paper, the first version of the software SEDA (SEDAv1.0), designed to help seismologists statistically analyze earthquake data, is presented. The package consists of a user-friendly Matlab-based interface, which allows the user to easily interact with the application, and a computational core of Fortran codes, to guarantee the maximum speed. The primary factor driving the development of SEDA is to guarantee the research reproducibility, which is a growing movement among scientists and highly recommended by the most important scientific journals. SEDAv1.0 is mainly devoted to produce accurate and fast outputs. Less care has been taken for the graphic appeal, which will be improved in the future. The main part of SEDAv1.0 is devoted to the ETAS modeling. SEDAv1.0 contains a set of consistent tools on ETAS, allowing the estimation of parameters, the testing of model on data, the simulation of catalogs, the identification of sequences and forecasts calculation. The peculiarities of routines inside SEDAv1.0 are discussed in this paper. More specific details on the software are presented in the manual accompanying the program package.

  6. Advanced data analysis in neuroscience integrating statistical and computational models

    CERN Document Server

    Durstewitz, Daniel

    2017-01-01

    This book is intended for use in advanced graduate courses in statistics / machine learning, as well as for all experimental neuroscientists seeking to understand statistical methods at a deeper level, and theoretical neuroscientists with a limited background in statistics. It reviews almost all areas of applied statistics, from basic statistical estimation and test theory, linear and nonlinear approaches for regression and classification, to model selection and methods for dimensionality reduction, density estimation and unsupervised clustering. Its focus, however, is linear and nonlinear time series analysis from a dynamical systems perspective, based on which it aims to convey an understanding also of the dynamical mechanisms that could have generated observed time series. Further, it integrates computational modeling of behavioral and neural dynamics with statistical estimation and hypothesis testing. This way computational models in neuroscience are not only explanatory frameworks, but become powerfu...

  7. Enerdata statistical yearbook. "The key-data of energy worldwide". 1999 data

    International Nuclear Information System (INIS)

    2000-01-01

    The new edition of the Enerdata statistical yearbook provides the most recent statistical data on energy (oil, gas, coal and power production) and CO2 emissions worldwide for the period 1994-1999. These data cover 52 countries and 12 geographic areas and are presented in the form of tables and graphs (production, foreign trade, consumption, market shares, sectoral consumption, 1999 energy status, long-term trends). More data for a longer period (1970-1999) and for all countries worldwide are available on the CD-ROM version of the yearbook. (J.S.)

  8. Criminal victimization in Ukraine: analysis of statistical data

    Directory of Open Access Journals (Sweden)

    Serhiy Nezhurbida

    2007-12-01

    The article is based on the analysis of statistical data provided by law-enforcement, judicial and other bodies of Ukraine. The analysis gives an accurate quantitative picture of the current status of crime victimization in Ukraine and characterizes its basic features (level, rate, structure, dynamics, etc.).

  9. Statistical Modelling of Wind Profiles - Data Analysis and Modelling

    DEFF Research Database (Denmark)

    Jónsson, Tryggvi; Pinson, Pierre

    The aim of the analysis presented in this document is to investigate whether statistical models can be used to make very short-term predictions of wind profiles.

  10. Sensitivity analysis of ranked data: from order statistics to quantiles

    NARCIS (Netherlands)

    Heidergott, B.F.; Volk-Makarewicz, W.

    2015-01-01

    In this paper we provide the mathematical theory for sensitivity analysis of order statistics of continuous random variables, where the sensitivity is with respect to a distributional parameter. Sensitivity analysis of order statistics over a finite number of observations is discussed before

  11. Applying Statistical Process Control to Clinical Data: An Illustration.

    Science.gov (United States)

    Pfadt, Al; And Others

    1992-01-01

    Principles of statistical process control are applied to a clinical setting through the use of control charts to detect changes, as part of treatment planning and clinical decision-making processes. The logic of control chart analysis is derived from principles of statistical inference. Sample charts offer examples of evaluating baselines and…
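
To make the control-chart logic concrete, here is a minimal sketch of an individuals (XmR) chart, assuming that is an acceptable stand-in for the charts used in the article, which the abstract does not specify. Limits are estimated from a baseline phase using the average moving range, and later observations outside the limits are flagged. The function names are hypothetical.

```python
def xmr_limits(baseline):
    """Return (lower limit, center line, upper limit) from baseline observations."""
    center = sum(baseline) / len(baseline)
    moving_ranges = [abs(b - a) for a, b in zip(baseline, baseline[1:])]
    mr_bar = sum(moving_ranges) / len(moving_ranges)
    sigma_hat = mr_bar / 1.128  # d2 constant for moving ranges of size 2
    return center - 3 * sigma_hat, center, center + 3 * sigma_hat


def out_of_control(baseline, new_points):
    """Indices of new observations falling outside the 3-sigma limits."""
    lower, _, upper = xmr_limits(baseline)
    return [i for i, x in enumerate(new_points) if x < lower or x > upper]
```

For example, with a baseline of daily behaviour counts, `out_of_control(baseline, treatment_phase)` lists the treatment-phase days whose counts signal a change beyond ordinary variation.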

  12. EBprot: Statistical analysis of labeling-based quantitative proteomics data.

    Science.gov (United States)

    Koh, Hiromi W L; Swa, Hannah L F; Fermin, Damian; Ler, Siok Ghee; Gunaratne, Jayantha; Choi, Hyungwon

    2015-08-01

    Labeling-based proteomics is a powerful method for detection of differentially expressed proteins (DEPs). The current data analysis platform typically relies on protein-level ratios, which are obtained by summarizing peptide-level ratios for each protein. In shotgun proteomics, however, some proteins are quantified with more peptides than others, and this reproducibility information is not incorporated into the differential expression (DE) analysis. Here, we propose a novel probabilistic framework EBprot that directly models the peptide-protein hierarchy and rewards the proteins with reproducible evidence of DE over multiple peptides. To evaluate its performance with known DE states, we conducted a simulation study to show that the peptide-level analysis of EBprot provides better receiver-operating characteristic and more accurate estimation of the false discovery rates than the methods based on protein-level ratios. We also demonstrate superior classification performance of peptide-level EBprot analysis in a spike-in dataset. To illustrate the wide applicability of EBprot in different experimental designs, we applied EBprot to a dataset for lung cancer subtype analysis with biological replicates and another dataset for time course phosphoproteome analysis of EGF-stimulated HeLa cells with multiplexed labeling. Through these examples, we show that the peptide-level analysis of EBprot is a robust alternative to the existing statistical methods for the DE analysis of labeling-based quantitative datasets. The software suite is freely available on the Sourceforge website http://ebprot.sourceforge.net/. All MS data have been deposited in the ProteomeXchange with identifier PXD001426 (http://proteomecentral.proteomexchange.org/dataset/PXD001426/). © 2015 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.

  13. A statistical framework for differential network analysis from microarray data

    Directory of Open Access Journals (Sweden)

    Datta Somnath

    2010-02-01

    Background: It has long been well known that genes do not act alone; rather, groups of genes act in concert during a biological process. Consequently, the expression levels of genes are dependent on each other. Experimental techniques to detect such interacting pairs of genes have been in place for quite some time. With the advent of microarray technology, newer computational techniques to detect such interaction or association between gene expressions are being proposed, which lead to an association network. While most microarray analyses look for genes that are differentially expressed, it is of potentially greater significance to identify how entire association network structures change between two or more biological settings, say normal versus diseased cell types. Results: We provide a recipe for conducting a differential analysis of networks constructed from microarray data under two experimental settings. At the core of our approach lies a connectivity score that represents the strength of genetic association or interaction between two genes. We use this score to propose formal statistical tests for each of the following queries: (i) whether the overall modular structures of the two networks are different, (ii) whether the connectivity of a particular set of "interesting genes" has changed between the two networks, and (iii) whether the connectivity of a given single gene has changed between the two networks. A number of examples of this score are provided. We carried out our method on two types of simulated data: Gaussian networks and networks based on differential equations. We show that, for appropriate choices of the connectivity scores and tuning parameters, our method works well on simulated data. We also analyze a real data set involving normal versus heavy mice and identify an interesting set of genes that may play key roles in obesity. Conclusions: Examining changes in network structure can provide valuable information about the

  14. Chronic Obstructive Pulmonary Disease (COPD): Data and Statistics

    Science.gov (United States)

    ... and Statistics COPD Death Rates in the United States Printable Version [ ... Ohio and Mississippi Rivers. Printable Version [PDF 733KB] COPD Prevalence in the United States Printable Version [PDF ...

  15. Use of Statistics for Data Evaluation in Environmental Radioactivity Measurements

    International Nuclear Information System (INIS)

    Sutarman

    2001-01-01

    Counting statistics provide a correction to environmental radioactivity measurement results. Statistics provides formulas to determine the standard deviation (S_B) and the minimum detectable concentration (MDC) according to the Poisson distribution. Both formulas depend on the background count rate, counting time, counting efficiency, gamma intensity, and sample size. A long background counting time results in relatively low S_B and MDC, which can yield relatively accurate measurement results. (author)
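
    The abstract does not reproduce its formulas, so the following is only a minimal sketch under stated assumptions: the usual Poisson standard deviation of a net count rate, and the widely used Currie approximation for the detection limit (2.71 + 4.65 times the square root of the background counts, for equal sample and background counting times). The function and parameter names are hypothetical and are not taken from the paper.

```python
import math


def net_rate_sigma(gross_counts, t_sample, bkg_counts, t_bkg):
    """Standard deviation of the net count rate (counts per second).

    Assumes gross and background counts are independent Poisson variables,
    so the variance of each rate is counts / time**2.
    """
    return math.sqrt(gross_counts / t_sample**2 + bkg_counts / t_bkg**2)


def mdc_currie(bkg_rate, t_count, efficiency, gamma_intensity, sample_size):
    """Minimum detectable concentration (Bq per unit of sample size).

    Assumption: Currie's approximation with equal sample and background
    counting times; this may differ from the formula used in the paper.
    """
    bkg_counts = bkg_rate * t_count
    detection_limit_counts = 2.71 + 4.65 * math.sqrt(bkg_counts)
    return detection_limit_counts / (
        efficiency * gamma_intensity * t_count * sample_size
    )
```

    Under these assumptions, lengthening the counting time lowers the MDC, which matches the qualitative statement in the abstract.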

  16. Network similarity and statistical analysis of earthquake seismic data

    OpenAIRE

    Deyasi, Krishanu; Chakraborty, Abhijit; Banerjee, Anirban

    2016-01-01

    We study the structural similarity of earthquake networks constructed from seismic catalogs of different geographical regions. A hierarchical clustering of underlying undirected earthquake networks is shown using Jensen-Shannon divergence in graph spectra. The directed nature of links indicates that each earthquake network is strongly connected, which motivates us to study the directed version statistically. Our statistical analysis of each earthquake region identifies the hub regions. We cal...
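
    As a rough illustration of the distance measure named above, the sketch below computes the Jensen-Shannon divergence between two discrete probability distributions, for example normalized histograms of graph eigenvalues. How the authors turn graph spectra into distributions is not stated in the abstract, so the inputs here are assumed to be already binned and normalized; the function names are hypothetical.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: symmetric and bounded by 1."""
    # The mixture m is positive wherever p or q is positive, so KL stays finite.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
```

    A matrix of pairwise divergences between regional networks could then be passed to any hierarchical clustering routine.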

  17. Critical Views of 8th Grade Students toward Statistical Data in Newspaper Articles: Analysis in Light of Statistical Literacy

    Science.gov (United States)

    Guler, Mustafa; Gursoy, Kadir; Guven, Bulent

    2016-01-01

    Understanding and interpreting biased data, decision-making in accordance with the data, and critically evaluating situations involving data are among the fundamental skills necessary in the modern world. To develop these required skills, emphasis on statistical literacy in school mathematics has been gradually increased in recent years. The…

  18. Data-driven inference for the spatial scan statistic

    Directory of Open Access Journals (Sweden)

    Duczmal Luiz H

    2011-08-01

    Full Text Available Abstract Background Kulldorff's spatial scan statistic for aggregated area maps searches for clusters of cases without specifying their size (number of areas) or geographic location in advance. Their statistical significance is tested while adjusting for the multiple testing inherent in such a procedure. However, as is shown in this work, this adjustment is not done in an even manner for all possible cluster sizes. Results A modification is proposed to the usual inference test of the spatial scan statistic, incorporating additional information about the size of the most likely cluster found. A new interpretation of the results of the spatial scan statistic is given, posing a modified inference question: what is the probability that the null hypothesis is rejected for the original observed cases map with a most likely cluster of size k, taking into account only those most likely clusters of size k found under the null hypothesis for comparison? This question is especially important when the p-value computed by the usual inference process is near the alpha significance level, regarding the correctness of the decision based on this inference. Conclusions A practical procedure is provided to make more accurate inferences about the most likely cluster found by the spatial scan statistic.

  19. Data-driven inference for the spatial scan statistic.

    Science.gov (United States)

    Almeida, Alexandre C L; Duarte, Anderson R; Duczmal, Luiz H; Oliveira, Fernando L P; Takahashi, Ricardo H C

    2011-08-02

    Kulldorff's spatial scan statistic for aggregated area maps searches for clusters of cases without specifying their size (number of areas) or geographic location in advance. Their statistical significance is tested while adjusting for the multiple testing inherent in such a procedure. However, as is shown in this work, this adjustment is not done in an even manner for all possible cluster sizes. A modification is proposed to the usual inference test of the spatial scan statistic, incorporating additional information about the size of the most likely cluster found. A new interpretation of the results of the spatial scan statistic is given, posing a modified inference question: what is the probability that the null hypothesis is rejected for the original observed cases map with a most likely cluster of size k, taking into account only those most likely clusters of size k found under the null hypothesis for comparison? This question is especially important when the p-value computed by the usual inference process is near the alpha significance level, regarding the correctness of the decision based on this inference. A practical procedure is provided to make more accurate inferences about the most likely cluster found by the spatial scan statistic.

  20. IGESS: a statistical approach to integrating individual-level genotype data and summary statistics in genome-wide association studies.

    Science.gov (United States)

    Dai, Mingwei; Ming, Jingsi; Cai, Mingxuan; Liu, Jin; Yang, Can; Wan, Xiang; Xu, Zongben

    2017-09-15

    Results from genome-wide association studies (GWAS) suggest that a complex phenotype is often affected by many variants with small effects, known as 'polygenicity'. Tens of thousands of samples are often required to ensure statistical power of identifying these variants with small effects. However, it is often the case that a research group can only get approval for access to individual-level genotype data with a limited sample size (e.g. a few hundred or a few thousand). Meanwhile, summary statistics generated using single-variant-based analysis are becoming publicly available. The sample sizes associated with the summary statistics datasets are usually quite large. How to make the most efficient use of existing abundant data resources largely remains an open question. In this study, we propose a statistical approach, IGESS, to increase the statistical power of identifying risk variants and improve the accuracy of risk prediction by integrating individual-level genotype data and summary statistics. An efficient algorithm based on variational inference is developed to handle the genome-wide analysis. Through comprehensive simulation studies, we demonstrated the advantages of IGESS over the methods which take either individual-level data or summary statistics data as input. We applied IGESS to perform integrative analysis of Crohn's Disease from WTCCC and summary statistics from other studies. IGESS was able to significantly increase the statistical power of identifying risk variants and improve the risk prediction accuracy from 63.2% (±0.4%) to 69.4% (±0.1%) using about 240 000 variants. The IGESS software is available at https://github.com/daviddaigithub/IGESS. Contact: zbxu@xjtu.edu.cn or xwan@comp.hkbu.edu.hk or eeyang@hkbu.edu.hk. Supplementary data are available at Bioinformatics online.

  1. Statistical Analysis of CMC Constituent and Processing Data

    Science.gov (United States)

    Fornuff, Jonathan

    2004-01-01

    Ceramic Matrix Composites (CMCs) are the next "big thing" in high-temperature structural materials. In the case of jet engines, it is widely believed that the metallic superalloys currently being utilized for hot structures (combustors, shrouds, turbine vanes and blades) are nearing their potential limits of improvement. In order to allow for increased turbine temperatures to increase engine efficiency, material scientists have begun looking toward advanced CMCs and SiC/SiC composites in particular. Ceramic composites provide greater strength-to-weight ratios at higher temperatures than metallic alloys, but at the same time pose greater challenges in micro-structural optimization, which in turn increases the cost of the material as well as the risk of variability in the material's thermo-structural behavior. to model various potential CMC engine materials and examines the current variability in these properties due to variability in component processing conditions and constituent materials; then, to see how processing and constituent variations affect key strength, stiffness, and thermal properties of the finished components. Basically, this means trying to model variations in the component's behavior by knowing what went into creating it. inter-phase and manufactured by chemical vapor infiltration (CVI) and melt infiltration (MI) were considered. Examinations of: (1) the percent constituents by volume, (2) the inter-phase thickness, (3) variations in the total porosity, and (4) variations in the chemical composition of the SiC fiber are carried out and modeled using various codes used here at NASA-Glenn (PCGina, NASALife, CEMCAN, etc.). The effects of these variations and the ranking of their respective influences on the various thermo-mechanical material properties are studied and compared to available test data. The properties of the materials as well as minor changes to geometry are then made to the computer model and the detrimental effects

  2. Categorical and nonparametric data analysis choosing the best statistical technique

    CERN Document Server

    Nussbaum, E Michael

    2014-01-01

    Featuring in-depth coverage of categorical and nonparametric statistics, this book provides a conceptual framework for choosing the most appropriate type of test in various research scenarios. Class tested at the University of Nevada, the book's clear explanations of the underlying assumptions, computer simulations, and Exploring the Concept boxes help reduce reader anxiety. Problems inspired by actual studies provide meaningful illustrations of the techniques. The underlying assumptions of each test and the factors that impact validity and statistical power are reviewed so readers can explain

  3. Bayesian statistical analysis of censored data in geotechnical engineering

    DEFF Research Database (Denmark)

    Ditlevsen, Ove Dalager; Tarp-Johansen, Niels Jacob; Denver, Hans

    2000-01-01

    The geotechnical engineer is often faced with the problem of how to assess the statistical properties of a soil parameter on the basis of a sample measured in-situ or in the laboratory with the defect that some values have been replaced by interval bounds because the corresponding soil parameter values...

  4. Conducting tests for statistically significant differences using forest inventory data

    Science.gov (United States)

    James A. Westfall; Scott A. Pugh; John W. Coulston

    2013-01-01

    Many forest inventory and monitoring programs are based on a sample of ground plots from which estimates of forest resources are derived. In addition to evaluating metrics such as number of trees or amount of cubic wood volume, it is often desirable to make comparisons between resource attributes. To properly conduct statistical tests for differences, it is imperative...

  5. Extreme value theory and statistics for heavy tail data

    NARCIS (Netherlands)

    S. Caserta; C.G. de Vries (Casper)

    2003-01-01

    A scientific way of looking beyond the worst-case return is to employ statistical extreme value methods. Extreme Value Theory (EVT) shows that the probability of very large losses is eventually governed by a simple function, regardless of the specific distribution that underlies the return

  6. Submarine Upward Looking Sonar Ice Draft Profile Data and Statistics

    Data.gov (United States)

    National Oceanic and Atmospheric Administration, Department of Commerce — This data set consists of upward looking sonar draft data collected by submarines in the Arctic Ocean. It includes data from both U.S. Navy and Royal Navy...

  7. Statistical data processing of mobility curves of univalent weak bases

    Czech Academy of Sciences Publication Activity Database

    Šlampová, Andrea; Boček, Petr

    2008-01-01

    Roč. 29, č. 2 (2008), s. 538-541 ISSN 0173-0835 R&D Projects: GA AV ČR IAA400310609; GA ČR GA203/05/2106 Institutional research plan: CEZ:AV0Z40310501 Keywords : mobility curve * univalent weak bases * statistical evaluation Subject RIV: CB - Analytical Chemistry, Separation Impact factor: 3.509, year: 2008

  8. Development of computer-assisted instruction application for statistical data analysis android platform as learning resource

    Science.gov (United States)

    Hendikawati, P.; Arifudin, R.; Zahid, M. Z.

    2018-03-01

    This study aims to design an Android Statistics Data Analysis application that can be accessed through mobile devices, making it easier for users to access. The Statistics Data Analysis application covers various topics in basic statistics along with a parametric statistics data analysis tool. The output of this application is a parametric statistical data analysis that can be used by students, lecturers, and other users who need the results of statistical calculations quickly and in an easily understood form. The Android application is developed in the Java programming language. The server side uses PHP with the CodeIgniter framework, and the database is MySQL. The system development methodology is the Waterfall methodology, with the stages of analysis, design, coding, testing, implementation, and system maintenance. This statistical data analysis application is expected to support statistics lecturing activities and make it easier for students to understand statistical analysis on mobile devices.

  9. Statistical Data Analyses of Trace Chemical, Biochemical, and Physical Analytical Signatures

    Energy Technology Data Exchange (ETDEWEB)

    Udey, Ruth Norma [Michigan State Univ., East Lansing, MI (United States)

    2013-01-01

    Analytical and bioanalytical chemistry measurement results are most meaningful when interpreted using rigorous statistical treatments of the data. The same data set may provide many dimensions of information depending on the questions asked through the applied statistical methods. Three principal projects illustrated the wealth of information gained through the application of statistical data analyses to diverse problems.

  10. 75 FR 24718 - Guidance for Industry on Documenting Statistical Analysis Programs and Data Files; Availability

    Science.gov (United States)

    2010-05-05

    ...] Guidance for Industry on Documenting Statistical Analysis Programs and Data Files; Availability AGENCY... documenting statistical analyses and data files submitted to the Center for Veterinary Medicine (CVM) for the... on Documenting Statistical Analysis Programs and Data Files; Availability'' giving interested persons...

  11. Computer processing of 14C data; statistical tests and corrections of data

    International Nuclear Information System (INIS)

    Obelic, B.; Planinic, J.

    1977-01-01

    The described computer program calculates the age of samples and performs statistical tests and corrections of data. Data are obtained from the proportional counter that measures anticoincident pulses per 20-minute interval. After every 9th interval the counter measures the total number of counts per interval. Input data are punched on cards. The output list contains the input data schedule and the following results: mean CPM value, correction of CPM for normal pressure and temperature (NTP), sample age calculation based on 14C half-lives of 5570 and 5730 years, age correction for NTP, dendrochronological corrections and the relative radiocarbon concentration. All results are given with one standard deviation. Input data test (Chauvenet's criterion), gas purity test, standard deviation test and test of the data processor are also included in the program. (author)
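
    The original card-based program is not reproduced in the record. As a minimal sketch of two of the steps it names, the code below applies the standard decay-law age formula for a chosen half-life and a common textbook form of Chauvenet's criterion (a reading is rejected when the expected number of equally extreme readings in the sample falls below 0.5). Function names, and the use of a "modern" reference count rate, are assumptions made for illustration.

```python
import math
import statistics


def radiocarbon_age(sample_cpm, modern_cpm, half_life_years):
    """Age in years from the decay law A = A0 * exp(-ln(2) * t / T_half)."""
    return half_life_years / math.log(2) * math.log(modern_cpm / sample_cpm)


def chauvenet_keep(values):
    """Return the readings that pass Chauvenet's criterion (single pass)."""
    n = len(values)
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    kept = []
    for v in values:
        z = abs(v - mean) / sd
        # Two-sided normal tail probability times n = expected count this extreme.
        expected = n * math.erfc(z / math.sqrt(2))
        if expected >= 0.5:
            kept.append(v)
    return kept
```

    Calling `radiocarbon_age` once with 5570 and once with 5730 would give the two ages mentioned in the abstract; the NTP and dendrochronological corrections are not sketched here.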

  12. Methods & tools for publishing & reusing linked open statistical data

    NARCIS (Netherlands)

    Tambouris, Efthimios; Kalampokis, Evangelos; Janssen, M.F.W.H.A.; Krimmer, Robert; Tarabanis, Konstantinos

    2017-01-01

    The number of open data available for reuse is rapidly increasing. A large number of these data are numerical and thus can be easily visualized. Linked open data technology enables easy reuse and linking of data residing in different locations. In this workshop, we will present a number of

  13. Statistical Analysis of Clinical Data on a Pocket Calculator, Part 2 Statistics on a Pocket Calculator, Part 2

    CERN Document Server

    Cleophas, Ton J

    2012-01-01

    The first part of this title contained all statistical tests relevant to starting clinical investigations, and included tests for continuous and binary data, power, sample size, multiple testing, variability, confounding, interaction, and reliability. The current part 2 of this title reviews methods for handling missing data, manipulated data, multiple confounders, predictions beyond observation, uncertainty of diagnostic tests, and the problems of outliers. Also robust tests, non-linear modeling, goodness of fit testing, Bhatacharya models, item response modeling, superiority testing, variab

  14. Statistical insights from Romanian data on higher education

    Directory of Open Access Journals (Sweden)

    Andreea Ardelean

    2015-09-01

    Full Text Available This paper uses cluster analysis to compare Romanian higher education at the regional level. The evolution of higher education in the post-communist period is also presented, using quantitative traits. Although the focus is on university education, references to total education are included for comparison. Then, to highlight the importance of higher education, the chi-square test is applied to check whether there is an association between statistical regions and the education level of the unemployed.
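
    The test of association mentioned above can be sketched as follows. This is a plain Pearson chi-square statistic for a contingency table (rows could be regions, columns education levels of the unemployed); the record gives no data, so any table passed in is the caller's own, and the function name is hypothetical. Only the statistic and degrees of freedom are computed; the p-value would come from a chi-square table or a statistics library.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof
```

    A large statistic relative to the chi-square distribution with the returned degrees of freedom would indicate an association between the row and column variables.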

  15. Introduction to statistical data analysis for the life sciences

    CERN Document Server

    Ekstrom, Claus Thorn

    2014-01-01

    This text provides a computational toolbox that enables students to analyze real datasets and gain the confidence and skills to undertake more sophisticated analyses. Although accessible with any statistical software, the text encourages a reliance on R. For those new to R, an introduction to the software is available in an appendix. The book also includes end-of-chapter exercises as well as an entire chapter of case exercises that help students apply their knowledge to larger datasets and learn more about approaches specific to the life sciences.

  16. Statistical Analysis of Hypercalcaemia Data related to Transferability

    DEFF Research Database (Denmark)

    Frølich, Anne; Nielsen, Bo Friis

    2005-01-01

    In this report we describe statistical analysis related to a study of hypercalcaemia carried out in the Copenhagen area in the ten-year period from 1984 to 1994. Results from the study have previously been published in a number of papers [3, 4, 5, 6, 7, 8, 9] and in various abstracts and posters at conferences during the late eighties and early nineties. In this report we give a more detailed description of many of the analyses and provide some new results, primarily from simultaneous studies of several databases.

  17. Medicare Advantage Rates and Statistics - FFS Data (1998-...

    Data.gov (United States)

    U.S. Department of Health & Human Services — Medicare fee-for-service data for each county broken out by aged, disabled, and ESRD beneficiaries including data on total Medicare fee-for-service reimbursement and...

  18. North American Transportation Statistics Database - Data Mining Tool

    Data.gov (United States)

    Department of Transportation — Contains tables of data for the United States, Canada, and Mexico. Data tables are divided up into 12 categories, including a country overview, transportation flows,...

  19. Medicare Advantage Rates and Statistics - FFS Data 2008-2014

    Data.gov (United States)

    U.S. Department of Health & Human Services — Medicare fee-for-service data for each county broken out by aged, disabled, and ESRD beneficiaries including data on total Medicare fee-for-service reimbursement and...

  20. Statistical and Machine-Learning Data Mining Techniques for Better Predictive Modeling and Analysis of Big Data

    CERN Document Server

    Ratner, Bruce

    2011-01-01

    The second edition of a bestseller, Statistical and Machine-Learning Data Mining: Techniques for Better Predictive Modeling and Analysis of Big Data is still the only book, to date, to distinguish between statistical data mining and machine-learning data mining. The first edition, titled Statistical Modeling and Analysis for Database Marketing: Effective Techniques for Mining Big Data, contained 17 chapters of innovative and practical statistical data mining techniques. In this second edition, renamed to reflect the increased coverage of machine-learning data mining techniques, the author has

  1. Data and Statistics on New York's Mining Resources - NYS Dept. of

    Science.gov (United States)

    Data and Statistics on New York's Mining Resources: Mines in New York - Information on active mines in New York State

  2. Radar Derived Spatial Statistics of Summer Rain. Volume 2; Data Reduction and Analysis

    Science.gov (United States)

    Konrad, T. G.; Kropfli, R. A.

    1975-01-01

    Data reduction and analysis procedures are discussed along with the physical and statistical descriptors used. The statistical modeling techniques are outlined and examples of the derived statistical characterization of rain cells in terms of the several physical descriptors are presented. Recommendations concerning analyses which can be pursued using the data base collected during the experiment are included.

  3. Statistical yearbook. 1995 Data available as of 30 June 1997. 42. ed.

    International Nuclear Information System (INIS)

    1997-01-01

    This is the forty-second issue of the United Nations Statistical Yearbook, prepared by the Statistics Division, Department of Economic and Social Affairs of the United Nations Secretariat, since 1948. The present issue contains series covering, in general, 1985-1994 or 1986-1995, using statistics available to the Statistics Division up to 30 June 1997. The Yearbook is based on data compiled by the Statistics Division from over 40 different international and national sources

  4. Statistical yearbook 1993. Data available as of 31 December 1994. 40 ed.

    International Nuclear Information System (INIS)

    1995-01-01

    This is the fortieth issue of the United Nations Statistical Yearbook, prepared by the Statistical Division, Department for Economic and Social Information and Policy Analysis of the United Nations Secretariat, since 1948. The present issue contains series covering, in general, 1983-1992 or 1984-1993, using statistics available to the Statistical Division up to 31 December 1994. The Yearbook is based on data compiled by the Statistical Division from over 40 different international and national sources

  5. Statistical yearbook. 1996. Data available as of 30 September 1998. 43 ed.

    International Nuclear Information System (INIS)

    1999-01-01

    This is the forty-third issue of the United Nations Statistical Yearbook, prepared by the Statistics Division, Department of Economic and Social Affairs of the United Nations Secretariat, since 1948. The present issue contains series covering, in general, 1986-1995 or 1987-1996, using statistics available to the Statistics Division up to 30 September 1998. The Yearbook is based on data compiled by the Statistics Division from over 40 different international and national sources

  6. Statistical yearbook 1994. Data available as of 31 March 1996. 41 ed.

    International Nuclear Information System (INIS)

    1996-01-01

    This is the forty-first issue of the United Nations Statistical Yearbook, prepared by the Statistics Division, Department for Economic and Social Information and Policy Analysis of the United Nations Secretariat, since 1948. The present issue contains series covering, in general, 1984-1993 or 1985-1994, using statistics available to the Statistics Division up to 31 December 1995. The Yearbook is based on data compiled by the Statistics Division from over 40 different international and national sources

  7. Statistical Data Mining for Efficient Quality Control in Manufacturing

    DEFF Research Database (Denmark)

    Khan, Abdul Rauf; Schiøler, Henrik; Knudsen, Torben Steen

    2015-01-01

    of the process, e.g. sensor measurements, machine readings etc., and the major contributors to these big data sets are different quality control processes. In this article we present a methodology to extract valuable insight from manufacturing data. The proposed methodology is based on comparison of probabilities...

  8. Statistical modelling and deconvolution of yield meter data

    DEFF Research Database (Denmark)

    Tøgersen, Frede Aakmann; Waagepetersen, Rasmus Plenge

    Data for yield maps can be obtained from modern combine harvesters equipped with a differential global positioning system and a yield monitoring system. Due to delay and smoothing effects in the combine harvester the recorded yield data for a location represents a shifted weighted average of yiel...

  9. Statistical modeling for visualization evaluation through data fusion.

    Science.gov (United States)

    Chen, Xiaoyu; Jin, Ran

    2017-11-01

    There is high demand for data visualization that provides insights to users in various applications. However, a consistent, online visualization evaluation method to quantify mental workload or user preference is lacking, which leads to an inefficient visualization and user interface design process. Recently, advances in interactive and sensing technologies have made electroencephalogram (EEG) signals, eye movements and visualization logs available for user-centered evaluation. This paper proposes a data fusion model and an application procedure for quantitative and online visualization evaluation. 15 participants joined the study based on three different visualization designs. The results provide a regularized regression model which can accurately predict the user's evaluation of task complexity, and indicate the significance of all three types of sensing data sets for visualization evaluation. This model can be widely applied to data visualization evaluation, and to the evaluation of other user-centered designs and data analysis in human factors and ergonomics.

  10. China Dimensions Data Collection: Agricultural Statistics of the People's Republic of China: 1949-1990

    Data.gov (United States)

    National Aeronautics and Space Administration — Agricultural Statistics of the People's Republic of China, 1949-1990 is an historical collection of agricultural statistical data compiled by China's State...

  11. Healthcare-Associated Infections (HAIs) Data and Statistics

    Science.gov (United States)

    ... 2016 2015 HAI Data Report 2015 SIRs Using Historical Baselines 2014 HAI Progress Report FAQs: 2014 HAI ... National 2015 Standardized Infection Ratios (SIRs) Calculated Using Historical Baselines CDC’s annual National and State Healthcare-Associated ...

  12. National Vaccine Injury Compensation Program (VICP) -Data & Statistics

    Data.gov (United States)

    U.S. Department of Health & Human Services — The VICP program publishes a summary PDF report with several data tables: Number of Petitions Filed by Adjudication Categories by Alleged Vaccine, including # of...

  13. Lombardy (Italy) regional energy balance: 1984-1990 statistical data

    International Nuclear Information System (INIS)

    Berra, P.; Di Marzio, T.

    1992-01-01

    After a brief explanation of the scope and key econometric elements of the energy balance analysis, this paper tables energy supply and demand data for Italy's Lombardy Region. The primary and secondary energy data are expressed in metric quantities and in equivalent calorific values and are sub-divided according to type of energy source and consuming sector. Assessments are made of the degree of reliability of the information and sources of information

  14. [Insect cholinesterases and irreversible inhibitors. Statistical treatment of the data].

    Science.gov (United States)

    Moralev, S N

    2010-01-01

    The data on the sensitivity of cholinesterases (ChE) of different insects to reversible inhibitors, as well as the data on the physico-chemical parameters of the amino acids constituting their active centers, were treated by factor analysis and compared. Both characteristics are shown to be related to the taxonomic position of the insects. The "material substrate" of the factors determining the specificity of inhibitor action is identified as specific sites in the ChE active center.

  15. Machine Learning Algorithms for Statistical Patterns in Large Data Sets

    Science.gov (United States)

    2018-02-01

    SUBJECT TERMS Text Analysis, Text Exploitation, Situation Awareness of Text, Document Processing, Document Ingestion, Full Text Search, Information...Assortativity: Proclivity Index for Attributed Networks (PRONE).” Pacific-Asia Conference on Knowledge Discovery and Data Mining, 2017. pp. 225-237...international conference on Knowledge discovery and data mining, 2013. pp. 212-220. [18] Sutherland, D.J., Xiong, L., Póczos, B., and Schneider, J

  16. Statistical characteristics of surrogate data based on geophysical measurements

    Directory of Open Access Journals (Sweden)

    V. Venema

    2006-01-01

    Full Text Available In this study, the statistical properties of a range of measurements are compared with those of their surrogate time series. Seven different records are studied, amongst others, historical time series of mean daily temperature, daily rain sums and runoff from two rivers, and cloud measurements. Seven different algorithms are used to generate the surrogate time series. The best-known method is the iterative amplitude adjusted Fourier transform (IAAFT) algorithm, which is able to reproduce the measured distribution as well as the power spectrum. Using this setup, the measurements and their surrogates are compared with respect to their power spectrum, increment distribution, structure functions, annual percentiles and return values. It is found that the surrogates that reproduce the power spectrum and the distribution of the measurements are able to closely match the increment distributions and the structure functions of the measurements, but this often does not hold for surrogates that only mimic the power spectrum of the measurement. However, even the best performing surrogates do not have asymmetric increment distributions, i.e., they cannot reproduce nonlinear dynamical processes that are asymmetric in time. Furthermore, we have found deviations of the structure functions on small scales.

  17. Security of statistical data bases: invasion of privacy through attribute correlational modeling

    Energy Technology Data Exchange (ETDEWEB)

    Palley, M.A.

    1985-01-01

    This study develops, defines, and applies a statistical technique for the compromise of confidential information in a statistical data base. Attribute Correlational Modeling (ACM) recognizes that the information contained in a statistical data base represents real world statistical phenomena. As such, ACM assumes correlational behavior among the database attributes. ACM proceeds to compromise confidential information through creation of a regression model, where the confidential attribute is treated as the dependent variable. The typical statistical data base may preclude the direct application of regression. In this scenario, the research introduces the notion of a synthetic data base, created through legitimate queries of the actual data base, and through proportional random variation of responses to these queries. The synthetic data base is constructed to resemble the actual data base as closely as possible in a statistical sense. ACM then applies regression analysis to the synthetic data base, and utilizes the derived model to estimate confidential information in the actual database.

  18. Data analysis of asymmetric structures: advanced approaches in computational statistics

    CERN Document Server

    Saito, Takayuki

    2004-01-01

    Data Analysis of Asymmetric Structures provides a comprehensive presentation of a variety of models and theories for the analysis of asymmetry and its applications and provides a wealth of new approaches in every section. It meets both the practical and theoretical needs of research professionals across a wide range of disciplines and considers data analysis in fields such as psychology, sociology, social science, ecology, and marketing. In seven comprehensive chapters this guide details theories, methods, and models for the analysis of asymmetric structures in a variety of disciplines and presents future opportunities and challenges affecting research developments and business applications.

  19. Understanding the coverage of Statistics New Zealand's Integrated Data Infrastructure

    Directory of Open Access Journals (Sweden)

    Gareth Minshall

    2017-04-01

    Identifying NZ residents at a given time, and quantifying errors in administrative data sources, will assist researchers' ability to recognise and adjust for these errors in their analysis. Simply quantifying (often for the first time) the limitations of administrative sources also provides impetus to improving the collection of these variables at source.

  20. Statistical Bayesian method for reliability evaluation based on ADT data

    Science.gov (United States)

    Lu, Dawei; Wang, Lizhi; Sun, Yusheng; Wang, Xiaohong

    2018-05-01

    Accelerated degradation testing (ADT) is frequently conducted in the laboratory to predict a product's reliability under normal operating conditions. Two kinds of methods, degradation path models and stochastic process models, are used to analyze degradation data, and the latter is the more popular. However, limitations such as an imprecise solution process and imprecise estimates of the degradation ratio still exist, which may affect the accuracy of the acceleration model and the extrapolated value. Moreover, the usual solution to this problem, the Bayesian method, loses key information when unifying the degradation data. In this paper, a new data processing and parameter inference method based on the Bayesian method is proposed to handle degradation data and solve the problems above. First, a Wiener process and an acceleration model are chosen. Second, the initial values of the degradation model and the parameters of the prior and posterior distributions under each level are calculated by updating and iterating the estimated values. Third, the lifetime and reliability values are estimated on the basis of the estimated parameters. Finally, a case study is provided to demonstrate the validity of the proposed method. The results illustrate that the proposed method is quite effective and accurate in estimating the lifetime and reliability of a product.
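
    To make the Wiener-process step concrete, the sketch below fits a drifted Wiener degradation model and evaluates its reliability function. It deliberately uses plain maximum likelihood rather than the Bayesian updating proposed in the paper, and the threshold, parameter values and simulated data are assumptions for illustration only.

```python
import math
import random


def wiener_mle(increments, dt):
    """MLE of drift and diffusion from equally spaced degradation increments."""
    n = len(increments)
    mu = sum(increments) / (n * dt)
    sigma2 = sum((dx - mu * dt) ** 2 for dx in increments) / (n * dt)
    return mu, math.sqrt(sigma2)


def normal_cdf(z):
    """Standard normal CDF via the complementary error function."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def reliability(t, mu, sigma, threshold):
    """P(X(s) = mu*s + sigma*B(s) has not reached threshold by time t)."""
    s = sigma * math.sqrt(t)
    # Note: math.exp can overflow when 2*mu*threshold/sigma**2 is very large.
    return (normal_cdf((threshold - mu * t) / s)
            - math.exp(2 * mu * threshold / sigma ** 2)
            * normal_cdf(-(threshold + mu * t) / s))


# Simulated increments (synthetic; assumed true values mu=0.5, sigma=0.2).
rng = random.Random(0)
dt = 1.0
increments = [rng.gauss(0.5 * dt, 0.2 * math.sqrt(dt)) for _ in range(2000)]
mu_hat, sigma_hat = wiener_mle(increments, dt)
mean_lifetime = 10.0 / mu_hat  # mean first-passage time to a threshold of 10
```

    The lifetime here is the first time the degradation path crosses the failure threshold, which for a Wiener process with positive drift follows an inverse Gaussian distribution with mean threshold/drift.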

  1. Statistical analysis of longitudinal network data with changing composition

    NARCIS (Netherlands)

    Huisman, M; Snijders, Tom A.B.

    2003-01-01

    Markov chains can be used for the modeling of complex longitudinal network data. One class of probability models to model the evolution of social networks are stochastic actor-oriented models for network change proposed by Snijders. These models are continuous-time Markov chain models that are

  2. Statistical modelling and deconvolution of yield meter data

    DEFF Research Database (Denmark)

    Tøgersen, Frede Aakmann; Waagepetersen, Rasmus Plenge

    2004-01-01

    and an impulse response function. This results in an unusual spatial covariance structure (depending on the driving pattern of the combine harvester) for the yield monitoring system data. Parameters of the impulse response function and the spatial covariance function of the yield are estimated using maximum...

  3. Reducing Statistical Noise in Airborne Gamma-Ray Data

    DEFF Research Database (Denmark)

    Hovgaard, Jens; Grasty, R. L.

    1997-01-01

    By using the Noise Adjusted Singular Value Decomposition (NASVD) technique it is possible to reconstruct the measured airborne gamma-ray spectra with a noise content that is significantly smaller than the noise contained in the original measured spectra. The method can be used for improving the output of the data processing, for example mapping of Th, U, and K distribution.
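
    A rough sketch of the NASVD idea follows, assuming Poisson counting noise: each spectrum is scaled to approximately unit variance, a truncated SVD keeps the leading components, and the scaling is undone. The scaling choice, the number of components and the synthetic two-component spectra are assumptions, not details given in this record.

```python
import numpy as np


def nasvd_reconstruct(spectra, n_components):
    """Noise-adjusted SVD smoothing of count spectra (rows = measurements)."""
    spectra = np.asarray(spectra, dtype=float)
    totals = spectra.sum(axis=1, keepdims=True)       # counts per spectrum
    mean_shape = spectra.sum(axis=0) / spectra.sum()  # unit-sum mean spectrum
    expected = totals * mean_shape                    # Poisson variance estimate
    scale = np.sqrt(expected)                         # must be > 0 everywhere
    u, s, vt = np.linalg.svd(spectra / scale, full_matrices=False)
    k = n_components
    low_rank = (u[:, :k] * s[:k]) @ vt[:k]            # keep leading components
    return low_rank * scale                           # undo the noise scaling


# Synthetic spectra (not survey data): mixtures of two hypothetical shapes.
rng = np.random.default_rng(0)
channels = np.arange(64)
shape_a = np.exp(-0.5 * ((channels - 20) / 4.0) ** 2) + 0.05
shape_b = np.exp(-0.5 * ((channels - 45) / 6.0) ** 2) + 0.05
weights = rng.uniform(5, 50, size=(200, 2))
true_mean = weights @ np.vstack([shape_a, shape_b])
measured = rng.poisson(true_mean).astype(float)

smoothed = nasvd_reconstruct(measured, n_components=2)
raw_error = np.mean((measured - true_mean) ** 2)
smoothed_error = np.mean((smoothed - true_mean) ** 2)
```

    Comparing `raw_error` with `smoothed_error` shows how much counting noise the truncation removes in this toy case.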

  4. Statistical modelling of spatio-temporal dependencies in NGS data

    NARCIS (Netherlands)

    Ranciati, Saverio

    2016-01-01

    Next-generation sequencing (NGS) has rapidly established itself as the current standard in genetic analysis. This switch from microarrays to NGS requires new statistical strategies to address the research questions. First, NGS data consist of discrete observations, usually

  5. Integrating the statistical analysis of spatial data in ecology

    Science.gov (United States)

    A. M. Liebhold; J. Gurevitch

    2002-01-01

    In many areas of ecology there is an increasing emphasis on spatial relationships. Often ecologists are interested in new ways of analyzing data with the objective of quantifying spatial patterns, and in designing surveys and experiments in light of the recognition that there may be underlying spatial pattern in biotic responses. In doing so, ecologists have adopted a...

  6. Sharing Privacy Protected and Statistically Sound Clinical Research Data Using Outsourced Data Storage

    Directory of Open Access Journals (Sweden)

    Geontae Noh

    2014-01-01

    Full Text Available It is critical to scientific progress to share clinical research data stored in outsourced generally available cloud computing services. Researchers are able to obtain valuable information that they would not otherwise be able to access; however, privacy concerns arise when sharing clinical data in these outsourced publicly available data storage services. HIPAA requires researchers to deidentify private information when disclosing clinical data for research purposes and describes two available methods for doing so. Unfortunately, both techniques degrade statistical accuracy. Therefore, the need to protect privacy presents a significant problem for data sharing between hospitals and researchers. In this paper, we propose a controlled secure aggregation protocol to secure both privacy and accuracy when researchers outsource their clinical research data for sharing. Since clinical data must remain private beyond a patient’s lifetime, we take advantage of lattice-based homomorphic encryption to guarantee long-term security against quantum computing attacks. Using lattice-based homomorphic encryption, we design an aggregation protocol that aggregates outsourced ciphertexts under distinct public keys. It enables researchers to get aggregated results from outsourced ciphertexts of distinct researchers. To the best of our knowledge, our protocol is the first aggregation protocol which can aggregate ciphertexts which are encrypted with distinct public keys.

  7. Statistical problems raised by data processing of food surveys

    International Nuclear Information System (INIS)

    Lacourly, Nancy

    1974-01-01

    The methods used for the analysis of the dietary habits of national populations (food surveys) have been studied. S. Lederman's linear model for estimating average individual consumption from total family diets was examined in the light of a food survey carried out with 250 Roman families in 1969. An important bias in the estimates thus obtained was revealed by a simulation assuming 'housewife's dictatorship'; these assumptions should help to set up an unbiased model. Several techniques of multidimensional analysis were therefore used, and the theoretical aspects of linear regression for some particular situations had to be investigated: quasi-collinear 'independent variables', measurements with errors, and positive constraints on regression coefficients. A new survey methodology was developed taking account of the new 'Integrated Information Systems', which affect all the stages of a consumption survey: organization, data collection, constitution of an information bank and data processing. (author) [fr]

  8. Statistical Analysis Methods for the fMRI Data

    Directory of Open Access Journals (Sweden)

    Huseyin Boyaci

    2011-08-01

    Full Text Available Functional magnetic resonance imaging (fMRI) is a safe and non-invasive way to assess brain functions by using signal changes associated with brain activity. The technique has become a ubiquitous tool in basic, clinical and cognitive neuroscience. This method can measure small metabolic changes that occur in active parts of the brain. We process the fMRI data to find the parts of the brain that are involved in a mechanism, or to determine the changes that occur in brain activity due to a brain lesion. In this study we give an overview of the methods that are used for the analysis of fMRI data.

  9. Practical guidance for statistical analysis of operational event data

    International Nuclear Information System (INIS)

    Atwood, C.L.

    1995-10-01

    This report presents ways to avoid mistakes that are sometimes made in analysis of operational event data. It then gives guidance on what to do when a model is rejected, a list of standard types of models to consider, and principles for choosing one model over another. For estimating reliability, it gives advice on which failure modes to model, and moment formulas for combinations of failure modes. The issues are illustrated with many examples and case studies.

  10. Enhanced Statistical Estimation of Air Temperature Incorporating Nighttime Light Data

    Directory of Open Access Journals (Sweden)

    Yunhao Chen

    2016-08-01

    Full Text Available Near surface air temperature (Ta) is one of the most critical variables in climatology, hydrology, epidemiology, and environmental health. In situ measurements are not efficient for characterizing spatially heterogeneous Ta, while remote sensing is a powerful tool to break this limitation. This study proposes a mapping framework for daily mean Ta using an enhanced empirical regression method based on remote sensing data. It differs from previous studies in three aspects. First, nighttime light data is introduced as a predictor (besides land surface temperature, normalized difference vegetation index, impervious surface area, black sky albedo, normalized difference water index, elevation, and duration of daylight) considering the urbanization-induced Ta increase over a large area. Second, independent components are extracted using principal component analysis considering the correlations among the above predictors. Third, a composite sinusoidal coefficient regression is developed considering the dynamic Ta-predictor relationship. This method was performed at 333 weather stations in China during 2001–2012. Evaluation shows overall mean error of −0.01 K, root mean square error (RMSE) of 2.53 K, correlation coefficient (R2) of 0.96, and average uncertainty of 0.21 K. Model inter-comparison shows that this method outperforms six additional empirical regressions that have not incorporated nighttime light data or considered predictor independence or coefficient dynamics (by 0.18–2.60 K in RMSE and 0.00–0.15 in R2).
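
    The evaluation scores quoted in this abstract (mean error, RMSE and R2) can be computed as in the short sketch below. The function name and the four temperature values are made up for illustration, and R2 is taken here to be the squared Pearson correlation, which is an assumption about the study's definition.

```python
import math


def evaluation_metrics(observed, predicted):
    """Return (mean error, RMSE, squared Pearson correlation)."""
    n = len(observed)
    errors = [p - o for o, p in zip(observed, predicted)]
    mean_error = sum(errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    r2 = cov * cov / (var_o * var_p)
    return mean_error, rmse, r2


# Hypothetical station temperatures in kelvin (not data from the study).
observed_ta = [280.0, 285.0, 290.0, 295.0]
predicted_ta = [280.5, 285.5, 290.5, 295.5]
mean_error, rmse, r2 = evaluation_metrics(observed_ta, predicted_ta)
```

    A constant offset, as in this toy input, shows up in the mean error and RMSE but leaves R2 at 1, which is why the abstract reports all three.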

  11. Sources of Safety Data and Statistical Strategies for Design and Analysis: Transforming Data Into Evidence.

    Science.gov (United States)

    Ma, Haijun; Russek-Cohen, Estelle; Izem, Rima; Marchenko, Olga V; Jiang, Qi

    2018-03-01

    Safety evaluation is a key aspect of medical product development. It is a continual and iterative process requiring thorough thinking, and dedicated time and resources. In this article, we discuss how safety data are transformed into evidence to establish and refine the safety profile of a medical product, and how the focus of safety evaluation, data sources, and statistical methods change throughout a medical product's life cycle. Some challenges and statistical strategies for medical product safety evaluation are discussed. Examples of safety issues identified in different periods, that is, premarketing and postmarketing, are discussed to illustrate how different sources are used in the safety signal identification and the iterative process of safety assessment. The examples highlighted range from a commonly used pediatric vaccine given to healthy children to medical products primarily used to treat a medical condition in adults. These case studies illustrate that different products may require different approaches, and once a signal is discovered, it could impact future safety assessments. Many challenges still remain in this area despite advances in methodologies, infrastructure, public awareness, international harmonization, and regulatory enforcement. Innovations in safety assessment methodologies are urgently needed in order to make the medical product development process more efficient and effective, and the assessment of medical product marketing approval more streamlined and structured. Health care payers, providers, and patients may have different perspectives when weighing in on clinical, financial and personal needs when therapies are being evaluated.

  12. 1992 statistical data on electricity consumption and generation in Bulgaria

    International Nuclear Information System (INIS)

    Georgiev, A.

    1993-01-01

    The report provides data on monthly power consumption based on peak-day load profiles, the power balance of the electric system in peak hours of absolute maximal load, and an analysis of the usability of different kinds of power plants in the country. The total energy production of all power plants for 1992 is 35555 million kWh: 48.6% TPP, 32.5% NPP, 5.8% HPP, 8.5% in-house TPP, 4.6% district heating PP. 14 tabs., 5 figs. (author)
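
    As a worked check of the production figures, the shares can be converted to absolute generation. This sketch reads the total as 35555 million kWh (that is, 35555 GWh) and the last share as 4.6%; both readings are assumptions about the abbreviated record, and the dictionary keys simply reuse its plant-type labels.

```python
# Total 1992 generation, read as 35555 million kWh = 35555 GWh (assumed unit).
total_gwh = 35555

# Shares by plant type as listed in the record (last entry assumed to be a %).
shares_percent = {
    "TPP": 48.6,
    "NPP": 32.5,
    "HPP": 5.8,
    "in-house TPP": 8.5,
    "district heating PP": 4.6,
}

# Absolute generation per plant type in GWh.
generation_gwh = {name: total_gwh * share / 100
                  for name, share in shares_percent.items()}

for name, gwh in generation_gwh.items():
    print(f"{name}: {gwh:.1f} GWh")
```

    Under these readings the five shares add up to 100%, which supports interpreting the final figure as a percentage.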

  13. Data Representations, Transformations, and Statistics for Visual Reasoning

    CERN Document Server

    Maciejewski, Ross

    2011-01-01

    Analytical reasoning techniques are methods by which users explore their data to obtain insight and knowledge that can directly support situational awareness and decision making. Recently, the analytical reasoning process has been augmented through the use of interactive visual representations and tools which utilize cognitive, design and perceptual principles. These tools are commonly referred to as visual analytics tools, and the underlying methods and principles have roots in a variety of disciplines. This chapter provides an introduction to young researchers as an overview of common visual

  14. Statistical data on energy. France; Statistiques energetiques. France

    Energy Technology Data Exchange (ETDEWEB)

    NONE

    2002-05-01

    This document summarizes in a series of tables the energy status of France for the year 2001: supplies, consumption and uses, national production, evolution per energy source and per sector of the national production and consumption since 1973, general indicators (evolution of the energy bill, prices, energy independence and gross domestic product since 1973), projections. Details about the resources, uses and prices are given separately for petroleum, natural gas, electricity and solid mineral fuels and compared with the average data of the European Union. (J.S.)

  15. The OptIC Data Assimilation Intercomparison: A Statistical Critique

    Science.gov (United States)

    Enting, I. G.; Clisby, N.

    2008-12-01

    The development of improved terrestrial carbon models has assumed great importance because of concerns about significant climate-to-carbon feedback processes. The complexity of the interactions leads to considerable difficulties in the process of model calibration. The OptIC intercomparison explored some aspects of model calibration, using an idealised terrestrial carbon model. Participants were invited to estimate model parameters in various cases defined by specified time series of the model state, with various forms of added noise. The study identified the crucial importance of the choice of cost function. The present analysis revisits the OptIC study, by considering it as an exercise in statistical estimation. This treats the observations as random variables. Consequently parameter estimates, â, based on observations will also be random variables whose distribution is known as the 'sampling distribution'. Key questions for any specific case are: Are departures from â/a_true = 1 an indication of bias or sampling error? Under what circumstances are uncertainty estimates (of Var[â]) reliable? We consider cases where the estimate is obtained by minimising a cost function, ΘX. Assuming that we know the true form of ℓ, the log likelihood, there are three different characterisations of uncertainty that should be distinguished: (i) The uncertainty from maximum-likelihood estimates, corresponding (either exactly or asymptotically) to the Cramér-Rao bound. In a realistic calibration situation, we won't be able to determine this because the 'true' form of the likelihood is unknown. (ii) The actual uncertainty associated with using a particular cost function. If the true noise distribution is known, this can be calculated in simple cases and determined from simulations in more complicated cases. (iii) The 'formal uncertainty' based on assuming (usually incorrectly) that ΘX is the true likelihood. In the first stage of the analysis, the distinctions are illustrated by
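
    The difference between the actual uncertainty (ii) and the formal uncertainty (iii) can be illustrated by simulation. The sketch below is a toy setup, not the OptIC model: a one-parameter linear model, AR(1) noise and a sum-of-squares cost function are all assumptions chosen so that the cost function is not the true likelihood.

```python
import numpy as np

rng = np.random.default_rng(0)
a_true, phi, n_obs, n_rep = 2.0, 0.8, 50, 2000
x = np.linspace(0.0, 1.0, n_obs)

# Autocorrelated AR(1) noise, so a sum-of-squares cost is not the true likelihood.
innovations = rng.normal(0.0, 0.1, size=(n_rep, n_obs))
noise = np.empty_like(innovations)
noise[:, 0] = innovations[:, 0] / np.sqrt(1 - phi ** 2)  # stationary start
for t in range(1, n_obs):
    noise[:, t] = phi * noise[:, t - 1] + innovations[:, t]
y = a_true * x + noise

# Minimising sum((y - a*x)**2) over a has this closed-form solution.
a_hat = (y @ x) / (x @ x)

# (ii) Actual uncertainty: spread of the simulated sampling distribution.
actual_var = a_hat.var()

# (iii) Formal uncertainty: treat the cost as an iid Gaussian likelihood.
residuals = y - a_hat[:, None] * x
formal_var = (residuals ** 2).sum(axis=1) / (n_obs - 1) / (x @ x)

# Mean of a_hat / a_true across replicates, to separate bias from sampling error.
mean_ratio = (a_hat / a_true).mean()
```

    In this setup the mean ratio stays close to 1, so individual departures reflect sampling error rather than bias, while the formal variance understates the actual variance because the noise correlation is ignored.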

  16. Energy demand forecasting method based on international statistical data

    International Nuclear Information System (INIS)

    Glanc, Z.; Kerner, A.

    1997-01-01

    Poland is in a transition phase from a centrally planned to a market economy; data collected under former economic conditions do not reflect a market economy. Final energy demand forecasts are based on the assumption that the economic transformation in Poland will gradually lead the Polish economy, technologies and modes of energy use, to the same conditions as mature market economy countries. The starting point has a significant influence on the future energy demand and supply structure: final energy consumption per capita in 1992 was almost half the average of OECD countries; energy intensity, based on Purchasing Power Parities (PPP) and referred to GDP, is more than 3 times higher in Poland. A method of final energy demand forecasting based on regression analysis is described in this paper. The input data are: output of macroeconomic and population growth forecast; time series 1970-1992 of OECD countries concerning both macroeconomic characteristics and energy consumption; and energy balance of Poland for the base year of the forecast horizon. (author). 1 ref., 19 figs, 4 tabs

  17. Energy demand forecasting method based on international statistical data

    Energy Technology Data Exchange (ETDEWEB)

    Glanc, Z; Kerner, A [Energy Information Centre, Warsaw (Poland)

    1997-09-01

    Poland is in a transition phase from a centrally planned to a market economy; data collected under former economic conditions do not reflect a market economy. Final energy demand forecasts are based on the assumption that the economic transformation in Poland will gradually lead the Polish economy, technologies and modes of energy use, to the same conditions as mature market economy countries. The starting point has a significant influence on the future energy demand and supply structure: final energy consumption per capita in 1992 was almost half the average of OECD countries; energy intensity, based on Purchasing Power Parities (PPP) and referred to GDP, is more than 3 times higher in Poland. A method of final energy demand forecasting based on regression analysis is described in this paper. The input data are: output of macroeconomic and population growth forecast; time series 1970-1992 of OECD countries concerning both macroeconomic characteristics and energy consumption; and energy balance of Poland for the base year of the forecast horizon. (author). 1 ref., 19 figs, 4 tabs.

  18. Statistical inconsistencies in the KiDS-450 data set

    Science.gov (United States)

    Efstathiou, George; Lemos, Pablo

    2018-05-01

    The Kilo-Degree Survey (KiDS) has been used in several recent papers to infer constraints on the amplitude of the matter power spectrum and matter density at low redshift. Some of these analyses have claimed tension with the Planck Λ cold dark matter cosmology at the ~2σ-3σ level, perhaps indicative of new physics. However, Planck is consistent with other low-redshift probes of the matter power spectrum such as redshift-space distortions and the combined galaxy-mass and galaxy-galaxy power spectra. Here, we perform consistency tests of the KiDS data, finding internal tensions for various cuts of the data at ~2.2σ-3.5σ significance. Until these internal tensions are understood, we argue that it is premature to claim evidence for new physics from KiDS. We review the consistency between KiDS and other weak lensing measurements of S8, highlighting the importance of intrinsic alignments for precision cosmology.

  19. Small Sample Statistics for Incomplete Nonnormal Data: Extensions of Complete Data Formulae and a Monte Carlo Comparison

    Science.gov (United States)

    Savalei, Victoria

    2010-01-01

    Incomplete nonnormal data are common occurrences in applied research. Although these 2 problems are often dealt with separately by methodologists, they often cooccur. Very little has been written about statistics appropriate for evaluating models with such data. This article extends several existing statistics for complete nonnormal data to…

  20. Developing Statistical Literacy Using Real-World Data: Investigating Socioeconomic Secondary Data Resources Used in Research and Teaching

    Science.gov (United States)

    Carter, Jackie; Noble, Susan; Russell, Andrew; Swanson, Eric

    2011-01-01

    Increasing volumes of statistical data are being made available on the open web, including from the World Bank. This "data deluge" provides both opportunities and challenges. Good use of these data requires statistical literacy. This paper presents results from a project that set out to better understand how socioeconomic secondary data…

  1. Development of statistical analysis code for meteorological data (W-View)

    International Nuclear Information System (INIS)

    Tachibana, Haruo; Sekita, Tsutomu; Yamaguchi, Takenori

    2003-03-01

    A computer code (W-View: Weather View) was developed to analyze meteorological data statistically based on 'the guideline of meteorological statistics for the safety analysis of nuclear power reactor' (Nuclear Safety Commission on January 28, 1982; revised on March 29, 2001). The code provides the statistical meteorological data needed to assess the public dose during normal operation and severe accidents when applying for a nuclear reactor operating license. This code was revised from the original code, which ran on a large office computer, to enable a personal computer user to analyze the meteorological data simply and conveniently and to make the statistical data tables and figures of meteorology. (author)

  2. Toward Global Comparability of Sexual Orientation Data in Official Statistics: A Conceptual Framework of Sexual Orientation for Health Data Collection in New Zealand's Official Statistics System

    Science.gov (United States)

    Gray, Alistair; Veale, Jaimie F.; Binson, Diane; Sell, Randell L.

    2013-01-01

    Objective. Effectively addressing health disparities experienced by sexual minority populations requires high-quality official data on sexual orientation. We developed a conceptual framework of sexual orientation to improve the quality of sexual orientation data in New Zealand's Official Statistics System. Methods. We reviewed conceptual and methodological literature, culminating in a draft framework. To improve the framework, we held focus groups and key-informant interviews with sexual minority stakeholders and producers and consumers of official statistics. An advisory board of experts provided additional guidance. Results. The framework proposes working definitions of the sexual orientation topic and measurement concepts, describes dimensions of the measurement concepts, discusses variables framing the measurement concepts, and outlines conceptual grey areas. Conclusion. The framework proposes standard definitions and concepts for the collection of official sexual orientation data in New Zealand. It presents a model for producers of official statistics in other countries, who wish to improve the quality of health data on their citizens. PMID:23840231

  3. Integrated Data Collection Analysis (IDCA) Program - Statistical Analysis of RDX Standard Data Sets

    Energy Technology Data Exchange (ETDEWEB)

    Sandstrom, Mary M. [Los Alamos National Lab. (LANL), Los Alamos, NM (United States); Brown, Geoffrey W. [Los Alamos National Lab. (LANL), Los Alamos, NM (United States); Preston, Daniel N. [Los Alamos National Lab. (LANL), Los Alamos, NM (United States); Pollard, Colin J. [Los Alamos National Lab. (LANL), Los Alamos, NM (United States); Warner, Kirstin F. [Naval Surface Warfare Center (NSWC), Indian Head, MD (United States). Indian Head Division; Sorensen, Daniel N. [Naval Surface Warfare Center (NSWC), Indian Head, MD (United States). Indian Head Division; Remmers, Daniel L. [Naval Surface Warfare Center (NSWC), Indian Head, MD (United States). Indian Head Division; Phillips, Jason J. [Sandia National Lab. (SNL-NM), Albuquerque, NM (United States); Shelley, Timothy J. [Air Force Research Lab. (AFRL), Tyndall AFB, FL (United States); Reyes, Jose A. [Applied Research Associates, Tyndall AFB, FL (United States); Hsu, Peter C. [Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States); Reynolds, John G. [Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)

    2015-10-30

    The Integrated Data Collection Analysis (IDCA) program is conducting a Proficiency Test for Small-Scale Safety and Thermal (SSST) testing of homemade explosives (HMEs). Described here are statistical analyses of the results for impact, friction, electrostatic discharge, and differential scanning calorimetry analysis of the RDX Type II Class 5 standard. The material was tested as a well-characterized standard several times during the proficiency study to assess differences among participants and the range of results that may arise for well-behaved explosive materials. The analyses show that there are detectable differences among the results from IDCA participants. While these differences are statistically significant, most of them can be disregarded for comparison purposes to assess potential variability when laboratories attempt to measure identical samples using methods assumed to be nominally the same. The results presented in this report include the average sensitivity results for the IDCA participants and the ranges of values obtained. The ranges represent variation about the mean values of the tests of between 26% and 42%. The magnitude of this variation is attributed to differences in operator, method, and environment as well as the use of different instruments that are also of varying age. The results appear to be a good representation of the broader safety testing community based on the range of methods, instruments, and environments included in the IDCA Proficiency Test.

  4. Statistical analysis of laser-interferometric detector Dylkin-1 data and data on seismic activity

    International Nuclear Information System (INIS)

    Kirillov, R S; Bochkarev, V V; Skochilov, A F [Scientific Center of Gravitational-Wave Research Dulkyn, Academy of Sciences of the Republic of Tatarstan (Russian Federation)]

    2014-01-01

    This work presents a statistical analysis of data collected from the laser interferometric detector "Dylkin-1" and nearby seismic stations. The final goal of the Dylkin project is to create a detector of the theoretically predicted gravitational waves produced by binary relativistic astrophysical objects. Work is currently underway to improve the sensitivity of the detector by 2-3 orders of magnitude. The goals of this research were to test the isolation of the detector from noise caused by seismic waves and to find out whether it is sensitive to variations in the gradient of the gravitational potential (acceleration of free fall) caused by free oscillations of the Earth. Noise isolation was tested by comparing the energy of signals during significant seismic events. Sensitivity to variations in the acceleration of free fall was tested by means of cross-spectral analysis.

  5. Strategies for improving utilization of computerized statistical data by the social science community.

    OpenAIRE

    Robbin, Alice

    1981-01-01

    In recent decades there has been a notable expansion of statistical data produced by the public and private sectors for administrative, research, policy and evaluation programs. This is due to advances in relatively inexpensive and efficient data collection and management of computer-readable statistical data. Corresponding changes have not occurred in the management of data collection, preservation, description and dissemination. As a result, the process by which data become accessible to so...

  6. Statistical Power Analysis with Missing Data A Structural Equation Modeling Approach

    CERN Document Server

    Davey, Adam

    2009-01-01

    Statistical power analysis has revolutionized the ways in which we conduct and evaluate research. Similar developments in the statistical analysis of incomplete (missing) data are gaining more widespread applications. This volume brings statistical power and incomplete data together under a common framework, in a way that is readily accessible to those with only an introductory familiarity with structural equation modeling. It answers many practical questions such as: How missing data affects the statistical power in a study; How much power is likely with different amounts and types

  7. A method for statistical comparison of data sets and its uses in analysis of nuclear physics data

    International Nuclear Information System (INIS)

    Bityukov, S.I.; Smirnova, V.V.; Krasnikov, N.V.; Maksimushkina, A.V.; Nikitenko, A.N.

    2014-01-01

    The authors propose a method for the statistical comparison of two data sets. The method is based on the method of statistical comparison of histograms. As an estimator of the quality of the decision made, it is proposed to use a value that can be called the probability that the decision (the data sets are different) is correct. [ru]
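
    One common way to compare two histograms statistically is through per-bin normalised differences. The sketch below shows that construction under an assumption of Poisson errors and equal normalisation; the record does not spell out the authors' exact statistic, so the formula, function names and bin counts here are illustrative assumptions.

```python
import math


def bin_significances(hist1, hist2):
    """Per-bin normalised differences, assuming Poisson errors on each count."""
    result = []
    for n1, n2 in zip(hist1, hist2):
        variance = n1 + n2
        if variance > 0:  # skip bins that are empty in both histograms
            result.append((n1 - n2) / math.sqrt(variance))
    return result


def mean_and_rms(values):
    """Mean and root-mean-square spread of a list of significances."""
    mean = sum(values) / len(values)
    rms = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return mean, rms


# Hypothetical bin counts (not data from the paper).
hist_a = [10, 20, 30, 20, 10]
hist_b = [12, 18, 33, 19, 8]
hist_shifted = [2, 8, 20, 32, 28]

similar = mean_and_rms(bin_significances(hist_a, hist_b))
different = mean_and_rms(bin_significances(hist_a, hist_shifted))
```

    If both histograms come from the same parent distribution, the significances should scatter around zero with a spread near one; a clearly larger spread, as for the shifted histogram, points to different data sets.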

  8. Statistical means to enhance the comparability of data within a pooled analysis of individual data in neurobehavioral toxicology

    DEFF Research Database (Denmark)

    Meyer-Baron, Monika; Schäper, Michael; Knapp, Guido

    2011-01-01

    Meta-analyses of individual participant data (IPD) provide important contributions to toxicological risk assessments. However, comparability of individual data cannot be taken for granted when information from different studies has to be summarized. By means of statistical standardization...

  9. Novel Kalman filter algorithm for statistical monitoring of extensive landscapes with synoptic sensor data

    Science.gov (United States)

    Raymond L. Czaplewski

    2015-01-01

    Wall-to-wall remotely sensed data are increasingly available to monitor landscape dynamics over large geographic areas. However, statistical monitoring programs that use post-stratification cannot fully utilize those sensor data. The Kalman filter (KF) is an alternative statistical estimator. I develop a new KF algorithm that is numerically robust with large numbers of...
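
    For readers unfamiliar with the Kalman filter as a statistical estimator, the textbook scalar form is sketched below. It is not the new algorithm developed in the paper: the function names, the one-dimensional state and the example numbers are assumptions made for illustration.

```python
def kf_predict(x, p, f=1.0, q=0.0):
    """Propagate the state estimate x and its variance p one time step."""
    return f * x, f * p * f + q


def kf_update(x, p, z, r):
    """Combine the prediction (x, p) with a measurement z of variance r."""
    gain = p / (p + r)                    # weight given to the new measurement
    return x + gain * (z - x), (1 - gain) * p


# Hypothetical example: a slowly changing forest-cover proportion.
estimate, variance = 0.40, 0.01           # prior estimate and its variance
for measurement in [0.46, 0.44, 0.47]:    # e.g. sensor-based estimates
    estimate, variance = kf_predict(estimate, variance, q=0.0005)
    estimate, variance = kf_update(estimate, variance, measurement, r=0.004)
```

    Each update weights the prior and the new measurement by their variances, which is how such an estimator can blend field data with synoptic sensor data over time.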

  10. 42 CFR 417.806 - Financial records, statistical data, and cost finding.

    Science.gov (United States)

    2010-10-01

    ... 42 Public Health 3 2010-10-01 2010-10-01 false Financial records, statistical data, and cost... MEDICAL PLANS, AND HEALTH CARE PREPAYMENT PLANS Health Care Prepayment Plans § 417.806 Financial records, statistical data, and cost finding. (a) The principles specified in § 417.568 apply to HCPPs, except those in...

  11. Statistical yearbook. 2000. Data available as of 31 January 2003. 47 ed

    International Nuclear Information System (INIS)

    2003-01-01

    This is the forty-seventh issue of the United Nations Statistical Yearbook, prepared by the Statistics Division, Department of Economic and Social Affairs of the United Nations Secretariat, since 1948. The present issue contains series covering, in general, 1989-1998 or 1990-1999, using statistics available to the Statistics Division up to 30 November 2000. The Yearbook is based on data compiled by the Statistics Division from over 40 different international and national sources. These include the United Nations Statistics Division in the fields of national accounts, industry, energy, transport and international trade; the United Nations Statistics Division and Population Division in the field of demographic statistics; and data provided by over 20 offices of the United Nations system and international organizations in other specialized fields. United Nations agencies and other international organizations which furnished data are listed under 'Statistical sources and references' at the end of the Yearbook. Acknowledgement is gratefully made for their generous cooperation in providing data. The Statistics Division also publishes the Monthly Bulletin of Statistics, which provides a valuable complement to the Yearbook covering current international economic statistics for most countries and areas of the world and quarterly world and regional aggregates. Subscribers to the Monthly Bulletin of Statistics may also access the Bulletin on-line via the World Wide Web on Internet. MBS On-line allows time-sensitive statistics to reach users much faster than the traditional print publication. For further information see . The present issue of the Yearbook reflects a phased programme of major changes in its organization and presentation undertaken in 1990 which until then was relatively unchanged since the first issue was released in 1948. The Yearbook has also been published on CD-ROM for IBM-compatible microcomputers, since the thirty-eighth issue

  12. A system for classifying wood-using industries and recording statistics for automatic data processing.

    Science.gov (United States)

    E.W. Fobes; R.W. Rowe

    1968-01-01

    A system for classifying wood-using industries and recording pertinent statistics for automatic data processing is described. Forms and coding instructions for recording data of primary processing plants are included.

  13. 78 FR 9055 - National Center for Health Statistics (NCHS), Classifications and Public Health Data Standards...

    Science.gov (United States)

    2013-02-07

    ... DEPARTMENT OF HEALTH AND HUMAN SERVICES Centers for Disease Control and Prevention National Center for Health Statistics (NCHS), Classifications and Public Health Data Standards Staff, Announces the..., Medical Systems Administrator, Classifications and Public Health Data Standards Staff, NCHS, 3311 Toledo...

  14. Software development for statistical handling of dosimetric and epidemiological data base

    International Nuclear Information System (INIS)

    Amaro, M.

    1990-01-01

    The dose records from different groups of occupationally exposed workers are available in a computerized data base whose main purpose is the individual dose follow-up. Apart from this objective, such a dosimetric data base can be useful for statistical analysis. The statistical information that can be extracted from the data base serves mainly two kinds of objectives: individual and collective dose distributions and statistics, and epidemiological statistics. The report describes the software developed to obtain the statistical reports required by the Regulatory Body, as well as any other type of dose distributions or statistics to be included in epidemiological studies. A User's Guide for the operators who handle this software package, and the code listings, are also included in the report. (Author) 2 refs

  15. Software development for statistical handling of dosimetric and epidemiological data base

    International Nuclear Information System (INIS)

    Amaro, M.

    1990-01-01

    The dose records from different groups of occupationally exposed workers are available in a computerized data base whose main purpose is the individual dose follow-up. Apart from this objective, such a dosimetric data base can be useful for statistical analysis. The statistical information that can be extracted from the data base serves mainly two kinds of objectives: individual and collective dose distributions and statistics, and epidemiological statistics. The report describes the software developed to obtain the statistical reports required by the Regulatory Body, as well as any other type of dose distributions or statistics to be included in epidemiological studies. A User's Guide for the operators who handle this software package, and the code listings, are also included in the report. (Author)

  16. AutoBayes: A System for Generating Data Analysis Programs from Statistical Models

    OpenAIRE

    Fischer, Bernd; Schumann, Johann

    2003-01-01

    Data analysis is an important scientific task which is required whenever information needs to be extracted from raw data. Statistical approaches to data analysis, which use methods from probability theory and numerical analysis, are well-founded but difficult to implement: the development of a statistical data analysis program for any given application is time-consuming and requires substantial knowledge and experience in several areas. In this paper, we describe AutoBayes, a program synthesis...

  17. The art of data analysis how to answer almost any question using basic statistics

    CERN Document Server

    Jarman, Kristin H

    2013-01-01

    A friendly and accessible approach to applying statistics in the real world. With an emphasis on critical thinking, The Art of Data Analysis: How to Answer Almost Any Question Using Basic Statistics presents fun and unique examples, guides readers through the entire data collection and analysis process, and introduces basic statistical concepts along the way. Leaving proofs and complicated mathematics behind, the author portrays the more engaging side of statistics and emphasizes its role as a problem-solving tool. In addition, light-hearted case studies

  18. Statistical Redundancy Testing for Improved Gene Selection in Cancer Classification Using Microarray Data

    Directory of Open Access Journals (Sweden)

    J. Sunil Rao

    2007-01-01

    In gene selection for cancer classification using microarray data, we define an eigenvalue-ratio statistic to measure a gene’s contribution to the joint discriminability when this gene is included in a set of genes. Based on this eigenvalue-ratio statistic, we define a novel hypothesis test for gene statistical redundancy and propose two gene selection methods. Simulation studies illustrate the agreement between statistical redundancy testing and gene selection methods. Real data examples show that the proposed gene selection methods can select a compact gene subset which can not only be used to build high-quality cancer classifiers but also shows biological relevance.

  19. PROSA: A computer program for statistical analysis of near-real-time-accountancy (NRTA) data

    International Nuclear Information System (INIS)

    Beedgen, R.; Bicking, U.

    1987-04-01

    The computer program PROSA (Program for Statistical Analysis of NRTA Data) is a tool to decide on the basis of statistical considerations if, in a given sequence of materials balance periods, a loss of material might have occurred or not. The evaluation of the material balance data is based on statistical test procedures. In PROSA three truncated sequential tests are applied to a sequence of material balances. The manual describes the statistical background of PROSA and how to use the computer program on an IBM-PC with DOS 3.1. (orig.)

  20. Statistical yearbook 2005. Data available as of March 2006. 50 ed

    International Nuclear Information System (INIS)

    2006-08-01

    The Statistical Yearbook is an annual compilation of a wide range of international economic, social and environmental statistics on over 200 countries and areas, compiled from sources including UN agencies and other international, national and specialized organizations. The 50th issue contains data available to the Statistics Division as of March 2006 and presents them in 76 tables. The number of years of data shown in the tables varies from one to ten, with the ten-year tables covering 1994 to 2003 or 1995 to 2004. Accompanying the tables are technical notes providing brief descriptions of major statistical concepts, definitions and classifications

  1. Tips and Tricks for Successful Application of Statistical Methods to Biological Data.

    Science.gov (United States)

    Schlenker, Evelyn

    2016-01-01

    This chapter discusses experimental design and use of statistics to describe characteristics of data (descriptive statistics) and inferential statistics that test the hypothesis posed by the investigator. Inferential statistics, based on probability distributions, depend upon the type and distribution of the data. For data that are continuous, randomly and independently selected, as well as normally distributed more powerful parametric tests such as Student's t test and analysis of variance (ANOVA) can be used. For non-normally distributed or skewed data, transformation of the data (using logarithms) may normalize the data allowing use of parametric tests. Alternatively, with skewed data nonparametric tests can be utilized, some of which rely on data that are ranked prior to statistical analysis. Experimental designs and analyses need to balance between committing type 1 errors (false positives) and type 2 errors (false negatives). For a variety of clinical studies that determine risk or benefit, relative risk ratios (random clinical trials and cohort studies) or odds ratios (case-control studies) are utilized. Although both use 2 × 2 tables, their premise and calculations differ. Finally, special statistical methods are applied to microarray and proteomics data, since the large number of genes or proteins evaluated increase the likelihood of false discoveries. Additional studies in separate samples are used to verify microarray and proteomic data. Examples in this chapter and references are available to help continued investigation of experimental designs and appropriate data analysis.
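
    The distinction between relative risk and odds ratio drawn above can be sketched with a single 2 × 2 table. The example below is a minimal illustration, not taken from the chapter; the cell labels a, b, c, d and the counts are hypothetical.

```python
def relative_risk(a, b, c, d):
    """Risk ratio for a 2 x 2 table.

    a: exposed with outcome      b: exposed without outcome
    c: unexposed with outcome    d: unexposed without outcome
    """
    risk_exposed = a / (a + b)
    risk_unexposed = c / (c + d)
    return risk_exposed / risk_unexposed


def odds_ratio(a, b, c, d):
    """Odds ratio for the same table: (a/b) / (c/d)."""
    return (a * d) / (b * c)


# Hypothetical counts: 20 of 100 exposed and 10 of 100 unexposed
# subjects experience the outcome.
a, b, c, d = 20, 80, 10, 90
rr = relative_risk(a, b, c, d)
or_ = odds_ratio(a, b, c, d)
print(rr, or_)
```

    The two measures share the same table but answer different questions: the relative risk compares probabilities, which requires knowing how many subjects were at risk (trials and cohort studies), whereas the odds ratio compares odds and remains estimable in case-control designs.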

  2. Testing independence of bivariate interval-censored data using modified Kendall's tau statistic.

    Science.gov (United States)

    Kim, Yuneung; Lim, Johan; Park, DoHwan

    2015-11-01

    In this paper, we study a nonparametric procedure to test independence of bivariate interval-censored data, for both current status data (case 1 interval-censored data) and case 2 interval-censored data. To do so, we propose a score-based modification of the Kendall's tau statistic for bivariate interval-censored data. Our modification defines the Kendall's tau statistic with expected numbers of concordant and discordant pairs of data. The performance of the modified approach is illustrated by simulation studies and application to the AIDS study. We compare our method to alternative approaches such as the two-stage estimation method by Sun et al. (Scandinavian Journal of Statistics, 2006) and the multiple imputation method by Betensky and Finkelstein (Statistics in Medicine, 1999b). © 2015 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
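
    For orientation, the ordinary Kendall's tau for fully observed data counts concordant and discordant pairs directly. The sketch below implements that standard version (tau-a) only; the paper's modification, which replaces these counts with expected numbers under interval censoring, is not reproduced here. The data are hypothetical.

```python
from itertools import combinations


def kendall_tau_a(x, y):
    """Kendall's tau-a for fully observed paired data (no censoring)."""
    concordant = 0
    discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
        # Pairs tied in x or y contribute to neither count.
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical paired observations.
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
tau = kendall_tau_a(x, y)
print(tau)
```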

  3. A Scan Statistic for Continuous Data Based on the Normal Probability Model

    OpenAIRE

    Konty, Kevin; Kulldorff, Martin; Huang, Lan

    2009-01-01

    Temporal, spatial and space-time scan statistics are commonly used to detect and evaluate the statistical significance of temporal and/or geographical disease clusters, without any prior assumptions on the location, time period or size of those clusters. Scan statistics are mostly used for count data, such as disease incidence or mortality. Sometimes there is an interest in looking for clusters with respect to a continuous variable, such as lead levels in children or low birth weight...

  4. A new method to determine the number of experimental data using statistical modeling methods

    Energy Technology Data Exchange (ETDEWEB)

    Jung, Jung-Ho; Kang, Young-Jin; Lim, O-Kaung; Noh, Yoojeong [Pusan National University, Busan (Korea, Republic of)

    2017-06-15

    For analyzing the statistical performance of physical systems, statistical characteristics of physical parameters such as material properties need to be estimated by collecting experimental data. For accurate statistical modeling, many such experiments may be required, but data are usually quite limited owing to the cost and time constraints of experiments. In this study, a new method for determining a reasonable number of experimental data is proposed using an area metric, after obtaining statistical models using the information on the underlying distribution, the Sequential statistical modeling (SSM) approach, and the Kernel density estimation (KDE) approach. The area metric is used as a convergence criterion to determine the necessary and sufficient number of experimental data to be acquired. The proposed method is validated in simulations, using different statistical modeling methods, different true models, and different convergence criteria. An example data set with 29 data describing the fatigue strength coefficient of SAE 950X is used for demonstrating the performance of the obtained statistical models that use a pre-determined number of experimental data in predicting the probability of failure for a target fatigue life.

  5. Statistical analysis of longitudinal quality of life data with missing measurements

    NARCIS (Netherlands)

    Zwinderman, A. H.

    1992-01-01

    The statistical analysis of longitudinal quality of life data in the presence of missing data is discussed. In cancer trials missing data are generated due to the fact that patients die, drop out, or are censored. These missing data are problematic in the monitoring of the quality of life during the

  6. Replicate This! Creating Individual-Level Data from Summary Statistics Using R

    Science.gov (United States)

    Morse, Brendan J.

    2013-01-01

    Incorporating realistic data and research examples into quantitative (e.g., statistics and research methods) courses has been widely recommended for enhancing student engagement and comprehension. One way to achieve these ends is to use a data generator to emulate the data in published research articles. "MorseGen" is a free data generator that…
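
    The general idea of emulating published data from summary statistics can be sketched as follows. This is not MorseGen's algorithm (which the abstract does not describe) and it is written in Python rather than R; it simply draws normal values and rescales them so that the sample reproduces a target mean and standard deviation exactly. The function name and parameters are hypothetical.

```python
import random
import statistics


def simulate_from_summary(n, mean, sd, seed=0):
    """Generate n values whose sample mean and sample SD match the targets."""
    rng = random.Random(seed)
    raw = [rng.gauss(0.0, 1.0) for _ in range(n)]
    raw_mean = statistics.fmean(raw)
    raw_sd = statistics.stdev(raw)
    # Standardize the draws, then shift and scale to the published values.
    return [mean + sd * (v - raw_mean) / raw_sd for v in raw]


# Hypothetical published summary: n = 30, M = 100, SD = 15.
sample = simulate_from_summary(n=30, mean=100.0, sd=15.0)
print(round(statistics.fmean(sample), 6), round(statistics.stdev(sample), 6))
```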

  7. DATA MINING AND STATISTICS METHODS USAGE FOR ADVANCED TRAINING COURSES QUALITY MEASUREMENT: CASE STUDY

    Directory of Open Access Journals (Sweden)

    Maxim I. Galchenko

    2014-01-01

    This article considers a case analysis of educational statistics data, namely the results of a survey of students in professional development courses, carried out with specialized software. The need for extended processing of the statistical results and the scheme for carrying out the analysis are shown. Conclusions on the case studied are presented.

  8. The Development Data Book: A Guide to Social and Economic Statistics. Second Edition.

    Science.gov (United States)

    Sheram, Katherine

    This data book presents statistics on countries with populations of more than one million. The statistics relate to economic development and the changes it is bringing about in the world. These statistics are measures of social and economic conditions in developing and industrial countries. Five indicators of economic development are presented,…

  9. A new statistic for the analysis of circular data in gamma-ray astronomy

    Science.gov (United States)

    Protheroe, R. J.

    1985-01-01

    A new statistic is proposed for the analysis of circular data. The statistic is designed specifically for situations where a test of uniformity is required which is powerful against alternatives in which a small fraction of the observations is grouped in a small range of directions, or phases.

  10. Toward Global Comparability of Sexual Orientation Data in Official Statistics: A Conceptual Framework of Sexual Orientation for Health Data Collection in New Zealand’s Official Statistics System

    Directory of Open Access Journals (Sweden)

    Frank Pega

    2013-01-01

    Objective. Effectively addressing health disparities experienced by sexual minority populations requires high-quality official data on sexual orientation. We developed a conceptual framework of sexual orientation to improve the quality of sexual orientation data in New Zealand’s Official Statistics System. Methods. We reviewed conceptual and methodological literature, culminating in a draft framework. To improve the framework, we held focus groups and key-informant interviews with sexual minority stakeholders and producers and consumers of official statistics. An advisory board of experts provided additional guidance. Results. The framework proposes working definitions of the sexual orientation topic and measurement concepts, describes dimensions of the measurement concepts, discusses variables framing the measurement concepts, and outlines conceptual grey areas. Conclusion. The framework proposes standard definitions and concepts for the collection of official sexual orientation data in New Zealand. It presents a model for producers of official statistics in other countries, who wish to improve the quality of health data on their citizens.

  11. Introduction to statistics and data analysis with exercises, solutions and applications in R

    CERN Document Server

    Heumann, Christian; Shalabh

    2016-01-01

    This introductory statistics textbook conveys the essential concepts and tools needed to develop and nurture statistical thinking. It presents descriptive, inductive and explorative statistical methods and guides the reader through the process of quantitative data analysis. In the experimental sciences and interdisciplinary research, data analysis has become an integral part of any scientific study. Issues such as judging the credibility of data, analyzing the data, evaluating the reliability of the obtained results and finally drawing the correct and appropriate conclusions from the results are vital. The text is primarily intended for undergraduate students in disciplines like business administration, the social sciences, medicine, politics, macroeconomics, etc. It features a wealth of examples, exercises and solutions with computer code in the statistical programming language R as well as supplementary material that will enable the reader to quickly adapt all methods to their own applications.

  12. Development of statistical analysis code for meteorological data (W-View)

    Energy Technology Data Exchange (ETDEWEB)

    Tachibana, Haruo; Sekita, Tsutomu; Yamaguchi, Takenori [Japan Atomic Energy Research Inst., Tokai, Ibaraki (Japan). Tokai Research Establishment

    2003-03-01

    A computer code (W-View: Weather View) was developed to analyze the meteorological data statistically based on 'the guideline of meteorological statistics for the safety analysis of nuclear power reactor' (Nuclear Safety Commission on January 28, 1982; revised on March 29, 2001). The code gives statistical meteorological data to assess the public dose in case of normal operation and severe accident to get the license of nuclear reactor operation. This code was revised from the original code used in a large office computer code to enable a personal computer user to analyze the meteorological data simply and conveniently and to make the statistical data tables and figures of meteorology. (author)

  13. Statistical analysis of vehicle crashes in Mississippi based on crash data from 2010 to 2014.

    Science.gov (United States)

    2017-08-15

    Traffic crash data from 2010 to 2014 were collected by Mississippi Department of Transportation (MDOT) and extracted for the study. Three tasks were conducted in this study: (1) geographic distribution of crashes; (2) descriptive statistics of crash ...

  14. A statistical method for evaluation of the experimental phase equilibrium data of simple clathrate hydrates

    DEFF Research Database (Denmark)

    Eslamimanesh, Ali; Gharagheizi, Farhad; Mohammadi, Amir H.

    2012-01-01

    We, herein, present a statistical method for diagnostics of the outliers in phase equilibrium data (dissociation data) of simple clathrate hydrates. The applied algorithm is performed on the basis of the Leverage mathematical approach, in which the statistical Hat matrix, Williams Plot, and the r...... in exponential form is used to represent/predict the hydrate dissociation pressures for three-phase equilibrium conditions (liquid water/ice–vapor-hydrate). The investigated hydrate formers are methane, ethane, propane, carbon dioxide, nitrogen, and hydrogen sulfide. It is interpreted from the obtained results...
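
    The leverage idea behind the Hat matrix can be sketched for the simplest case of a straight-line fit with one predictor, where the diagonal Hat values have a closed form. This is an illustration only: the paper's correlation, data, and exact cutoff are not given in the abstract, so the temperatures below are hypothetical and the warning cutoff 3(p + 1)/n is an assumed, commonly used convention.

```python
def leverages(x):
    """Diagonal of the Hat matrix for a straight-line fit y = b0 + b1 * x."""
    n = len(x)
    mean_x = sum(x) / n
    sxx = sum((v - mean_x) ** 2 for v in x)
    return [1.0 / n + (v - mean_x) ** 2 / sxx for v in x]


# Hypothetical dissociation temperatures (K); the last point lies far
# from the rest of the design and should receive a high leverage.
temps = [273.2, 275.0, 277.1, 279.3, 281.0, 283.4, 285.2, 310.0]
h = leverages(temps)

p = 1                               # number of predictors
cutoff = 3 * (p + 1) / len(temps)   # assumed warning leverage
flagged = [t for t, hi in zip(temps, h) if hi > cutoff]
print(flagged)
```

    In a Williams plot these Hat values would be plotted against standardized residuals, so that points with high leverage and points with large residuals can be told apart.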

  15. A random-sum Wilcoxon statistic and its application to analysis of ROC and LROC data.

    Science.gov (United States)

    Tang, Liansheng Larry; Balakrishnan, N

    2011-01-01

    The Wilcoxon-Mann-Whitney statistic is commonly used for a distribution-free comparison of two groups. One requirement for its use is that the sample sizes of the two groups are fixed. This is violated in some of the applications such as medical imaging studies and diagnostic marker studies; in the former, the violation occurs since the number of correctly localized abnormal images is random, while in the latter the violation is due to some subjects not having observable measurements. For this reason, we propose here a random-sum Wilcoxon statistic for comparing two groups in the presence of ties, and derive its variance as well as its asymptotic distribution for large sample sizes. The proposed statistic includes the regular Wilcoxon rank-sum statistic. Finally, we apply the proposed statistic for summarizing location response operating characteristic data from a liver computed tomography study, and also for summarizing diagnostic accuracy of biomarker data.
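
    The regular Wilcoxon-Mann-Whitney statistic that the proposed random-sum version generalizes can be sketched by direct pair counting, with ties counted as one half. The example assumes fixed sample sizes, so it does not implement the random-sum statistic or its variance; the scores are hypothetical.

```python
def mann_whitney_u(group_a, group_b):
    """Count pairs with a > b, scoring ties as 0.5 (fixed sample sizes)."""
    u = 0.0
    for a in group_a:
        for b in group_b:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u


# Hypothetical rating scores for abnormal and normal cases.
abnormal = [3, 4, 4, 5]
normal = [1, 2, 3, 4]
u = mann_whitney_u(abnormal, normal)
# Dividing by the number of pairs gives the empirical area under the ROC curve.
auc = u / (len(abnormal) * len(normal))
print(u, auc)
```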

  16. Statistical yearbook. 1998. Data available as of 30 November 2000. 45 ed

    International Nuclear Information System (INIS)

    2001-01-01

    This is the forty-fifth issue of the United Nations Statistical Yearbook, prepared by the Statistics Division, Department of Economic and Social Affairs of the United Nations Secretariat, since 1948. The present issue contains series covering, in general, 1989-1998 or 1990-1999, using statistics available to the Statistics Division up to 30 November 2000. The Yearbook is based on data compiled by the Statistics Division from over 40 different international and national sources. These include the United Nations Statistics Division in the fields of national accounts, industry, energy, transport and international trade; the United Nations Statistics Division and Population Division in the field of demographic statistics; and data provided by over 20 offices of the United Nations system and international organizations in other specialized fields. United Nations agencies and other international organizations which furnished data are listed under 'Statistical sources and references' at the end of the Yearbook. Acknowledgement is gratefully made for their generous cooperation in providing data. The Statistics Division also publishes the Monthly Bulletin of Statistics, which provides a valuable complement to the Yearbook covering current international economic statistics for most countries and areas of the world and quarterly world and regional aggregates. Subscribers to the Monthly Bulletin of Statistics may also access the Bulletin on-line via the World Wide Web on Internet. MBS On-line allows time-sensitive statistics to reach users much faster than the traditional print publication. For further information see . The present issue of the Yearbook reflects a phased programme of major changes in its organization and presentation undertaken in 1990 which until then was relatively unchanged since the first issue was released in 1948. One result of this process has been to reduce the total number of tables from 140 in the 37th issue to 80 in the present issue and to include

  17. Parity Specific Birth Rates for West Germany: An Attempt to Combine Survey Data and Vital Statistics

    OpenAIRE

    Kreyenfeld, Michaela

    2014-01-01

    In this paper, we combine vital statistics and survey data to obtain parity specific birth rates for West Germany. Since vital statistics do not provide birth parity information, one is confined to using estimates. The robustness of these estimates is an issue, which is unfortunately only rarely addressed when fertility indicators for (West) Germany are reported. In order to check how reliable our results are, we estimate confidence intervals and compare them to results from survey data and e...

  18. The Statistic Test on Influence of Surface Treatment to Fatigue Lifetime with Limited Data

    OpenAIRE

    Suhartono, Agus

    2009-01-01

    Judging the influence of two or more parameters on fatigue strength is sometimes problematic because of the scattered nature of fatigue data. Statistical tests can help evaluate whether changes in material characteristics resulting from specific parameters of interest are significant. The statistical tests were applied to fatigue data of AISI 1045 steel specimens. The specimens consisted of as-received specimens and shot peened specimens with 15 and 16 Almen intensity as ...

  19. Citizen Data and Official Statistics: Background Document to a Collaborative Workshop

    DEFF Research Database (Denmark)

    Grommé, Francisca; Ustek, Funda; Ruppert, Evelyn

    2017-01-01

    This working paper was written in preparation for a collaborative workshop organised for statisticians, social scientists, information and app designers and other participants inside and outside academia. The autumn 2017 workshop aimed to develop the main principles for a citizen data app...... for official statistics. Through this work we sought to conceive of a new regime of data collection in official statistics through different devices. How can we capture citizens’ meanings and intentions when they produce data? Can we develop ‘smart’ methods that do not rely on cooperating with, and data...... generated by, large tech companies, but by developing methods and data co-produced with citizens? Towards addressing these issues we developed four key concepts outlined in this document: experimentalism, citizen data, smart statistics and privacy by design. We introduced these concepts to facilitate shared...

  20. Data Acquisition and Preprocessing in Studies on Humans: What Is Not Taught in Statistics Classes?

    Science.gov (United States)

    Zhu, Yeyi; Hernandez, Ladia M; Mueller, Peter; Dong, Yongquan; Forman, Michele R

    2013-01-01

    The aim of this paper is to address issues in research that may be missing from statistics classes and important for (bio-)statistics students. In the context of a case study, we discuss data acquisition and preprocessing steps that fill the gap between research questions posed by subject matter scientists and statistical methodology for formal inference. Issues include participant recruitment, data collection training and standardization, variable coding, data review and verification, data cleaning and editing, and documentation. Despite the critical importance of these details in research, most of these issues are rarely discussed in an applied statistics program. One reason for the lack of more formal training is the difficulty in addressing the many challenges that can possibly arise in the course of a study in a systematic way. This article can help to bridge this gap between research questions and formal statistical inference by using an illustrative case study for a discussion. We hope that reading and discussing this paper and practicing data preprocessing exercises will sensitize statistics students to these important issues and achieve optimal conduct, quality control, analysis, and interpretation of a study.

  1. Data base of accident and agricultural statistics for transportation risk assessment

    Energy Technology Data Exchange (ETDEWEB)

    Saricks, C.L.; Williams, R.G.; Hopf, M.R.

    1989-11-01

    A state-level data base of accident and agricultural statistics has been developed to support risk assessment for transportation of spent nuclear fuels and high-level radioactive wastes. This data base will enhance the modeling capabilities for more route-specific analyses of potential risks associated with transportation of these wastes to a disposal site. The data base and methodology used to develop state-specific accident and agricultural data bases are described, and summaries of accident and agricultural statistics are provided. 27 refs., 9 tabs.

  2. Data base of accident and agricultural statistics for transportation risk assessment

    International Nuclear Information System (INIS)

    Saricks, C.L.; Williams, R.G.; Hopf, M.R.

    1989-11-01

    A state-level data base of accident and agricultural statistics has been developed to support risk assessment for transportation of spent nuclear fuels and high-level radioactive wastes. This data base will enhance the modeling capabilities for more route-specific analyses of potential risks associated with transportation of these wastes to a disposal site. The data base and methodology used to develop state-specific accident and agricultural data bases are described, and summaries of accident and agricultural statistics are provided. 27 refs., 9 tabs

  3. Data management in large-scale collaborative toxicity studies: how to file experimental data for automated statistical analysis.

    Science.gov (United States)

    Stanzel, Sven; Weimer, Marc; Kopp-Schneider, Annette

    2013-06-01

    High-throughput screening approaches are carried out for the toxicity assessment of a large number of chemical compounds. In such large-scale in vitro toxicity studies several hundred or thousand concentration-response experiments are conducted. The automated evaluation of concentration-response data using statistical analysis scripts saves time and yields more consistent results in comparison to data analysis performed by the use of menu-driven statistical software. Automated statistical analysis requires that concentration-response data are available in a standardised data format across all compounds. To obtain consistent data formats, a standardised data management workflow must be established, including guidelines for data storage, data handling and data extraction. In this paper two procedures for data management within large-scale toxicological projects are proposed. Both procedures are based on Microsoft Excel files as the researcher's primary data format and use a computer programme to automate the handling of data files. The first procedure assumes that data collection has not yet started whereas the second procedure can be used when data files already exist. Successful implementation of the two approaches into the European project ACuteTox is illustrated. Copyright © 2012 Elsevier Ltd. All rights reserved.

  4. Data-Mining Opportunities for Small and Medium Enterprises with Official Statistics in the UK

    Directory of Open Access Journals (Sweden)

    Coleman Shirley Y.

    2016-12-01

    There is a growing interest in data amongst small and medium enterprises (SMEs). This article looks at ways in which SMEs can combine their internal company data with open data, such as official statistics, and thereby enhance their business opportunities. Case studies are given as illustrations of the statistical and data-mining methods involved in such integrated data analytics. The article considers the barriers that prevent more SMEs from benefitting in this field and appraises some of the initiatives that are aimed at helping to overcome them. The discussion emphasizes the importance of bringing people together from the business, IT, and statistical worlds and suggests ways for statisticians to make a greater impact.

  5. Test Statistics and Confidence Intervals to Establish Noninferiority between Treatments with Ordinal Categorical Data.

    Science.gov (United States)

    Zhang, Fanghong; Miyaoka, Etsuo; Huang, Fuping; Tanaka, Yutaka

    2015-01-01

    The problem of establishing noninferiority between a new treatment and a standard (control) treatment with ordinal categorical data is discussed. A measure of treatment effect is used and a method of specifying the noninferiority margin for the measure is provided. Two Z-type test statistics are proposed in which the variance is estimated under the shifted null hypothesis using U-statistics. Furthermore, the confidence interval and the sample size formula are given based on the proposed test statistics. The proposed procedure is applied to a dataset from a clinical trial. A simulation study is conducted to compare the performance of the proposed test statistics with that of the existing ones, and the results show that the proposed test statistics are better in terms of the deviation from nominal level and the power.

  6. Statistical yearbook 2002-2004. Data available as of February 2005. 49 ed

    International Nuclear Information System (INIS)

    2005-09-01

    This is the forty-ninth issue of the United Nations Statistical Yearbook, prepared by the Statistics Division, Department of Economic and Social Affairs of the United Nations Secretariat. The data included generally cover the years between 1993 and 2003 and are, for the most part, those statistics which were available to the Statistics Division as of February 2005. The 81 tables of the Yearbook are based on data compiled by the Statistics Division from over 35 international and national sources. These sources include the United Nations Statistics Division in the fields of national accounts, industry, energy, transport and international trade, the United Nations Statistics Division and Population Division in the field of demographic statistics, and over 20 offices of the United Nations system and international organizations in other specialized fields. The Yearbook is organized in four parts. The first part, World and Region Summary, presents key world and regional aggregates and totals. In the other three parts, the subject matter is generally presented by countries or areas, with world and regional aggregates shown in some cases only. Parts two, three and four cover, respectively, population and social topics, national economic activity, and international economic relations. Each chapter ends with brief technical notes on statistical sources and methods for the tables it includes. References to sources and related methodological publications are provided at the end of the Yearbook in the section 'Statistical sources and references'. Annex I provides complete information on country and area nomenclature, and regional and other groupings used in the Yearbook. Annex II lists conversion coefficients and factors used in various tables. A list of tables added to or omitted from the last issue of the Yearbook is given in annex III. Symbols and conventions used in the Yearbook are shown in the section 'Explanatory notes', preceding the Introduction.

  7. The German Birth Order Register - order-specific data generated from perinatal statistics and statistics on out-of-hospital births 2001-2008

    OpenAIRE

    Michaela Kreyenfeld; Rembrandt D. Scholz; Frederik Peters; Ines Wlosnewski

    2010-01-01

    Until 2008, Germany’s vital statistics did not include information on the biological order of each birth. This resulted in a dearth of important demographic indicators, such as the mean age at first birth and the level of childlessness. Researchers have tried to fill this gap by generating order-specific birth rates from survey data, and by combining survey data with vital statistics. This paper takes a different approach by using hospital statistics on births to generate birth order-specific...

  8. Powerful Inference With the D-Statistic on Low-Coverage Whole-Genome Data

    DEFF Research Database (Denmark)

    Soraggi, Samuele; Wiuf, Carsten; Albrechtsen, Anders

    2018-01-01

    The detection of ancient gene flow between human populations is an important issue in population genetics. A common tool for detecting ancient admixture events is the D-statistic. The D-statistic is based on the hypothesis of a genetic relationship that involves four populations, whose correctness...... is assessed by evaluating specific coincidences of alleles between the groups. When working with high throughput sequencing data calling genotypes accurately is not always possible, therefore the D-statistic currently samples a single base from the reads of one individual per population. This implies ignoring...... much of the information in the data, an issue especially striking in the case of ancient genomes. We provide a significant improvement to overcome the problems of the D-statistic by considering all reads from multiple individuals in each population. We also apply type-specific error correction...
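    As an illustration of the single-base sampling scheme the abstract describes, the classic D-statistic is commonly written as D = (nABBA - nBABA) / (nABBA + nBABA), where the counts are taken over biallelic sites in four populations ordered (H1, H2, H3, outgroup). The sketch below is a minimal version of that textbook form, not the improved estimator of the paper; the function names and the toy sites are hypothetical.

```python
def count_patterns(sites):
    """Count ABBA and BABA sites.

    Each site is a 4-character string holding one sampled base per
    population, in the order (H1, H2, H3, outgroup). This input layout
    is an assumption made for the example.
    """
    n_abba = n_baba = 0
    for h1, h2, h3, out in sites:
        if len({h1, h2, h3, out}) != 2:
            continue  # keep biallelic sites only
        if h1 == out and h2 == h3 and h1 != h2:
            n_abba += 1  # pattern A B B A
        elif h2 == out and h1 == h3 and h1 != h2:
            n_baba += 1  # pattern B A B A
    return n_abba, n_baba


def d_statistic(n_abba, n_baba):
    """Return D = (ABBA - BABA) / (ABBA + BABA)."""
    total = n_abba + n_baba
    if total == 0:
        raise ValueError("no informative (ABBA or BABA) sites")
    return (n_abba - n_baba) / total


# Toy data (hypothetical): three ABBA sites, one BABA site, one invariant site.
toy_sites = ["ACCA", "ACCA", "CACA", "AAAA", "ACCA"]
abba, baba = count_patterns(toy_sites)
d_value = d_statistic(abba, baba)
```

    A value of D near zero is consistent with the assumed four-population tree; in practice significance is usually judged with a resampling-based standard error, which this sketch omits.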

  9. Statistical processing of natality data for the Czech Republic before and after the Chernobyl accident

    International Nuclear Information System (INIS)

    Klepetkova, Hana; Thinova, Lenka

    2010-01-01

    All available data regarding natality in Czechoslovakia (or the Czech Republic) before and after the Chernobyl accident are summarized. Data from the databases of the Czech Statistical Office and of the State Office for Nuclear Safety were used to analyze natality and mortality of children in the Czech Republic and to evaluate the relationship between the level of contamination and the change in the sex ratio at time of birth that was observed in some areas in November of 1986. Although the change in the ratio of newborn boys to girls was statistically significant, no direct relationship between that ratio and the level of contamination was found. Statistically significant changes in the sex ratio also occurred in Czechoslovakia (or in the Czech Republic) in the past, both before and after the accident. Furthermore, no statistically significant changes in the rate of stillbirths and multiple pregnancies were observed after the Chernobyl accident.

  10. Cosmology constraints from shear peak statistics in Dark Energy Survey Science Verification data

    International Nuclear Information System (INIS)

    Kacprzak, T.; Kirk, D.; Friedrich, O.; Amara, A.; Refregier, A.

    2016-01-01

    Shear peak statistics has gained a lot of attention recently as a practical alternative to the two-point statistics for constraining cosmological parameters. We perform a shear peak statistics analysis of the Dark Energy Survey (DES) Science Verification (SV) data, using weak gravitational lensing measurements from a 139 deg^2 field. We measure the abundance of peaks identified in aperture mass maps, as a function of their signal-to-noise ratio, in the signal-to-noise range 0 < S/N < 4. ... peaks with S/N > 4 would require significant corrections, which is why we do not include them in our analysis. We compare our results to the cosmological constraints from the two-point analysis on the SV field and find them to be in good agreement in both the central value and its uncertainty. Lastly, we discuss prospects for future peak statistics analysis with upcoming DES data.

  11. A scan statistic for continuous data based on the normal probability model

    Directory of Open Access Journals (Sweden)

    Huang Lan

    2009-10-01

    Temporal, spatial and space-time scan statistics are commonly used to detect and evaluate the statistical significance of temporal and/or geographical disease clusters, without any prior assumptions on the location, time period or size of those clusters. Scan statistics are mostly used for count data, such as disease incidence or mortality. Sometimes there is an interest in looking for clusters with respect to a continuous variable, such as lead levels in children or low birth weight. For such continuous data, we present a scan statistic where the likelihood is calculated using the normal probability model. It may also be used for other distributions, while still maintaining the correct alpha level. In an application of the new method, we look for geographical clusters of low birth weight in New York City.

  12. Finding differentially expressed genes in high dimensional data: Rank based test statistic via a distance measure.

    Science.gov (United States)

    Mathur, Sunil; Sadana, Ajit

    2015-12-01

    We present a rank-based test statistic for the identification of differentially expressed genes using a distance measure. The proposed test statistic is highly robust against extreme values and makes no assumption about the distribution of the parent population. Simulation studies show that the proposed test is more powerful than some of the commonly used methods, such as the paired t-test, the Wilcoxon signed rank test, and significance analysis of microarray (SAM), under certain non-normal distributions. The asymptotic distribution of the test statistic and the p-value function are discussed. The application of the proposed method is shown using a real-life data set. © The Author(s) 2011.

  13. Fundamentals and Catalytic Innovation: The Statistical and Data Management Center of the Antibacterial Resistance Leadership Group.

    Science.gov (United States)

    Huvane, Jacqueline; Komarow, Lauren; Hill, Carol; Tran, Thuy Tien T; Pereira, Carol; Rosenkranz, Susan L; Finnemeyer, Matt; Earley, Michelle; Jiang, Hongyu Jeanne; Wang, Rui; Lok, Judith; Evans, Scott R

    2017-03-15

    The Statistical and Data Management Center (SDMC) provides the Antibacterial Resistance Leadership Group (ARLG) with statistical and data management expertise to advance the ARLG research agenda. The SDMC is active at all stages of a study, including design; data collection and monitoring; data analyses and archival; and publication of study results. The SDMC enhances the scientific integrity of ARLG studies through the development and implementation of innovative and practical statistical methodologies and by educating research colleagues regarding the application of clinical trial fundamentals. This article summarizes the challenges and roles, as well as the innovative contributions in the design, monitoring, and analyses of clinical trials and diagnostic studies, of the ARLG SDMC. © The Author 2017. Published by Oxford University Press for the Infectious Diseases Society of America. All rights reserved. For permissions, e-mail: journals.permissions@oup.com.

  14. Statistical analyses of the magnet data for the advanced photon source storage ring magnets

    International Nuclear Information System (INIS)

    Kim, S.H.; Carnegie, D.W.; Doose, C.; Hogrefe, R.; Kim, K.; Merl, R.

    1995-01-01

    The statistics of the measured magnetic data of 80 dipole, 400 quadrupole, and 280 sextupole magnets of conventional resistive designs for the APS storage ring are summarized. In order to accommodate the vacuum chamber, the curved dipole has a C-type cross section and the quadrupole and sextupole cross sections have 180 degrees and 120 degrees symmetries, respectively. The data statistics include the integrated main fields, multipole coefficients, magnetic and mechanical axes, and roll angles of the main fields. The average and rms values of the measured magnet data meet the storage ring requirements.

  15. A spatial scan statistic for survival data based on Weibull distribution.

    Science.gov (United States)

    Bhatt, Vijaya; Tiwari, Neeraj

    2014-05-20

    The spatial scan statistic has been developed as a geographical cluster detection analysis tool for different types of data sets such as Bernoulli, Poisson, ordinal, normal and exponential. We propose a scan statistic for survival data based on Weibull distribution. It may also be used for other survival distributions, such as exponential, gamma, and log normal. The proposed method is applied on the survival data of tuberculosis patients for the years 2004-2005 in Nainital district of Uttarakhand, India. Simulation studies reveal that the proposed method performs well for different survival distribution functions. Copyright © 2013 John Wiley & Sons, Ltd.

  16. NEW PARADIGM OF ANALYSIS OF STATISTICAL AND EXPERT DATA IN PROBLEMS OF ECONOMICS AND MANAGEMENT

    OpenAIRE

    Orlov A. I.

    2014-01-01

    The article is devoted to the methods of analysis of statistical and expert data in problems of economics and management that are discussed in the framework of scientific specialization "Mathematical methods of economy", including organizational-economic and economic-mathematical modeling, econometrics and statistics, as well as economic aspects of decision theory, systems analysis, cybernetics, operations research. The main provisions of the new paradigm of this scientific and practical fiel...

  17. An improved method for statistical analysis of raw accelerator mass spectrometry data

    International Nuclear Information System (INIS)

    Gutjahr, A.; Phillips, F.; Kubik, P.W.; Elmore, D.

    1987-01-01

    Hierarchical statistical analysis is an appropriate method for statistical treatment of raw accelerator mass spectrometry (AMS) data. Using Monte Carlo simulations we show that this method yields more accurate estimates of isotope ratios and analytical uncertainty than the generally used propagation of errors approach. The hierarchical analysis is also useful in design of experiments because it can be used to identify sources of variability. 8 refs., 2 figs

  18. Data Analysis and Graphing in an Introductory Physics Laboratory: Spreadsheet versus Statistics Suite

    Science.gov (United States)

    Peterlin, Primoz

    2010-01-01

    Two methods of data analysis are compared: spreadsheet software and a statistics software suite. Their use is compared analysing data collected in three selected experiments taken from an introductory physics laboratory, which include a linear dependence, a nonlinear dependence and a histogram. The merits of each method are compared. (Contains 7…

  19. Indexing Combined with Statistical Deflation as a Tool for Analysis of Longitudinal Data.

    Science.gov (United States)

    Babcock, Judith A.

    Indexing is a tool that can be used with longitudinal, quantitative data for analysis of relative changes and for comparisons of changes among items. For greater accuracy, raw financial data should be deflated into constant dollars prior to indexing. This paper demonstrates the procedures for indexing, statistical deflation, and the use of…
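    The two steps named in the abstract can be sketched as follows: nominal amounts are first deflated into constant dollars with a price index, and the deflated series is then indexed to a base year (base year = 100). This is a minimal sketch under assumed inputs; the function names, the price index and all figures are hypothetical and are not taken from the paper.

```python
def deflate(nominal, price_index, base_year):
    """Convert nominal amounts into constant base-year dollars.

    nominal and price_index are dicts keyed by year (an assumed layout).
    """
    base_price = price_index[base_year]
    return {year: amount * base_price / price_index[year]
            for year, amount in nominal.items()}


def index_series(values, base_year):
    """Express each value relative to the base year, with base year = 100."""
    base_value = values[base_year]
    return {year: 100.0 * value / base_value for year, value in values.items()}


# Hypothetical budget figures and price index.
nominal_budget = {2000: 100.0, 2001: 110.0, 2002: 121.0}
price_index = {2000: 100.0, 2001: 110.0, 2002: 110.0}

real_budget = deflate(nominal_budget, price_index, base_year=2000)
real_index = index_series(real_budget, base_year=2000)
nominal_index = index_series(nominal_budget, base_year=2000)
```

    In this toy series the nominal index rises to 121 by 2002, while the deflated index reaches only 110, which is the kind of difference that deflating before indexing is meant to expose.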

  20. Data analysis and graphing in an introductory physics laboratory: spreadsheet versus statistics suite

    International Nuclear Information System (INIS)

    Peterlin, Primoz

    2010-01-01

    Two methods of data analysis are compared: spreadsheet software and a statistics software suite. Their use is compared analysing data collected in three selected experiments taken from an introductory physics laboratory, which include a linear dependence, a nonlinear dependence and a histogram. The merits of each method are compared.

  1. 75 FR 56549 - National Center for Health Statistics (NCHS), Classifications and Public Health Data Standards...

    Science.gov (United States)

    2010-09-16

    ... DEPARTMENT OF HEALTH AND HUMAN SERVICES Centers for Disease Control and Prevention National Center for Health Statistics (NCHS), Classifications and Public Health Data Standards Staff, Announces the... Public Health Data Standards Staff, NCHS, 3311 Toledo Road, Room 2337, Hyattsville, Maryland 20782, e...

  2. 75 FR 39265 - National Center for Health Statistics (NCHS), Classifications and Public Health Data Standards...

    Science.gov (United States)

    2010-07-08

    ... DEPARTMENT OF HEALTH AND HUMAN SERVICES Centers for Disease Control and Prevention National Center for Health Statistics (NCHS), Classifications and Public Health Data Standards Staff, Announces the... Prevention, Classifications and Public Health Data Standards, 3311 Toledo Road, Room 2337, Hyattsville, MD...

  3. 78 FR 53148 - National Center for Health Statistics (NCHS), Classifications and Public Health Data Standards...

    Science.gov (United States)

    2013-08-28

    ... DEPARTMENT OF HEALTH AND HUMAN SERVICES Centers for Disease Control and Prevention National Center for Health Statistics (NCHS), Classifications and Public Health Data Standards Staff, Announces the... Administrator, Classifications and Public Health Data Standards Staff, NCHS, 3311 Toledo Road, Room 2337...

  4. Statistical analysis of solid waste composition data: Arithmetic mean, standard deviation and correlation coefficients.

    Science.gov (United States)

    Edjabou, Maklawe Essonanawe; Martín-Fernández, Josep Antoni; Scheutz, Charlotte; Astrup, Thomas Fruergaard

    2017-11-01

    Data for fractional solid waste composition provide relative magnitudes of individual waste fractions, the percentages of which always sum to 100, thereby connecting them intrinsically. Due to this sum constraint, waste composition data represent closed data, and their interpretation and analysis require statistical methods, other than classical statistics that are suitable only for non-constrained data such as absolute values. However, the closed characteristics of waste composition data are often ignored when analysed. The results of this study showed, for example, that unavoidable animal-derived food waste amounted to 2.21±3.12% with a confidence interval of (-4.03; 8.45), which highlights the problem of the biased negative proportions. A Pearson's correlation test, applied to waste fraction generation (kg mass), indicated a positive correlation between avoidable vegetable food waste and plastic packaging. However, correlation tests applied to waste fraction compositions (percentage values) showed a negative association in this regard, thus demonstrating that statistical analyses applied to compositional waste fraction data, without addressing the closed characteristics of these data, have the potential to generate spurious or misleading results. Therefore, compositional data should be transformed adequately prior to any statistical analysis, such as computing mean, standard deviation and correlation coefficients. Copyright © 2017 Elsevier Ltd. All rights reserved.
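    The abstract does not say which transformation is meant. One common choice in compositional data analysis is the centered log-ratio (clr), which takes the logarithm of each part divided by the geometric mean of all parts. The sketch below assumes that choice; the function names and the waste fractions are hypothetical.

```python
import math


def closure(parts):
    """Rescale positive parts so that they sum to 100 (percentages)."""
    total = sum(parts)
    return [100.0 * p / total for p in parts]


def clr(composition):
    """Centered log-ratio transform: log(part / geometric mean of parts).

    Requires strictly positive parts; zeros need separate handling,
    which this sketch does not attempt.
    """
    if any(p <= 0 for p in composition):
        raise ValueError("clr requires strictly positive parts")
    logs = [math.log(p) for p in composition]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


# Hypothetical waste fractions in kg: food, plastic, paper.
masses_kg = [1.0, 2.0, 7.0]
percentages = closure(masses_kg)
clr_from_mass = clr(masses_kg)
clr_from_percent = clr(percentages)
```

    The clr coordinates sum to zero and are the same whether computed from the masses or from the percentages, so means, standard deviations and correlations computed on them do not depend on the arbitrary total of 100.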

  5. A log-Weibull spatial scan statistic for time to event data.

    Science.gov (United States)

    Usman, Iram; Rosychuk, Rhonda J

    2018-06-13

    Spatial scan statistics have been used for the identification of geographic clusters of elevated numbers of cases of a condition such as disease outbreaks. These statistics accompanied by the appropriate distribution can also identify geographic areas with either longer or shorter time to events. Other authors have proposed the spatial scan statistics based on the exponential and Weibull distributions. We propose the log-Weibull as an alternative distribution for the spatial scan statistic for time to events data and compare and contrast the log-Weibull and Weibull distributions through simulation studies. The effect of type I differential censoring and power have been investigated through simulated data. Methods are also illustrated on time to specialist visit data for discharged patients presenting to emergency departments for atrial fibrillation and flutter in Alberta during 2010-2011. We found northern regions of Alberta had longer times to specialist visit than other areas. We proposed the spatial scan statistic for the log-Weibull distribution as a new approach for detecting spatial clusters for time to event data. The simulation studies suggest that the test performs well for log-Weibull data.

  6. Using assemblage data in ecological indicators: A comparison and evaluation of commonly available statistical tools

    Science.gov (United States)

    Smith, Joseph M.; Mather, Martha E.

    2012-01-01

    Ecological indicators are science-based tools used to assess how human activities have impacted environmental resources. For monitoring and environmental assessment, existing species assemblage data can be used to make these comparisons through time or across sites. An impediment to using assemblage data, however, is that these data are complex and need to be simplified in an ecologically meaningful way. Because multivariate statistics are mathematical relationships, statistical groupings may not make ecological sense and will not have utility as indicators. Our goal was to define a process to select defensible and ecologically interpretable statistical simplifications of assemblage data in which researchers and managers can have confidence. For this, we chose a suite of statistical methods, compared the groupings that resulted from these analyses, identified convergence among groupings, then we interpreted the groupings using species and ecological guilds. When we tested this approach using a statewide stream fish dataset, not all statistical methods worked equally well. For our dataset, logistic regression (Log), detrended correspondence analysis (DCA), cluster analysis (CL), and non-metric multidimensional scaling (NMDS) provided consistent, simplified output. Specifically, the Log, DCA, CL-1, and NMDS-1 groupings were ≥60% similar to each other, overlapped with the fluvial-specialist ecological guild, and contained a common subset of species. Groupings based on number of species (e.g., Log, DCA, CL and NMDS) outperformed groupings based on abundance [e.g., principal components analysis (PCA) and Poisson regression]. Although the specific methods that worked on our test dataset have generality, here we are advocating a process (e.g., identifying convergent groupings with redundant species composition that are ecologically interpretable) rather than the automatic use of any single statistical tool. We summarize this process in step-by-step guidance for the

  7. Exploratory study on a statistical method to analyse time resolved data obtained during nanomaterial exposure measurements

    International Nuclear Information System (INIS)

    Clerc, F; Njiki-Menga, G-H; Witschger, O

    2013-01-01

    Most of the measurement strategies that are suggested at the international level to assess workplace exposure to nanomaterials rely on devices measuring, in real time, airborne particle concentrations (according to different metrics). Since none of the instruments to measure aerosols can distinguish a particle of interest from the background aerosol, the statistical analysis of time resolved data requires special attention. So far, very few approaches have been used for statistical analysis in the literature. These range from simple qualitative analysis of graphs to the implementation of more complex statistical models. To date, there is still no consensus on a particular approach, and the search for an appropriate and robust method continues. In this context, this exploratory study investigates a statistical method to analyse time resolved data based on a Bayesian probabilistic approach. To investigate and illustrate the use of this statistical method, particle number concentration data from a workplace study that investigated the potential for exposure via inhalation from cleanout operations by sandpapering of a reactor producing nanocomposite thin films have been used. In this workplace study, the background issue has been addressed through the near-field and far-field approaches and several size integrated and time resolved devices have been used. The analysis of the results presented here focuses only on data obtained with two handheld condensation particle counters. While one was measuring at the source of the released particles, the other one was measuring in parallel far-field. The Bayesian probabilistic approach allows a probabilistic modelling of data series, and the observed task is modelled in the form of probability distributions. The probability distributions issuing from time resolved data obtained at the source can be compared with the probability distributions issuing from the time resolved data obtained far-field, leading in a

  8. OkCupid Data for Introductory Statistics and Data Science Courses

    Science.gov (United States)

    Kim, Albert Y.; Escobedo-Land, Adriana

    2015-01-01

    We present a data set consisting of user profile data for 59,946 San Francisco OkCupid users (a free online dating website) from June 2012. The data set includes typical user information, lifestyle variables, and text responses to 10 essay questions. We present four example analyses suitable for use in undergraduate introductory probability and…

  9. Data for Development : An Evaluation of World Bank Support for Data and Statistical Capacity

    OpenAIRE

    Independent Evaluation Group

    2017-01-01

    This evaluation’s objective was to assess how effectively the World Bank has supported development data production, sharing, and use, and to suggest ways to improve its approach. This evaluation defines development data as data produced by country systems, the World Bank, or third parties on countries’ social, economic, and environmental issues. At the global level, the World Bank has a st...

  10. Conjunction analysis and propositional logic in fMRI data analysis using Bayesian statistics.

    Science.gov (United States)

    Rudert, Thomas; Lohmann, Gabriele

    2008-12-01

    To evaluate logical expressions over different effects in data analyses using the general linear model (GLM) and to evaluate logical expressions over different posterior probability maps (PPMs). In functional magnetic resonance imaging (fMRI) data analysis, the GLM was applied to estimate unknown regression parameters. Based on the GLM, Bayesian statistics can be used to determine the probability of conjunction, disjunction, implication, or any other arbitrary logical expression over different effects or contrast. For second-level inferences, PPMs from individual sessions or subjects are utilized. These PPMs can be combined to a logical expression and its probability can be computed. The methods proposed in this article are applied to data from a STROOP experiment and the methods are compared to conjunction analysis approaches for test-statistics. The combination of Bayesian statistics with propositional logic provides a new approach for data analyses in fMRI. Two different methods are introduced for propositional logic: the first for analyses using the GLM and the second for common inferences about different probability maps. The methods introduced extend the idea of conjunction analysis to a full propositional logic and adapt it from test-statistics to Bayesian statistics. The new approaches allow inferences that are not possible with known standard methods in fMRI. (c) 2008 Wiley-Liss, Inc.

  11. Statistical intercomparison of global climate models: A common principal component approach with application to GCM data

    International Nuclear Information System (INIS)

    Sengupta, S.K.; Boyle, J.S.

    1993-05-01

    Variables describing atmospheric circulation and other climate parameters derived from various GCMs and obtained from observations can be represented on a spatio-temporal grid (lattice) structure. The primary objective of this paper is to explore existing as well as some new statistical methods to analyze such data structures for the purpose of model diagnostics and intercomparison from a statistical perspective. Among the several statistical methods considered here, a new method based on common principal components appears most promising for the purpose of intercomparison of spatio-temporal data structures arising in the task of model/model and model/data intercomparison. A complete strategy for such an intercomparison is outlined. The strategy includes two steps. First, the commonality of spatial structures in two (or more) fields is captured in the common principal vectors. Second, the corresponding principal components obtained as time series are then compared on the basis of similarities in their temporal evolution

  12. Demonstration of a software design and statistical analysis methodology with application to patient outcomes data sets.

    Science.gov (United States)

    Mayo, Charles; Conners, Steve; Warren, Christopher; Miller, Robert; Court, Laurence; Popple, Richard

    2013-11-01

    With emergence of clinical outcomes databases as tools utilized routinely within institutions, comes need for software tools to support automated statistical analysis of these large data sets and intrainstitutional exchange from independent federated databases to support data pooling. In this paper, the authors present a design approach and analysis methodology that addresses both issues. A software application was constructed to automate analysis of patient outcomes data using a wide range of statistical metrics, by combining use of C#.Net and R code. The accuracy and speed of the code was evaluated using benchmark data sets. The approach provides data needed to evaluate combinations of statistical measurements for ability to identify patterns of interest in the data. Through application of the tools to a benchmark data set for dose-response threshold and to SBRT lung data sets, an algorithm was developed that uses receiver operator characteristic curves to identify a threshold value and combines use of contingency tables, Fisher exact tests, Welch t-tests, and Kolmogorov-Smirnov tests to filter the large data set to identify values demonstrating dose-response. Kullback-Leibler divergences were used to provide additional confirmation. The work demonstrates the viability of the design approach and the software tool for analysis of large data sets.
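    Of the tests listed in the abstract, the Welch t-test is simple enough to sketch directly. The example below computes the Welch statistic and the Welch-Satterthwaite degrees of freedom with the Python standard library only; it is not the authors' C#.Net and R implementation, the function name is hypothetical, and the two groups are made-up numbers. Converting the statistic to a p-value needs a t-distribution routine, which is left out here.

```python
import math
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Return (t, df) for Welch's unequal-variance two-sample t-test."""
    if len(group_a) < 2 or len(group_b) < 2:
        raise ValueError("each group needs at least two observations")
    # Squared standard error of each group mean (sample variance / n).
    se2_a = variance(group_a) / len(group_a)
    se2_b = variance(group_b) / len(group_b)
    if se2_a + se2_b == 0:
        raise ValueError("both groups have zero variance")
    t = (mean(group_a) - mean(group_b)) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(group_a) - 1) + se2_b ** 2 / (len(group_b) - 1)
    )
    return t, df


# Hypothetical values for patients below and above a candidate threshold.
below_threshold = [1.0, 2.0, 3.0, 4.0, 5.0]
above_threshold = [2.0, 4.0, 6.0, 8.0, 10.0]
t_value, dof = welch_t(below_threshold, above_threshold)
```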

  13. ROOT - A C++ Framework for Petabyte Data Storage, Statistical Analysis and Visualization

    CERN Document Server

    Naumann, Axel; Ballintijn, Maarten; Bellenot, Bertrand; Biskup, Marek; Brun, Rene; Buncic, Nenad; Canal, Philippe; Casadei, Diego; Couet, Olivier; Fine, Valery; Franco, Leandro; Ganis, Gerardo; Gheata, Andrei; Gonzalez Maline, David; Goto, Masaharu; Iwaszkiewicz, Jan; Kreshuk, Anna; Marcos Segura, Diego; Maunder, Richard; Moneta, Lorenzo; Offermann, Eddy; Onuchin, Valeriy; Panacek, Suzanne; Rademakers, Fons; Russo, Paul; Tadel, Matevz

    2009-01-01

    ROOT is an object-oriented C++ framework conceived in the high-energy physics (HEP) community, designed for storing and analyzing petabytes of data in an efficient way. Any instance of a C++ class can be stored into a ROOT file in a machine-independent compressed binary format. In ROOT the TTree object container is optimized for statistical data analysis over very large data sets by using vertical data storage techniques. These containers can span a large number of files on local disks, the web, or a number of different shared file systems. In order to analyze this data, the user can choose from a wide set of mathematical and statistical functions, including linear algebra classes, numerical algorithms such as integration and minimization, and various methods for performing regression analysis (fitting). In particular, the RooFit package allows the user to perform complex data modeling and fitting while the RooStats library provides abstractions and implementations for advance...

  14. Statistical yearbook 2001. Data available as of 15 December 2003. 48 ed

    International Nuclear Information System (INIS)

    2004-01-01

    This is the forty-eighth issue of the United Nations Statistical Yearbook, prepared by the Statistics Division, Department of Economic and Social Affairs of the United Nations Secretariat. It contains series covering, in general, 1990-1999 or 1991-2000, based on statistics available to the Statistics Division up to 15 December 2003. The major purpose of the Statistical Yearbook is to provide in a single volume a comprehensive compilation of internationally available statistics on social and economic conditions and activities, at world, regional and national levels, covering roughly a ten-year period. Most of the statistics presented in the Yearbook are extracted from more detailed, specialized publications prepared by the Statistics Division and by many other international statistical services. Thus, while the specialized publications concentrate on monitoring topics and trends in particular social and economic fields, the Statistical Yearbook tables provide data for a more comprehensive, overall description of social and economic structures, conditions, changes and activities. The objective has been to collect, systematize and coordinate the most essential components of comparable statistical information which can give a broad and, to the extent feasible, a consistent picture of social and economic processes at world, regional and national levels. More specifically, the Statistical Yearbook provides systematic information on a wide range of social and economic issues which are of concern in the United Nations system and among the governments and peoples of the world. A particular value of the Yearbook, but also its greatest challenge, is that these issues are extensively interrelated. Meaningful analysis of these issues requires systematization and coordination of the data across many fields.
These issues include: General economic growth and related economic conditions; economic situation in developing countries and progress towards the objectives adopted for the

  15. Inference on network statistics by restricting to the network space: applications to sexual history data.

    Science.gov (United States)

    Goyal, Ravi; De Gruttola, Victor

    2018-01-30

    Analysis of sexual history data intended to describe sexual networks presents many challenges arising from the fact that most surveys collect information on only a very small fraction of the population of interest. In addition, partners are rarely identified and responses are subject to reporting biases. Typically, each network statistic of interest, such as mean number of sexual partners for men or women, is estimated independently of other network statistics. There is, however, a complex relationship among network statistics, and knowledge of these relationships can aid in addressing the concerns mentioned earlier. We develop a novel method that constrains a posterior predictive distribution of a collection of network statistics in order to leverage the relationships among network statistics in making inference about network properties of interest. The method ensures that inference on network properties is compatible with an actual network. Through extensive simulation studies, we also demonstrate that use of this method can improve estimates in settings where there is uncertainty that arises both from sampling and from systematic reporting bias compared with currently available approaches to estimation. To illustrate the method, we apply it to estimate network statistics using data from the Chicago Health and Social Life Survey. Copyright © 2017 John Wiley & Sons, Ltd.

  16. Secure and scalable deduplication of horizontally partitioned health data for privacy-preserving distributed statistical computation.

    Science.gov (United States)

    Yigzaw, Kassaye Yitbarek; Michalas, Antonis; Bellika, Johan Gustav

    2017-01-03

    Techniques have been developed to compute statistics on distributed datasets without revealing private information except the statistical results. However, duplicate records in a distributed dataset may lead to incorrect statistical results. Therefore, to increase the accuracy of the statistical analysis of a distributed dataset, secure deduplication is an important preprocessing step. We designed a secure protocol for the deduplication of horizontally partitioned datasets with deterministic record linkage algorithms. We provided a formal security analysis of the protocol in the presence of semi-honest adversaries. The protocol was implemented and deployed across three microbiology laboratories located in Norway, and we ran experiments on the datasets in which the number of records for each laboratory varied. Experiments were also performed on simulated microbiology datasets and data custodians connected through a local area network. The security analysis demonstrated that the protocol protects the privacy of individuals and data custodians under a semi-honest adversarial model. More precisely, the protocol remains secure with the collusion of up to N - 2 corrupt data custodians. The total runtime for the protocol scales linearly with the addition of data custodians and records. One million simulated records distributed across 20 data custodians were deduplicated within 45 s. The experimental results showed that the protocol is more efficient and scalable than previous protocols for the same problem. The proposed deduplication protocol is efficient and scalable for practical uses while protecting the privacy of patients and data custodians.

  17. Identification of reliable gridded reference data for statistical downscaling methods in Alberta

    Science.gov (United States)

    Eum, H. I.; Gupta, A.

    2017-12-01

    Climate models provide essential information to assess impacts of climate change at regional and global scales. However, statistical downscaling methods have been applied to prepare climate model data for various applications such as hydrologic and ecologic modelling at a watershed scale. As the reliability and (spatial and temporal) resolution of statistically downscaled climate data depend mainly on the reference data, identifying the most reliable reference data is crucial for statistical downscaling. A growing number of gridded climate products are available for key climate variables which are main input data to regional modelling systems. However, inconsistencies in these climate products, for example, different combinations of climate variables, varying data domains and data lengths and data accuracy varying with physiographic characteristics of the landscape, have caused significant challenges in selecting the most suitable reference climate data for various environmental studies and modelling. Employing various observation-based daily gridded climate products available in the public domain, i.e. thin plate spline regression products (ANUSPLIN and TPS), inverse distance method (Alberta Townships), and numerical climate model (North American Regional Reanalysis) and an optimum interpolation technique (Canadian Precipitation Analysis), this study evaluates the accuracy of the climate products at each grid point by comparing with the Adjusted and Homogenized Canadian Climate Data (AHCCD) observations for precipitation, minimum and maximum temperature over the province of Alberta. Based on the performance of climate products at AHCCD stations, we ranked the reliability of these publicly available climate products corresponding to the elevations of stations discretized into several classes. According to the rank of climate products for each elevation class, we identified the most reliable climate products based on the elevation of target points. A web-based system

  18. A simple method to downscale daily wind statistics to hourly wind data

    OpenAIRE

    Guo, Zhongling

    2013-01-01

    Wind is the principal driver in wind erosion models. Hourly wind speed data are generally required for precise wind erosion modeling. In this study, a simple method to generate hourly wind speed data from daily wind statistics (daily average and maximum wind speeds together or daily average wind speed only) was established. Measured hourly wind speed data from a typical windy location, covering 3285 days (9 years), were used to validate the downscaling method. The results showed that the over...

  19. Research on cloud background infrared radiation simulation based on fractal and statistical data

    Science.gov (United States)

    Liu, Xingrun; Xu, Qingshan; Li, Xia; Wu, Kaifeng; Dong, Yanbing

    2018-02-01

    Clouds are an important natural phenomenon, and their radiation causes serious interference to infrared detectors. Based on fractal and statistical data, a method is proposed to realize cloud background simulation, and the cloud infrared radiation data field is assigned using satellite radiation data of clouds. A cloud infrared radiation simulation model is established using MATLAB, and it can generate cloud background infrared images for different cloud types (low cloud, middle cloud, and high cloud) in different months, bands and sensor zenith angles.

  20. Data Mining Foundations and Intelligent Paradigms Volume 2 Statistical, Bayesian, Time Series and other Theoretical Aspects

    CERN Document Server

    Jain, Lakhmi

    2012-01-01

    Data mining is one of the most rapidly growing research areas in computer science and statistics. In Volume 2 of this three volume series, we have brought together contributions from some of the most prestigious researchers in theoretical data mining. Each of the chapters is self contained. Statisticians and applied scientists/ engineers will find this volume valuable. Additionally, it provides a sourcebook for graduate students interested in the current direction of research in data mining.

  1. Statistical issues in the parton distribution analysis of the Tevatron jet data

    International Nuclear Information System (INIS)

    Alekhin, S.; Bluemlein, J.; Moch, S.O.; Hamburg Univ.

    2012-11-01

    We analyse a tension between the D0 and CDF inclusive jet data and the perturbative QCD calculations, which are based on the ABKM09 and ABM11 parton distribution functions (PDFs) within the nuisance parameter framework. Particular attention is paid to the uncertainties in the nuisance parameters due to the data fluctuations and the PDF errors. We show that, when these uncertainties are taken into account, the nuisance parameters do not demonstrate a statistically significant excess. A statistical bias of the estimator based on the nuisance parameters is also discussed.

  2. Bias in iterative reconstruction of low-statistics PET data: benefits of a resolution model

    Energy Technology Data Exchange (ETDEWEB)

    Walker, M D; Asselin, M-C; Julyan, P J; Feldmann, M; Matthews, J C [School of Cancer and Enabling Sciences, Wolfson Molecular Imaging Centre, MAHSC, University of Manchester, Manchester M20 3LJ (United Kingdom); Talbot, P S [Mental Health and Neurodegeneration Research Group, Wolfson Molecular Imaging Centre, MAHSC, University of Manchester, Manchester M20 3LJ (United Kingdom); Jones, T, E-mail: matthew.walker@manchester.ac.uk [Academic Department of Radiation Oncology, Christie Hospital, University of Manchester, Manchester M20 4BX (United Kingdom)

    2011-02-21

    Iterative image reconstruction methods such as ordered-subset expectation maximization (OSEM) are widely used in PET. Reconstructions via OSEM are however reported to be biased for low-count data. We investigated this and considered the impact for dynamic PET. Patient listmode data were acquired in [¹¹C]DASB and [¹⁵O]H₂O scans on the HRRT brain PET scanner. These data were subsampled to create many independent, low-count replicates. The data were reconstructed and the images from low-count data were compared to the high-count originals (from the same reconstruction method). This comparison enabled low-statistics bias to be calculated for the given reconstruction, as a function of the noise-equivalent counts (NEC). Two iterative reconstruction methods were tested, one with and one without an image-based resolution model (RM). Significant bias was observed when reconstructing data of low statistical quality, for both subsampled human and simulated data. For human data, this bias was substantially reduced by including a RM. For [¹¹C]DASB the low-statistics bias in the caudate head at 1.7 M NEC (approx. 30 s) was -5.5% and -13% with and without RM, respectively. We predicted biases in the binding potential of -4% and -10%. For quantification of cerebral blood flow for the whole-brain grey- or white-matter, using [¹⁵O]H₂O and the PET autoradiographic method, a low-statistics bias of <2.5% and <4% was predicted for reconstruction with and without the RM. The use of a resolution model reduces low-statistics bias and can hence be beneficial for quantitative dynamic PET.

  3. Explanation of the methods employed in the statistical evaluation of SALE program data

    International Nuclear Information System (INIS)

    Bracey, J.T.; Soriano, M.

    1981-01-01

    The analysis of Safeguards Analytical Laboratory Evaluation (SALE) bimonthly data is described. Statistical procedures are discussed in Section A, followed by the descriptions of tabular and graphic values in Section B. Calculation formulae for the various statistics in the reports are presented in Section C. SALE data reported to New Brunswick Laboratory (NBL) are entered into a computerized system through routine data processing procedures. Bimonthly and annual reports are generated from this data system. In the bimonthly data analysis, data from the six most recent reporting periods of each laboratory-material-analytical method combination are utilized. Analysis results in the bimonthly reports are only presented for those participants who have reported data at least once during the last 12-month period. Reported values are transformed to relative percent difference values calculated by [(reported value - reference value)/reference value] × 100. Analysis of data is performed on these transformed values. Accordingly, the results given in the bimonthly report are (relative) percent differences (% DIFF). Suspect, large variations are verified with individual participants to eliminate errors in the transcription process. Statistical extreme values are not excluded from bimonthly analysis; all data are used.
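
    The relative percent difference transformation quoted in the abstract above can be written as a one-line function. This is a small sketch only; the function name and the sample values are hypothetical and not taken from the SALE reports.

```python
def percent_diff(reported, reference):
    """Relative percent difference: [(reported - reference) / reference] x 100."""
    if reference == 0:
        raise ValueError("reference value must be non-zero")
    return (reported - reference) / reference * 100.0

# Hypothetical reported values against a reference value of 100.0.
reported_values = [99.0, 102.5, 100.0]
transformed = [percent_diff(v, 100.0) for v in reported_values]
print(transformed)
```

    A negative result means the laboratory reported less than the reference value, a positive result more.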

  4. RADSS: an integration of GIS, spatial statistics, and network service for regional data mining

    Science.gov (United States)

    Hu, Haitang; Bao, Shuming; Lin, Hui; Zhu, Qing

    2005-10-01

    Regional data mining, which aims at the discovery of knowledge about spatial patterns, clusters or association between regions, has wide applications nowadays in social science, such as sociology, economics, epidemiology, crime, and so on. Many applications in the regional or other social sciences are more concerned with the spatial relationship, rather than the precise geographical location. Based on the spatial continuity rule derived from Tobler's first law of geography (observations at two sites tend to be more similar to each other if the sites are close together than if far apart), spatial statistics, as an important means for spatial data mining, allow the users to extract interesting and useful information such as spatial pattern, spatial structure, spatial association, spatial outlier and spatial interaction from the vast amount of spatial data or non-spatial data. Therefore, by integrating with the spatial statistical methods, the geographical information systems will become more powerful in gaining further insights into the nature of spatial structure of regional system, and help the researchers to be more careful when selecting appropriate models. However, the lack of such tools holds back the application of spatial data analysis techniques and development of new methods and models (e.g., spatio-temporal models). Herein, we make an attempt to develop such integrated software and apply it to the complex system analysis for the Poyang Lake Basin. This paper presents a framework for integrating GIS, spatial statistics and network service in regional data mining, as well as their implementation. After discussing the spatial statistics methods involved in regional complex system analysis, we introduce RADSS (Regional Analysis and Decision Support System), our new regional data mining tool, by integrating GIS, spatial statistics and network service. RADSS includes the functions of spatial data visualization, exploratory spatial data analysis, and

  5. Drug safety data mining with a tree-based scan statistic.

    Science.gov (United States)

    Kulldorff, Martin; Dashevsky, Inna; Avery, Taliser R; Chan, Arnold K; Davis, Robert L; Graham, David; Platt, Richard; Andrade, Susan E; Boudreau, Denise; Gunter, Margaret J; Herrinton, Lisa J; Pawloski, Pamala A; Raebel, Marsha A; Roblin, Douglas; Brown, Jeffrey S

    2013-05-01

    In post-marketing drug safety surveillance, data mining can potentially detect rare but serious adverse events. Assessing an entire collection of drug-event pairs is traditionally performed on a predefined level of granularity. It is unknown a priori whether a drug causes a very specific or a set of related adverse events, such as mitral valve disorders, all valve disorders, or different types of heart disease. This methodological paper evaluates the tree-based scan statistic data mining method to enhance drug safety surveillance. We use a three-million-member electronic health records database from the HMO Research Network. Using the tree-based scan statistic, we assess the safety of selected antifungal and diabetes drugs, simultaneously evaluating overlapping diagnosis groups at different granularity levels, adjusting for multiple testing. Expected and observed adverse event counts were adjusted for age, sex, and health plan, producing a log likelihood ratio test statistic. Out of 732 evaluated disease groupings, 24 were statistically significant, divided among 10 non-overlapping disease categories. Five of the 10 signals are known adverse effects, four are likely due to confounding by indication, while one may warrant further investigation. The tree-based scan statistic can be successfully applied as a data mining tool in drug safety surveillance using observational data. The total number of statistical signals was modest and does not imply a causal relationship. Rather, data mining results should be used to generate candidate drug-event pairs for rigorous epidemiological studies to evaluate the individual and comparative safety profiles of drugs. Copyright © 2013 John Wiley & Sons, Ltd.
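
    The abstract above mentions a log likelihood ratio built from observed and expected counts but does not give the formula. As a hedged sketch, the code below uses the Poisson likelihood ratio commonly used for scan statistics, conditional on the total number of events, and takes the maximum over tree nodes. The node names are borrowed from the abstract's example; all counts are invented for illustration, and the adjustment for age, sex and health plan is assumed to be already folded into the expected counts.

```python
from math import log

def poisson_llr(observed, expected, total):
    """Log likelihood ratio for one node, conditional on `total` events overall.

    Only excess risk is of interest, so the statistic is 0 unless
    observed > expected. Assumes 0 < expected < total.
    """
    if observed <= expected:
        return 0.0
    rest = total - observed
    inside = observed * log(observed / expected)
    outside = rest * log(rest / (total - expected)) if rest > 0 else 0.0
    return inside + outside

# Invented counts: node -> (observed events, expected events).
total_events = 100
nodes = {
    "mitral valve disorders": (12, 5.0),
    "all valve disorders": (20, 14.0),
    "heart disease": (40, 38.0),
}

llr = {name: poisson_llr(obs, exp, total_events) for name, (obs, exp) in nodes.items()}
best = max(llr, key=llr.get)
print(best, round(llr[best], 2))
```

    Because overlapping nodes at several levels of the tree are scanned at once, the largest statistic has to be judged against its distribution under the null hypothesis rather than against a single-test cut-off; the abstract states that the analysis adjusts for multiple testing but not how.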

  6. Methods in pharmacoepidemiology: a review of statistical analyses and data reporting in pediatric drug utilization studies.

    Science.gov (United States)

    Sequi, Marco; Campi, Rita; Clavenna, Antonio; Bonati, Maurizio

    2013-03-01

    To evaluate the quality of data reporting and statistical methods performed in drug utilization studies in the pediatric population. Drug utilization studies evaluating all drug prescriptions to children and adolescents published between January 1994 and December 2011 were retrieved and analyzed. For each study, information on measures of exposure/consumption, the covariates considered, descriptive and inferential analyses, statistical tests, and methods of data reporting was extracted. An overall quality score was created for each study using a 12-item checklist that took into account the presence of outcome measures, covariates of measures, descriptive measures, statistical tests, and graphical representation. A total of 22 studies were reviewed and analyzed. Of these, 20 studies reported at least one descriptive measure. The mean was the most commonly used measure (18 studies), but only five of these also reported the standard deviation. Statistical analyses were performed in 12 studies, with the chi-square test being the most commonly performed test. Graphs were presented in 14 papers. Sixteen papers reported the number of drug prescriptions and/or packages, and ten reported the prevalence of the drug prescription. The mean quality score was 8 (median 9). Only seven of the 22 studies received a score of ≥10, while four studies received a score of statistical methods and reported data in a satisfactory manner. We therefore conclude that the methodology of drug utilization studies needs to be improved.

  7. Statistical Multipath Model Based on Experimental GNSS Data in Static Urban Canyon Environment

    Directory of Open Access Journals (Sweden)

    Yuze Wang

    2018-04-01

    A deep understanding of multipath characteristics is essential to design signal simulators and receivers in global navigation satellite system applications. As a new constellation is deployed and more applications occur in the urban environment, the statistical multipath models of navigation signals need further study. In this paper, we present statistical distribution models of multipath time delay, multipath power attenuation, and multipath fading frequency based on the experimental data in the urban canyon environment. The raw data of multipath characteristics are obtained by processing real navigation signals to study the statistical distribution. Fitting the statistical data shows that the probability distribution of time delay follows a gamma distribution, which is related to the waiting time of Poisson distributed events. The fading frequency follows an exponential distribution, and the mean of multipath power attenuation decreases linearly with an increasing time delay. In addition, the detailed statistical characteristics for satellites at different elevations and orbits are studied, and the parameters of each distribution are quite different. The research results give useful guidance for navigation simulator and receiver designers.
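
    The link between the gamma distribution and Poisson waiting times mentioned in the abstract above can be checked numerically: the waiting time until the k-th event of a Poisson process with a given rate is gamma distributed with shape k and scale 1/rate. The sketch below simulates such waiting times and recovers the parameters by the method of moments. The rate, the value of k and the fitting method are illustrative assumptions, not the authors' data or procedure.

```python
import random
from statistics import fmean, pvariance

def gamma_moments_fit(samples):
    """Method-of-moments estimates (shape, scale) for a gamma distribution."""
    mean = fmean(samples)
    var = pvariance(samples, mean)
    return mean * mean / var, var / mean

# Assumed, illustrative parameters: event rate and number of events waited for.
rate, k = 2.0, 3
rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Waiting time to the k-th event = sum of k exponential inter-event gaps.
delays = [sum(rng.expovariate(rate) for _ in range(k)) for _ in range(20000)]

shape, scale = gamma_moments_fit(delays)
print(round(shape, 2), round(scale, 3))
```

    The estimates should land near shape = 3 and scale = 0.5; with measured delays one would apply the same fit (or a maximum-likelihood fit) per elevation or orbit class.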

  8. A flexible statistics web processing service--added value for information systems for experiment data.

    Science.gov (United States)

    Heimann, Dennis; Nieschulze, Jens; König-Ries, Birgitta

    2010-04-20

    Data management in the life sciences has evolved from simple storage of data to complex information systems providing additional functionalities like analysis and visualization capabilities, demanding the integration of statistical tools. In many cases the used statistical tools are hard-coded within the system. That leads to an expensive integration, substitution, or extension of tools because all changes have to be done in program code. Other systems are using generic solutions for tool integration but adapting them to another system is mostly rather extensive work. This paper shows a way to provide statistical functionality over a statistics web service, which can be easily integrated in any information system and set up using XML configuration files. The statistical functionality is extendable by simply adding the description of a new application to a configuration file. The service architecture as well as the data exchange process between client and service and the adding of analysis applications to the underlying service provider are described. Furthermore a practical example demonstrates the functionality of the service.

  9. Infodemiological data concerning silicosis in the USA in the period 2004–2010 correlating with real-world statistical data

    Directory of Open Access Journals (Sweden)

    Nicola Luigi Bragazzi

    2017-02-01

    This article reports data concerning silicosis-related web-activities using Google Trends (GT), capturing Internet behavior in the USA for the period 2004–2010. GT-generated data were then compared with the most recent available epidemiological data of silicosis mortality obtained from the Centers for Disease Control and Prevention for the same study period. Statistically significant correlations with epidemiological data of silicosis (r=0.805, p-value <0.05) and other related web searches were found. The temporal trend correlated well with the epidemiological data, as did the geospatial distribution of the web-activities with the geographic epidemiology of silicosis.
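
    The correlation coefficient r reported above is a Pearson coefficient between two series covering the same period. The sketch below shows the computation only; the two yearly series are invented placeholders, not the Google Trends or CDC figures, and no p-value is computed.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Invented yearly values for 2004-2010 (seven points), for illustration only.
search_index = [80, 74, 70, 61, 58, 55, 50]
death_counts = [165, 160, 126, 122, 146, 120, 101]

r = pearson_r(search_index, death_counts)
print(round(r, 3))
```

    With only seven yearly points, a coefficient of this kind has a wide confidence interval, so the significance test reported in the abstract matters as much as the value of r itself.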

  10. Linear regression models and k-means clustering for statistical analysis of fNIRS data.

    Science.gov (United States)

    Bonomini, Viola; Zucchelli, Lucia; Re, Rebecca; Ieva, Francesca; Spinelli, Lorenzo; Contini, Davide; Paganoni, Anna; Torricelli, Alessandro

    2015-02-01

    We propose a new algorithm, based on a linear regression model, to statistically estimate the hemodynamic activations in fNIRS data sets. The main concern guiding the algorithm development was the minimization of assumptions and approximations made on the data set for the application of statistical tests. Further, we propose a K-means method to cluster fNIRS data (i.e. channels) as activated or not activated. The methods were validated both on simulated and in vivo fNIRS data. A time domain (TD) fNIRS technique was preferred because of its high performance in discriminating cortical activation and superficial physiological changes. However, the proposed method is also applicable to continuous wave or frequency domain fNIRS data sets.
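
    The abstract above gives no details of the regression model or of the features passed to K-means. As a rough sketch of the general idea only, the code below fits a one-regressor least-squares slope per channel against a task on/off signal and then splits the slopes into two groups with a one-dimensional 2-means. The regressor, channel names and signal values are all hypothetical.

```python
def ols_slope(x, y):
    """Least-squares slope of y regressed on x (with intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def two_means_1d(values, iters=50):
    """1-D k-means with k=2; returns True for members of the higher cluster."""
    lo, hi = min(values), max(values)  # initial centroids at the extremes
    if lo == hi:
        return [False] * len(values)
    labels = []
    for _ in range(iters):
        labels = [abs(v - hi) < abs(v - lo) for v in values]
        high = [v for v, is_hi in zip(values, labels) if is_hi]
        low = [v for v, is_hi in zip(values, labels) if not is_hi]
        new_lo, new_hi = sum(low) / len(low), sum(high) / len(high)
        if (new_lo, new_hi) == (lo, hi):
            break  # centroids stopped moving
        lo, hi = new_lo, new_hi
    return labels

# Hypothetical task regressor (0 = rest, 1 = task) and made-up channel signals.
task = [0, 0, 1, 1, 0, 0, 1, 1]
channels = {
    "ch1": [0.0, 0.1, 1.0, 1.1, 0.1, 0.0, 0.9, 1.0],
    "ch2": [0.1, 0.0, 0.1, 0.0, 0.0, 0.1, 0.0, 0.1],
    "ch3": [0.0, 0.0, 0.8, 0.9, 0.1, 0.1, 1.0, 0.9],
    "ch4": [0.05, 0.0, 0.1, 0.05, 0.0, 0.05, 0.0, 0.1],
}

names = sorted(channels)
slopes = [ols_slope(task, channels[name]) for name in names]
activated = [name for name, is_hi in zip(names, two_means_1d(slopes)) if is_hi]
print(activated)
```

    Clustering the fitted coefficients avoids picking a fixed activation cut-off by hand, which is in the spirit of the abstract's aim of minimising assumptions, although the published method may differ in every detail.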

  11. Evaluation of Solid Rocket Motor Component Data Using a Commercially Available Statistical Software Package

    Science.gov (United States)

    Stefanski, Philip L.

    2015-01-01

    Commercially available software packages today allow users to quickly perform the routine evaluations of (1) descriptive statistics to numerically and graphically summarize both sample and population data, (2) inferential statistics that draw conclusions about a given population from samples taken of it, (3) probability determinations that can be used to generate estimates of reliability allowables, and finally (4) the setup of designed experiments and analysis of their data to identify significant material and process characteristics for application in both product manufacturing and performance enhancement. This paper presents examples of analysis and experimental design work that has been conducted using Statgraphics® statistical software to obtain useful information with regard to solid rocket motor propellants and internal insulation material. Data were obtained from a number of programs (Shuttle, Constellation, and Space Launch System) and sources that include solid propellant burn rate strands, tensile specimens, sub-scale test motors, full-scale operational motors, rubber insulation specimens, and sub-scale rubber insulation analog samples. Besides facilitating the experimental design process to yield meaningful results, statistical software has demonstrated its ability to quickly perform complex data analyses and yield significant findings that might otherwise have gone unnoticed. One caveat to these successes is that useful results not only derive from the inherent power of the software package, but also from the skill and understanding of the data analyst.

  12. IMPROVING QUALITY OF STATISTICAL PROCESS CONTROL BY DEALING WITH NON‐NORMAL DATA IN AUTOMOTIVE INDUSTRY

    Directory of Open Access Journals (Sweden)

    Zuzana ANDRÁSSYOVÁ

    2012-07-01

    This study analyzes data with the aim of improving the quality of statistical tools in the assembly of automobile seats. Normal distribution of variables is one of the essential conditions for the analysis, examination, and improvement of manufacturing processes (e.g., manufacturing process capability), although there are increasingly more approaches to handling non-normal data. The probability distribution of the measured data is first tested for goodness of fit between the empirical distribution and the theoretical normal distribution by hypothesis testing in the program StatGraphics Centurion XV.II. Data are collected from the assembly process of 1st-row automobile seats for each quality characteristic (Safety Regulation, S/R) individually. The study closely examines the measured data from airbag assembly and aims to obtain normally distributed data and apply statistical process control to them. The results lead to rejection of the null hypothesis (the measured variables do not follow the normal distribution); it is therefore necessary to begin work on data transformation, supported by Minitab 15. Even this approach does not yield normally distributed data, and so a procedure should be proposed that leads to a quality output of the whole statistical control of manufacturing processes.

  13. Statistical behavior of foreshock Langmuir waves observed by the Cluster wideband data plasma wave receiver

    Directory of Open Access Journals (Sweden)

    K. Sigsbee

    2004-07-01

    We present the statistics of Langmuir wave amplitudes in the Earth's foreshock using Cluster Wideband Data (WBD) Plasma Wave Receiver electric field waveforms from spacecraft 2, 3 and 4 on 26 March 2002. The largest amplitude Langmuir waves were observed by Cluster near the boundary between the foreshock and solar wind, in agreement with earlier studies. The characteristics of the waves were similar for all three spacecraft, suggesting that variations in foreshock structure must occur on scales greater than the 50-100 km spacecraft separations. The electric field amplitude probability distributions constructed using waveforms from the Cluster WBD Plasma Wave Receiver generally followed the log-normal statistics predicted by stochastic growth theory for the event studied. Comparison with WBD receiver data from 17 February 2002, when spacecraft 4 was set in a special manual gain mode, suggests non-optimal auto-ranging of the instrument may have had some influence on the statistics.

  15. On the statistical comparison of climate model output and climate data

    International Nuclear Information System (INIS)

    Solow, A.R.

    1991-01-01

    Some broad issues arising in the statistical comparison of the output of climate models with the corresponding climate data are reviewed. Particular attention is paid to the question of detecting climate change. The purpose of this paper is to review some statistical approaches to the comparison of the output of climate models with climate data. There are many statistical issues arising in such a comparison. The author will focus on some of the broader issues, although some specific methodological questions will arise along the way. One important potential application of the approaches discussed in this paper is the detection of climate change. Although much of the discussion will be fairly general, he will try to point out the appropriate connections to the detection question. 9 refs

  16. On the statistical comparison of climate model output and climate data

    International Nuclear Information System (INIS)

    Solow, A.R.

    1990-01-01

    Some broad issues arising in the statistical comparison of the output of climate models with the corresponding climate data are reviewed. Particular attention is paid to the question of detecting climate change. The purpose of this paper is to review some statistical approaches to the comparison of the output of climate models with climate data. There are many statistical issues arising in such a comparison. The author will focus on some of the broader issues, although some specific methodological questions will arise along the way. One important potential application of the approaches discussed in this paper is the detection of climate change. Although much of the discussion will be fairly general, he will try to point out the appropriate connections to the detection question

  17. Statistical evaluation of Pacific Northwest Residential Energy Consumption Survey weather data

    Energy Technology Data Exchange (ETDEWEB)

    Tawil, J.J.

    1986-02-01

    This report addresses an issue relating to energy consumption and conservation in the residential sector. BPA has obtained two meteorological data bases for use with its 1983 Pacific Northwest Residential Energy Survey (PNWRES). One data base consists of temperature data from weather stations; these have been aggregated to form a second data base that covers the National Oceanic and Atmospheric Administration (NOAA) climatic divisions. At BPA's request, Pacific Northwest Laboratory has produced a household energy use model for both electricity and natural gas in order to determine whether the statistically estimated parameters of the model significantly differ when the two different meteorological data bases are used.

  18. Data Mining Methods Applied to Flight Operations Quality Assurance Data: A Comparison to Standard Statistical Methods

    Science.gov (United States)

    Stolzer, Alan J.; Halford, Carl

    2007-01-01

    In a previous study, multiple regression techniques were applied to Flight Operations Quality Assurance-derived data to develop parsimonious model(s) for fuel consumption on the Boeing 757 airplane. The present study examined several data mining algorithms, including neural networks, on the fuel consumption problem and compared them to the multiple regression results obtained earlier. Using regression methods, parsimonious models were obtained that explained approximately 85% of the variation in fuel flow. In general data mining methods were more effective in predicting fuel consumption. Classification and Regression Tree methods reported correlation coefficients of .91 to .92, and General Linear Models and Multilayer Perceptron neural networks reported correlation coefficients of about .99. These data mining models show great promise for use in further examining large FOQA databases for operational and safety improvements.

  19. Multivariate mixed linear model analysis of longitudinal data: an information-rich statistical technique for analyzing disease resistance data

    Science.gov (United States)

    The mixed linear model (MLM) is currently among the most advanced and flexible statistical modeling techniques and its use in tackling problems in plant pathology has begun surfacing in the literature. The longitudinal MLM is a multivariate extension that handles repeatedly measured data, such as r...

  20. MONITORING INTERNATIONAL MIGRATION FLOWS IN EUROPE - TOWARDS A STATISTICAL-DATA BASE COMBINING DATA FROM DIFFERENT SOURCES

    NARCIS (Netherlands)

    WILLEKENS, F

    1994-01-01

    The paper reviews techniques developed in demography, geography and statistics that are useful for bridging the gap between available data on international migration flows and the information required for policy making and research. The basic idea of the paper is as follows: to establish a coherent

  1. Special study for the statistical evaluation of groundwater data trends. Final report

    International Nuclear Information System (INIS)

    1993-05-01

    Analysis of trends over time in the concentrations of chemicals in groundwater at Uranium Mill Tailings Remedial Action (UMTRA) Project sites can provide valuable information for monitoring the performance of disposal cells and the effectiveness of groundwater restoration activities. Random variation in data may obscure real trends or may produce the illusion of a trend where none exists, so statistical methods are needed to reliably detect and estimate trends. Trend analysis includes both trend detection and estimation. Trend detection uses statistical hypothesis testing and provides a yes or no answer regarding the existence of a trend. Hypothesis tests try to reach a balance between false negative and false positive conclusions. To quantify the magnitude of a trend, estimation is required. This report presents the statistical concepts that are necessary for understanding trend analysis. The types of patterns most likely to occur in UMTRA data sets are emphasized. Two general approaches to analyzing data for trends are proposed and recommendations are given to assist UMTRA Project staff in selecting an appropriate method for their site data. Trend analysis is much more difficult when data contain values less than the reported laboratory detection limit. The complications that arise are explained. This report also discusses the impact of data collection procedures on statistical trend methods and offers recommendations to improve the efficiency of the methods and reduce sampling costs. Guidance for determining how many sampling rounds might be needed by statistical methods to detect trends of various magnitudes is presented. This information could be useful in planning site monitoring activities
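
    Illustrative sketch (not taken from the report): the abstract separates trend detection (a hypothesis test) from trend estimation (a magnitude). The report's two proposed approaches are not named here, so the code below assumes one common nonparametric pairing, the Mann-Kendall test for detection and the Sen slope for estimation. The function names are hypothetical, sampling rounds are assumed equally spaced, and ties and values below the detection limit are not handled.

```python
import math
from itertools import combinations
from statistics import median


def mann_kendall(values):
    """Two-sided Mann-Kendall trend test (no tie correction).

    Returns (S, z, p), where S is the number of increasing pairs minus
    the number of decreasing pairs, taken in time order.
    """
    n = len(values)
    s = sum((later > earlier) - (later < earlier)
            for earlier, later in combinations(values, 2))
    var_s = n * (n - 1) * (2 * n + 5) / 18.0
    # Normal approximation with the usual continuity correction.
    if s > 0:
        z = (s - 1) / math.sqrt(var_s)
    elif s < 0:
        z = (s + 1) / math.sqrt(var_s)
    else:
        z = 0.0
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value
    return s, z, p


def sen_slope(values):
    """Median of all pairwise slopes, in concentration units per sampling round."""
    slopes = [(values[j] - values[i]) / (j - i)
              for i, j in combinations(range(len(values)), 2)]
    return median(slopes)
```

    The test answers the yes-or-no detection question at a chosen significance level, and the slope quantifies the trend. With few sampling rounds the normal approximation is rough, which is one reason the number of rounds matters for detecting a trend.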

  2. A global approach to estimate irrigated areas - a comparison between different data and statistics

    Science.gov (United States)

    Meier, Jonas; Zabel, Florian; Mauser, Wolfram

    2018-02-01

    Agriculture is the largest global consumer of water. Irrigated areas constitute 40 % of the total area used for agricultural production (FAO, 2014a). Information on their spatial distribution is highly relevant for regional water management and food security. Spatial information on irrigation is highly important for policy and decision makers, who are facing the transition towards more efficient sustainable agriculture. However, the mapping of irrigated areas still represents a challenge for land use classifications, and existing global data sets differ strongly in their results. The following study tests an existing irrigation map based on statistics and extends the irrigated area using ancillary data. The approach processes and analyzes multi-temporal normalized difference vegetation index (NDVI) SPOT-VGT data and agricultural suitability data - both at a spatial resolution of 30 arcsec - incrementally in a multiple decision tree. It covers the period from 1999 to 2012. The results globally show an 18 % larger irrigated area than existing approaches based on statistical data. The largest differences compared to the official national statistics are found in Asia and particularly in China and India. The additional areas are mainly identified within already known irrigated regions where irrigation is more dense than previously estimated. The validation with global and regional products shows the large divergence of existing data sets with respect to size and distribution of irrigated areas caused by spatial resolution, the considered time period and the input data and assumptions made.

  3. 77 FR 65585 - Renewal of the Bureau of Labor Statistics Data Users Advisory Committee

    Science.gov (United States)

    2012-10-29

    ... the U.S. economy, including the labor, business, research, academic and government communities, on... reports, and on gaps between or the need for new Bureau statistics. The Committee will function solely as.... All committee members will have extensive research or practical experience using BLS data. The...

  4. 42 CFR 417.568 - Adequate financial records, statistical data, and cost finding.

    Science.gov (United States)

    2010-10-01

    ... this section, on the accrual method of accounting. (3) For governmental institutions that use a cash basis of accounting, cost data developed on this basis is acceptable. However, only depreciation on... definitions and accounting, statistics, and reporting practices that are widely accepted in the health care...

  5. QuantCrit: Education, Policy, "Big Data" and Principles for a Critical Race Theory of Statistics

    Science.gov (United States)

    Gillborn, David; Warmington, Paul; Demack, Sean

    2018-01-01

    Quantitative research enjoys heightened esteem among policy-makers, media, and the general public. Whereas qualitative research is frequently dismissed as subjective and impressionistic, statistics are often assumed to be objective and factual. We argue that these distinctions are wholly false; quantitative data is no less socially constructed…

  6. Statistical Literacy for Active Citizenship: A Call for Data Science Education

    Science.gov (United States)

    Engel, Joachim

    2017-01-01

    Data are abundant, quantitative information about the state of society and the wider world is around us more than ever. Paradoxically, recent trends in the public discourse point towards a post-factual world that seems content to ignore or misrepresent empirical evidence. As statistics educators we are challenged to promote understanding of…

  7. Neutron stars in the light of SKA: Data, statistics, and science

    Indian Academy of Sciences (India)

    2016-09-10

    Sep 10, 2016 ... neutron star astrophysics: Through the case studies presented here, we hope to convey the challenges involved in devising or adopting statistical methods in the light of the .... The specific tests we applied to a recent version of a glitch dataset ... model the pulse energy data, a robust multivariate method to ...

  8. The Effect of Project-Based Learning on Students' Statistical Literacy Levels for Data Representation

    Science.gov (United States)

    Koparan, Timur; Güven, Bülent

    2015-01-01

    The point of this study is to define the effect of project-based learning approach on 8th Grade secondary-school students' statistical literacy levels for data representation. To achieve this goal, a test which consists of 12 open-ended questions in accordance with the views of experts was developed. Seventy 8th grade secondary-school students, 35…

  9. Making Women Count: Gender-Typing, Technology and Path Dependencies in Dutch Statistical Data Processing

    NARCIS (Netherlands)

    van den Ende, Jan; van Oost, Elizabeth C.J.

    2001-01-01

    This article is a longitudinal analysis of the relation between gendered labour divisions and new data processing technologies at the Dutch Central Bureau of Statistics (CBS). Following social-constructivist and evolutionary economic approaches, the authors hold that the relation between technology

  10. Study of the effects of photoelectron statistics on Thomson scattering data

    International Nuclear Information System (INIS)

    Hart, G.W.; Levinton, F.M.; McNeill, D.H.

    1986-01-01

    A computer code has been developed which simulates a Thomson scattering measurement, from the counting statistics of the input channels through the mathematical analysis of the data. The scattered and background signals in each of the wavelength channels are assumed to obey Poisson statistics, and the spectral data are fitted to a Gaussian curve using a nonlinear least-squares fitting algorithm. This method goes beyond the usual calculation of the signal-to-noise ratio for the hardware and gives a quantitative measure of the effect of the noise on the final measurement. This method is applicable to Thomson scattering measurements in which the signal-to-noise ratio is low due to either low signal or high background. Thomson scattering data from the S-1 spheromak have been compared to this simulation, and they have been found to be in good agreement. This code has proven to be useful in assessing the effects of counting statistics relative to shot-to-shot variability in producing the observed spread in the data. It was also useful for designing improvements for the S-1 Thomson scattering system, and this method would be applicable to any measurement affected by counting statistics

  11. A statistical power analysis of woody carbon flux from forest inventory data

    Science.gov (United States)

    James A. Westfall; Christopher W. Woodall; Mark A. Hatfield

    2013-01-01

    At a national scale, the carbon (C) balance of numerous forest ecosystem C pools can be monitored using a stock change approach based on national forest inventory data. Given the potential influence of disturbance events and/or climate change processes, the statistical detection of changes in forest C stocks is paramount to maintaining the net sequestration status of...

  12. ODM Data Analysis-A tool for the automatic validation, monitoring and generation of generic descriptive statistics of patient data.

    Science.gov (United States)

    Brix, Tobias Johannes; Bruland, Philipp; Sarfraz, Saad; Ernsting, Jan; Neuhaus, Philipp; Storck, Michael; Doods, Justin; Ständer, Sonja; Dugas, Martin

    2018-01-01

    A required step for presenting results of clinical studies is the declaration of participants' demographic and baseline characteristics as required by the FDAAA 801. The common workflow to accomplish this task is to export the clinical data from the electronic data capture system in use and import it into statistical software such as SAS or IBM SPSS. This software requires trained users, who have to implement the analysis individually for each item. These expenditures may become an obstacle for small studies. The objective of this work is to design, implement and evaluate an open source application, called ODM Data Analysis, for the semi-automatic analysis of clinical study data. The system requires clinical data in the CDISC Operational Data Model format. After the file is uploaded, its syntax and the data type conformity of the collected data are validated. The completeness of the study data is determined and basic statistics, including illustrative charts for each item, are generated. Datasets from four clinical studies have been used to evaluate the application's performance and functionality. The system is implemented as an open source web application (available at https://odmanalysis.uni-muenster.de) and also provided as a Docker image, which enables easy distribution and installation on local systems. Study data is only stored in the application as long as the calculations are performed, which is compliant with data protection endeavors. Analysis times are below half an hour, even for larger studies with over 6000 subjects. Medical experts have confirmed the usefulness of this application for giving an overview of their collected study data for monitoring purposes and for generating descriptive statistics without further user interaction. The semi-automatic analysis has its limitations and cannot replace the complex analysis of statisticians, but it can be used as a starting point for their examination and reporting.

  13. Data on electrical energy conservation using high efficiency motors for the confidence bounds using statistical techniques.

    Science.gov (United States)

    Shaikh, Muhammad Mujtaba; Memon, Abdul Jabbar; Hussain, Manzoor

    2016-09-01

    In this article, we describe details of the data used in the research paper "Confidence bounds for energy conservation in electric motors: An economical solution using statistical techniques" [1]. The data presented in this paper is intended to show benefits of high efficiency electric motors over the standard efficiency motors of similar rating in the industrial sector of Pakistan. We explain how the data was collected and then processed by means of formulas to show cost effectiveness of energy efficient motors in terms of three important parameters: annual energy saving, cost saving and payback periods. This data can be further used to construct confidence bounds for the parameters using statistical techniques as described in [1].
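
    Illustrative sketch (not taken from the article): the abstract names three parameters, annual energy saving, cost saving and payback period, but does not reproduce the formulas. The code below assumes the textbook relations for comparing two motors of the same rating; the function names, the load factor parameter and all inputs are hypothetical.

```python
def annual_energy_saving_kwh(rated_kw, load_factor, hours_per_year,
                             eff_standard, eff_high):
    """Input energy saved per year when the same shaft output is delivered
    by the more efficient motor (efficiencies given as fractions, e.g. 0.90)."""
    shaft_energy = rated_kw * load_factor * hours_per_year
    return shaft_energy * (1.0 / eff_standard - 1.0 / eff_high)


def annual_cost_saving(energy_saving_kwh, tariff_per_kwh):
    """Cost saving per year at a flat energy tariff."""
    return energy_saving_kwh * tariff_per_kwh


def simple_payback_years(price_premium, cost_saving_per_year):
    """Years needed for the saving to repay the extra purchase price."""
    return price_premium / cost_saving_per_year
```

    Evaluating these functions over a sample of motors gives one value of each parameter per motor, which is the kind of input a confidence-bound calculation needs.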

  14. A multivariate statistical study on a diversified data gathering system for nuclear power plants

    International Nuclear Information System (INIS)

    Samanta, P.K.; Teichmann, T.; Levine, M.M.; Kato, W.Y.

    1989-02-01

    In this report, multivariate statistical methods are presented and applied to demonstrate their use in analyzing nuclear power plant operational data. For analyses of nuclear power plant events, approaches are presented for detecting malfunctions and degradations within the course of the event. At the system level, approaches are investigated as a means of diagnosis of system level performance. This involves the detection of deviations from normal performance of the system. The input data analyzed are the measurable physical parameters, such as steam generator level, pressurizer water level, auxiliary feedwater flow, etc. The study provides the methodology and illustrative examples based on data gathered from simulation of nuclear power plant transients and computer simulation of a plant system performance (due to lack of easily accessible operational data). Such an approach, once fully developed, can be used to explore statistically the detection of failure trends and patterns and prevention of conditions with serious safety implications. 33 refs., 18 figs., 9 tabs

  15. Association testing for next-generation sequencing data using score statistics

    DEFF Research Database (Denmark)

    Skotte, Line; Korneliussen, Thorfinn Sand; Albrechtsen, Anders

    2012-01-01

    computationally feasible due to the use of score statistics. As part of the joint likelihood, we model the distribution of the phenotypes using a generalized linear model framework, which works for both quantitative and discrete phenotypes. Thus, the method presented here is applicable to case-control studies...... of genotype calls into account have been proposed; most require numerical optimization which for large-scale data is not always computationally feasible. We show that using a score statistic for the joint likelihood of observed phenotypes and observed sequencing data provides an attractive approach...... to association testing for next-generation sequencing data. The joint model accounts for the genotype classification uncertainty via the posterior probabilities of the genotypes given the observed sequencing data, which gives the approach higher power than methods based on called genotypes. This strategy remains...

  16. Misuse of statistics in the interpretation of data on low-level radiation

    International Nuclear Information System (INIS)

    Hamilton, L.D.

    1982-01-01

    Four misuses of statistics in the interpretation of data of low-level radiation are reviewed: (1) post-hoc analysis and aggregation of data leading to faulty conclusions in the reanalysis of genetic effects of the atomic bomb, and premature conclusions on the Portsmouth Naval Shipyard data; (2) inappropriate adjustment for age and ignoring differences between urban and rural areas leading to potentially spurious increase in incidence of cancer at Rocky Flats; (3) hazard of summary statistics based on ill-conditioned individual rates leading to spurious association between childhood leukemia and fallout in Utah; and (4) the danger of prematurely published preliminary work with inadequate consideration of epidemiological problems - censored data - leading to inappropriate conclusions, needless alarm at the Portsmouth Naval Shipyard, and diversion of scarce research funds

  17. IEEE Std 101-1987: IEEE guide for the statistical analysis of thermal life test data

    International Nuclear Information System (INIS)

    Anon.

    1992-01-01

    This revision of IEEE Std 101-1972 describes statistical analyses for data from thermally accelerated aging tests. It explains the basis and use of statistical calculations for an engineer or scientist. Accelerated test procedures usually call for a number of specimens to be aged at each of several temperatures appreciably above normal operating temperatures. High temperatures are chosen to produce specimen failures (according to specified failure criteria) in typically one week to one year. The test objective is to determine the dependence of median life on temperature from the data, and to estimate, by extrapolation, the median life to be expected at service temperature. This guide presents methods for analyzing such data and for comparing test data on different materials
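
    Illustrative sketch (not reproduced from the guide): the stated objective is to determine the dependence of median life on temperature and to extrapolate to service temperature. The code below assumes the Arrhenius-type relationship commonly used for thermal aging, log10(life) = a + b / T with T in kelvin, fitted by ordinary least squares. The function names are hypothetical, and the guide's own procedures (for example, confidence limits) are not shown.

```python
import math


def fit_arrhenius(temps_c, lives_h):
    """Least-squares fit of log10(life) = a + b / T, with T in kelvin.

    temps_c: aging temperatures in degrees Celsius.
    lives_h: median life observed at each temperature, in hours.
    """
    x = [1.0 / (t + 273.15) for t in temps_c]
    y = [math.log10(life) for life in lives_h]
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def predicted_life_h(a, b, temp_c):
    """Median life in hours predicted by the fitted line at temp_c."""
    return 10 ** (a + b / (temp_c + 273.15))
```

    Calling predicted_life_h at a service temperature below the test range is the extrapolation step; the further the service temperature lies from the test temperatures, the more the estimate depends on the assumed relationship.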

  18. Misuse of statistics in the interpretation of data on low-level radiation

    Energy Technology Data Exchange (ETDEWEB)

    Hamilton, L.D.

    1982-01-01

    Four misuses of statistics in the interpretation of data of low-level radiation are reviewed: (1) post-hoc analysis and aggregation of data leading to faulty conclusions in the reanalysis of genetic effects of the atomic bomb, and premature conclusions on the Portsmouth Naval Shipyard data; (2) inappropriate adjustment for age and ignoring differences between urban and rural areas leading to potentially spurious increase in incidence of cancer at Rocky Flats; (3) hazard of summary statistics based on ill-conditioned individual rates leading to spurious association between childhood leukemia and fallout in Utah; and (4) the danger of prematurely published preliminary work with inadequate consideration of epidemiological problems - censored data - leading to inappropriate conclusions, needless alarm at the Portsmouth Naval Shipyard, and diversion of scarce research funds.

  19. A framework for the economic analysis of data collection methods for vital statistics.

    Science.gov (United States)

    Jimenez-Soto, Eliana; Hodge, Andrew; Nguyen, Kim-Huong; Dettrick, Zoe; Lopez, Alan D

    2014-01-01

    Over recent years there has been a strong movement towards the improvement of vital statistics and other types of health data that inform evidence-based policies. Collecting such data is not cost free. To date there is no systematic framework to guide investment decisions on methods of data collection for vital statistics or health information in general. We developed a framework to systematically assess the comparative costs and outcomes/benefits of the various data methods for collecting vital statistics. The proposed framework is four-pronged and utilises two major economic approaches to systematically assess the available data collection methods: cost-effectiveness analysis and efficiency analysis. We built a stylised example of a hypothetical low-income country to perform a simulation exercise in order to illustrate an application of the framework. Using simulated data, the results from the stylised example show that the rankings of the data collection methods are not affected by the use of either cost-effectiveness or efficiency analysis. However, the rankings are affected by how quantities are measured. There have been several calls for global improvements in collecting useable data, including vital statistics, from health information systems to inform public health policies. Ours is the first study that proposes a systematic framework to assist countries undertake an economic evaluation of DCMs. Despite numerous challenges, we demonstrate that a systematic assessment of outputs and costs of DCMs is not only necessary, but also feasible. The proposed framework is general enough to be easily extended to other areas of health information.

  20. Statistical transformation and the interpretation of inpatient glucose control data from the intensive care unit.

    Science.gov (United States)

    Saulnier, George E; Castro, Janna C; Cook, Curtiss B

    2014-05-01

    Glucose control can be problematic in critically ill patients. We evaluated the impact of statistical transformation on interpretation of intensive care unit inpatient glucose control data. Point-of-care blood glucose (POC-BG) data derived from patients in the intensive care unit for 2011 was obtained. Box-Cox transformation of POC-BG measurements was performed, and distribution of data was determined before and after transformation. Different data subsets were used to establish statistical upper and lower control limits. Exponentially weighted moving average (EWMA) control charts constructed from April, October, and November data determined whether out-of-control events could be identified differently in transformed versus nontransformed data. A total of 8679 POC-BG values were analyzed. POC-BG distributions in nontransformed data were skewed but approached normality after transformation. EWMA control charts revealed differences in projected detection of out-of-control events. In April, an out-of-control process resulting in the lower control limit being exceeded was identified at sample 116 in nontransformed data but not in transformed data. October transformed data detected an out-of-control process exceeding the upper control limit at sample 27 that was not detected in nontransformed data. Nontransformed November results remained in control, but transformation identified an out-of-control event less than 10 samples into the observation period. Using statistical methods to assess population-based glucose control in the intensive care unit could alter conclusions about the effectiveness of care processes for managing hyperglycemia. Further study is required to determine whether transformed versus nontransformed data change clinical decisions about the interpretation of care or intervention results. © 2014 Diabetes Technology Society.
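
    Illustrative sketch (not the authors' code): the two techniques named in the abstract are the Box-Cox transformation and the EWMA control chart. The code below uses their standard textbook forms. The smoothing weight, the control limit width and the function names are assumptions, since the abstract does not report the settings used; the center line and sigma would be estimated from a baseline data subset.

```python
import math


def box_cox(values, lam):
    """Box-Cox transform of strictly positive values with parameter lam."""
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1.0) / lam for v in values]


def ewma_signals(values, center, sigma, weight=0.2, width=3.0):
    """Return the 1-based sample numbers at which the EWMA statistic
    falls outside its time-varying control limits."""
    z = center  # the chart starts at the center line
    signals = []
    for t, x in enumerate(values, start=1):
        z = weight * x + (1.0 - weight) * z
        half_width = width * sigma * math.sqrt(
            weight / (2.0 - weight) * (1.0 - (1.0 - weight) ** (2 * t)))
        if abs(z - center) > half_width:
            signals.append(t)
    return signals
```

    Running ewma_signals once on the raw values and once on box_cox(values, lam), each with its own baseline center and sigma, reproduces the kind of comparison described above: a skewed distribution inflates sigma and shifts the center line, so the two charts can flag different samples.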

  1. Powerful Inference with the D-Statistic on Low-Coverage Whole-Genome Data.

    Science.gov (United States)

    Soraggi, Samuele; Wiuf, Carsten; Albrechtsen, Anders

    2018-02-02

    The detection of ancient gene flow between human populations is an important issue in population genetics. A common tool for detecting ancient admixture events is the D-statistic. The D-statistic is based on the hypothesis of a genetic relationship that involves four populations, whose correctness is assessed by evaluating specific coincidences of alleles between the groups. When working with high-throughput sequencing data, calling genotypes accurately is not always possible; therefore, the D-statistic currently samples a single base from the reads of one individual per population. This implies ignoring much of the information in the data, an issue especially striking in the case of ancient genomes. We provide a significant improvement to overcome the problems of the D-statistic by considering all reads from multiple individuals in each population. We also apply type-specific error correction to combat the problems of sequencing errors, and show a way to correct for introgression from an external population that is not part of the supposed genetic relationship, and how this leads to an estimate of the admixture rate. We prove that the D-statistic is approximated by a standard normal distribution. Furthermore, we show that our method outperforms the traditional D-statistic in detecting admixtures. The power gain is most pronounced for low and medium sequencing depth (1-10×), and performances are as good as with perfectly called genotypes at a sequencing depth of 2×. We show the reliability of error correction in scenarios with simulated errors and ancient data, and correct for introgression in known scenarios to estimate the admixture rates. Copyright © 2018 Soraggi et al.
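
    Illustrative sketch (not the authors' method): the code below computes the traditional D-statistic that the abstract takes as its starting point, with one sampled base per population per site, counted as ABBA and BABA patterns over populations (H1, H2, H3, outgroup). The extension to all reads from multiple individuals, the error correction and the significance test are not shown. The function name and the input format (equal-length sequences of bases) are assumptions.

```python
def d_statistic(h1, h2, h3, outgroup):
    """Return (D, n_abba, n_baba) from one sampled base per population per site.

    The outgroup base defines allele A at each site.
    ABBA: H1 carries A while H2 and H3 share the other allele.
    BABA: H2 carries A while H1 and H3 share the other allele.
    """
    abba = baba = 0
    for a, b, c, o in zip(h1, h2, h3, outgroup):
        if a == o and b == c and b != o:
            abba += 1
        elif b == o and a == c and a != o:
            baba += 1
    if abba + baba == 0:
        raise ValueError("no informative (ABBA or BABA) sites")
    return (abba - baba) / (abba + baba), abba, baba
```

    Under the hypothesised four-population relationship without gene flow, the two patterns are expected equally often and D is near zero; with this sign convention an excess of ABBA sites gives a positive D.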

  2. ROOT — A C++ framework for petabyte data storage, statistical analysis and visualization

    CERN Document Server

    Antcheva, I; Bellenot, B; Biskup, M; Brun, R; Buncic, N; Canal, Ph; Casadei, D; Couet, O; Fine, V; Franco, L; Ganis, G; Gheata, A; Gonzalez Maline, D; Goto, M; Iwaszkiewicz, J; Kreshuk, A; Marcos Segura, D; Maunder, R; Moneta, L; Naumann, A; Offermann, E; Onuchin, V; Panacek, S; Rademakers, F; Russo, P; Tadel, M

    2009-01-01

    ROOT is an object-oriented C++ framework conceived in the high-energy physics (HEP) community, designed for storing and analyzing petabytes of data in an efficient way. Any instance of a C++ class can be stored into a ROOT file in a machine-independent compressed binary format. In ROOT the TTree object container is optimized for statistical data analysis over very large data sets by using vertical data storage techniques. These containers can span a large number of files on local disks, the web, or a number of different shared file systems. In order to analyze this data, the user can choose from a wide set of mathematical and statistical functions, including linear algebra classes, numerical algorithms such as integration and minimization, and various methods for performing regression analysis (fitting). In particular, the RooFit package allows the user to perform complex data modeling and fitting while the RooStats library provides abstractions and implementations for advanced statistical tools. Multivariat...

  3. Study on loss detection algorithms for tank monitoring data using multivariate statistical analysis

    International Nuclear Information System (INIS)

    Suzuki, Mitsutoshi; Burr, Tom

    2009-01-01

    Evaluation of solution monitoring data to support material balance evaluation was proposed about a decade ago because of concerns regarding the large throughput planned at Rokkasho Reprocessing Plant (RRP). A numerical study using the simulation code (FACSIM) was done and significant increases in the detection probabilities (DP) for certain types of losses were shown. To be accepted internationally, it is very important to verify such claims using real solution monitoring data. However, a demonstrative study with real tank data has not been carried out due to the confidentiality of the tank data. This paper describes an experimental study that has been started using actual data from the Solution Measurement and Monitoring System (SMMS) in the Tokai Reprocessing Plant (TRP) and the Savannah River Site (SRS). Multivariate statistical methods, such as a vector cumulative sum and a multi-scale statistical analysis, have been applied to the real tank data that have superimposed simulated loss. Although quantitative conclusions have not been derived for the moment due to the difficulty of baseline evaluation, the multivariate statistical methods remain promising for abrupt and some types of protracted loss detection. (author)

  4. Procedure for statistical analysis of one-parameter discrepant experimental data

    International Nuclear Information System (INIS)

    Badikov, Sergey A.; Chechev, Valery P.

    2012-01-01

    A new, Mandel–Paule-type procedure for statistical processing of one-parameter discrepant experimental data is described. The procedure enables one to estimate the contribution of unrecognized experimental errors to the total experimental uncertainty as well as to include it in the analysis. A definition of discrepant experimental data for an arbitrary number of measurements is introduced as an accompanying result. In the case of negligible unrecognized experimental errors, the procedure simply reduces to the calculation of the weighted average and its internal uncertainty. The procedure was applied to the statistical analysis of half-life experimental data. Mean half-lives for 20 actinides were calculated and results were compared to the ENSDF and DDEP evaluations. On the whole, the calculated half-lives are consistent with the ENSDF and DDEP evaluations. However, the uncertainties calculated in this work substantially exceed the ENSDF and DDEP evaluations for discrepant experimental data. This effect can be explained by adequately taking into account unrecognized experimental errors. - Highlights: ► A new statistical procedure for processing one-parameter discrepant experimental data has been presented. ► Procedure estimates the contribution of unrecognized errors to the total experimental uncertainty. ► Procedure was applied for processing half-life discrepant experimental data. ► Results of the calculations are compared to the ENSDF and DDEP evaluations.
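
    Illustrative sketch (the classical scheme, not the authors' new variant): a Mandel-Paule-type estimate adds a common variance tau2, representing unrecognized errors, to every stated variance and chooses tau2 so that the weighted sum of squared residuals equals n - 1. When the data are already consistent, tau2 is zero and the result is the ordinary weighted average with its internal uncertainty, matching the limiting case described in the abstract. The function names and the bisection solver are assumptions.

```python
import math


def weighted_mean(values, variances):
    """Inverse-variance weighted mean and its internal uncertainty."""
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    mean = sum(w * x for w, x in zip(weights, values)) / total
    return mean, math.sqrt(1.0 / total)


def mandel_paule(values, uncertainties):
    """Return (mean, uncertainty, tau2) for the classical Mandel-Paule estimate."""
    n = len(values)

    def excess(tau2):
        # Weighted sum of squared residuals minus its expected value n - 1.
        variances = [u * u + tau2 for u in uncertainties]
        m, _ = weighted_mean(values, variances)
        return sum((x - m) ** 2 / v for x, v in zip(values, variances)) - (n - 1)

    tau2 = 0.0
    if excess(0.0) > 0.0:  # data are discrepant: solve excess(tau2) = 0
        lo, hi = 0.0, (max(values) - min(values)) ** 2
        while excess(hi) > 0.0:
            hi *= 2.0
        for _ in range(200):  # bisection; excess decreases as tau2 grows
            mid = 0.5 * (lo + hi)
            if excess(mid) > 0.0:
                lo = mid
            else:
                hi = mid
        tau2 = 0.5 * (lo + hi)
    mean, unc = weighted_mean(values, [u * u + tau2 for u in uncertainties])
    return mean, unc, tau2
```

    For discrepant inputs the returned uncertainty is larger than the internal uncertainty of the plain weighted average, which is the direction of the effect reported above for the actinide half-lives.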

  5. Bayesian statistics applied to neutron activation data for reactor flux spectrum analysis

    International Nuclear Information System (INIS)

    Chiesa, Davide; Previtali, Ezio; Sisti, Monica

    2014-01-01

    Highlights: • Bayesian statistics to analyze the neutron flux spectrum from activation data. • Rigorous statistical approach for accurate evaluation of the neutron flux groups. • Cross section and activation data uncertainties included for the problem solution. • Flexible methodology applied to analyze different nuclear reactor flux spectra. • The results are in good agreement with the MCNP simulations of neutron fluxes. - Abstract: In this paper, we present a statistical method, based on Bayesian statistics, to analyze the neutron flux spectrum from the activation data of different isotopes. The experimental data were acquired during a neutron activation experiment performed at the TRIGA Mark II reactor of Pavia University (Italy) in four irradiation positions characterized by different neutron spectra. In order to evaluate the neutron flux spectrum, subdivided into energy groups, a system of linear equations, containing the group effective cross sections and the activation rate data, has to be solved. However, since the system’s coefficients are experimental data affected by uncertainties, a rigorous statistical approach is fundamental for an accurate evaluation of the neutron flux groups. For this purpose, we applied Bayesian statistical analysis, which allows the uncertainties of the coefficients and the a priori information about the neutron flux to be included. A program for the analysis of Bayesian hierarchical models, based on Markov Chain Monte Carlo (MCMC) simulations, was used to define the statistical model of the problem and solve it. The first analysis involved the determination of the thermal, resonance-intermediate and fast flux components, and the dependence of the results on the choice of prior distribution was investigated to confirm the reliability of the Bayesian analysis. After that, the main resonances of the activation cross sections were analyzed to implement multi-group models with finer energy subdivisions that would allow to determine the

  6. Calculation of Tajima's D and other neutrality test statistics from low depth next-generation sequencing data

    DEFF Research Database (Denmark)

    Korneliussen, Thorfinn Sand; Moltke, Ida; Albrechtsen, Anders

    2013-01-01

    A number of different statistics are used for detecting natural selection using DNA sequencing data, including statistics that are summaries of the frequency spectrum, such as Tajima's D. These statistics are now often being applied in the analysis of Next Generation Sequencing (NGS) data. Howeve......, estimates of frequency spectra from NGS data are strongly affected by low sequencing coverage; the inherent technology dependent variation in sequencing depth causes systematic differences in the value of the statistic among genomic regions....
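
    For reference, the textbook form of Tajima's D mentioned in the abstract above can be sketched in a few lines of Python. This sketch assumes fully called haplotype sequences of equal length, so it does not address the low-depth sequencing problem the abstract is about; the function names and the toy sequences in the usage note are hypothetical.

```python
import math
from itertools import combinations


def segregating_sites(haplotypes):
    """Count alignment columns that carry more than one allele."""
    return sum(1 for column in zip(*haplotypes) if len(set(column)) > 1)


def mean_pairwise_differences(haplotypes):
    """Average number of differing sites over all pairs of sequences (pi)."""
    pairs = list(combinations(haplotypes, 2))
    diffs = sum(sum(a != b for a, b in zip(h1, h2)) for h1, h2 in pairs)
    return diffs / len(pairs)


def tajimas_d(n, seg_sites, pi):
    """Tajima's D from sample size n, segregating sites S and diversity pi."""
    if seg_sites == 0:
        return float("nan")  # D is undefined without segregating sites
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / (i * i) for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / (a1 * a1)
    e1 = c1 / a1
    e2 = c2 / (a1 * a1 + a2)
    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    # Numerator: difference between pi and Watterson's estimator S / a1.
    return (pi - seg_sites / a1) / math.sqrt(variance)
```

    A call such as `tajimas_d(4, segregating_sites(h), mean_pairwise_differences(h))` with `h = ["AAAA", "AAAT", "AATT", "ATTT"]` ties the three helpers together.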

  7. Sources of Safety Data and Statistical Strategies for Design and Analysis: Postmarket Surveillance.

    Science.gov (United States)

    Izem, Rima; Sanchez-Kam, Matilde; Ma, Haijun; Zink, Richard; Zhao, Yueqin

    2018-03-01

    Safety data are continuously evaluated throughout the life cycle of a medical product to accurately assess and characterize the risks associated with the product. The knowledge about a medical product's safety profile continually evolves as safety data accumulate. This paper discusses data sources and analysis considerations for safety signal detection after a medical product is approved for marketing. This manuscript is the second in a series of papers from the American Statistical Association Biopharmaceutical Section Safety Working Group. We share our recommendations for the statistical and graphical methodologies necessary to appropriately analyze, report, and interpret safety outcomes, and we discuss the advantages and disadvantages of safety data obtained from passive postmarketing surveillance systems compared to other sources. Signal detection has traditionally relied on spontaneous reporting databases that have been available worldwide for decades. However, current regulatory guidelines and ease of reporting have increased the size of these databases exponentially over the last few years. With such large databases, data-mining tools using disproportionality analysis and helpful graphics are often used to detect potential signals. Although the data sources have many limitations, analyses of these data have been successful at identifying safety signals postmarketing. Experience analyzing these dynamic data is useful in understanding the potential and limitations of analyses with new data sources such as social media, claims, or electronic medical records data.

  8. Modeling landslide susceptibility in data-scarce environments using optimized data mining and statistical methods

    Science.gov (United States)

    Lee, Jung-Hyun; Sameen, Maher Ibrahim; Pradhan, Biswajeet; Park, Hyuck-Jin

    2018-02-01

    This study evaluated the generalizability of five models to select a suitable approach for landslide susceptibility modeling in data-scarce environments. In total, 418 landslide inventories and 18 landslide conditioning factors were analyzed. Multicollinearity and factor optimization were investigated before data modeling, and two experiments were then conducted. In each experiment, five susceptibility maps were produced based on support vector machine (SVM), random forest (RF), weight-of-evidence (WoE), ridge regression (Rid_R), and robust regression (RR) models. The highest accuracy (AUC = 0.85) was achieved with the SVM model when either the full or limited landslide inventories were used. Furthermore, the RF and WoE models were severely affected when fewer landslide samples were used for training. The other models were affected slightly when the training samples were limited.

  9. Uncertainty analysis of reactor safety systems with statistically correlated failure data

    International Nuclear Information System (INIS)

    Dezfuli, H.; Modarres, M.

    1985-01-01

    The probability of occurrence of the top event of a fault tree is estimated from failure probability of components that constitute the fault tree. Component failure probabilities are subject to statistical uncertainties. In addition, there are cases where the failure data are statistically correlated. Most fault tree evaluations have so far been based on uncorrelated component failure data. The subject of this paper is the description of a method of assessing the probability intervals for the top event failure probability of fault trees when component failure data are statistically correlated. To estimate the mean and variance of the top event, a second-order system moment method is presented through Taylor series expansion, which provides an alternative to the normally used Monte-Carlo method. For cases where component failure probabilities are statistically correlated, the Taylor expansion terms are treated properly. A moment matching technique is used to obtain the probability distribution function of the top event through fitting a Johnson S_B distribution. The computer program (CORRELATE) was developed to perform the calculations necessary for the implementation of the method developed. The CORRELATE code is very efficient and consumes minimal computer time. This is primarily because it does not employ the time-consuming Monte-Carlo method. (author)
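
    To make the Taylor-series idea concrete, the sketch below propagates the moments of two correlated component failure probabilities through a single AND gate, T = p1 * p2. It is a deliberately reduced example and not the CORRELATE code: the mean uses the second-order Taylor term, the variance keeps only the first-order terms, and the function name and parameters are assumptions.

```python
def and_gate_moments(mu1, sigma1, mu2, sigma2, rho=0.0):
    """Approximate mean and variance of T = p1 * p2 for correlated p1, p2.

    mu/sigma are the means and standard deviations of the component failure
    probabilities; rho is their correlation coefficient.
    """
    cov = rho * sigma1 * sigma2
    # Second-order term: 0.5 * sum_ij (d2T/dpi dpj) * Cov_ij reduces to Cov(p1, p2).
    mean = mu1 * mu2 + cov
    # First-order variance: gradient (mu2, mu1) applied to the covariance matrix.
    variance = (mu2 * sigma1) ** 2 + (mu1 * sigma2) ** 2 + 2.0 * mu1 * mu2 * cov
    return mean, variance
```

    Setting `rho=0.0` recovers the uncorrelated case, so comparing the two calls shows how much a positive correlation raises both the mean and the variance of the top event.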

  10. mapDIA: Preprocessing and statistical analysis of quantitative proteomics data from data independent acquisition mass spectrometry.

    Science.gov (United States)

    Teo, Guoshou; Kim, Sinae; Tsou, Chih-Chiang; Collins, Ben; Gingras, Anne-Claude; Nesvizhskii, Alexey I; Choi, Hyungwon

    2015-11-03

    Data independent acquisition (DIA) mass spectrometry is an emerging technique that offers more complete detection and quantification of peptides and proteins across multiple samples. DIA allows fragment-level quantification, which can be considered as repeated measurements of the abundance of the corresponding peptides and proteins in the downstream statistical analysis. However, few statistical approaches are available for aggregating these complex fragment-level data into peptide- or protein-level statistical summaries. In this work, we describe a software package, mapDIA, for statistical analysis of differential protein expression using DIA fragment-level intensities. The workflow consists of three major steps: intensity normalization, peptide/fragment selection, and statistical analysis. First, mapDIA offers normalization of fragment-level intensities by total intensity sums as well as a novel alternative normalization by local intensity sums in retention time space. Second, mapDIA removes outlier observations and selects peptides/fragments that preserve the major quantitative patterns across all samples for each protein. Last, using the selected fragments and peptides, mapDIA performs model-based statistical significance analysis of protein-level differential expression between specified groups of samples. Using a comprehensive set of simulation datasets, we show that mapDIA detects differentially expressed proteins with accurate control of the false discovery rates. We also describe the analysis procedure in detail using two recently published DIA datasets generated for 14-3-3β dynamic interaction network and prostate cancer glycoproteome. The software was written in C++ and the source code is available for free through the SourceForge website http://sourceforge.net/projects/mapdia/. This article is part of a Special Issue entitled: Computational Proteomics. Copyright © 2015 Elsevier B.V. All rights reserved.

  11. A knowledge-based T2-statistic to perform pathway analysis for quantitative proteomic data.

    Science.gov (United States)

    Lai, En-Yu; Chen, Yi-Hau; Wu, Kun-Pin

    2017-06-01

    Approaches to identify significant pathways from high-throughput quantitative data have been developed in recent years. Still, the analysis of proteomic data remains difficult because of limited sample size. This limitation also leads to the practice of using a competitive null as a common approach, which fundamentally treats genes or proteins as independent units. The independence assumption ignores the associations among biomolecules with similar functions or cellular localization, as well as the interactions among them manifested as changes in expression ratios. Consequently, these methods often underestimate the associations among biomolecules and cause false positives in practice. Some studies incorporate the sample covariance matrix into the calculation to address this issue. However, sample covariance may not be a precise estimation if the sample size is very limited, which is usually the case for the data produced by mass spectrometry. In this study, we introduce a multivariate test under a self-contained null to perform pathway analysis for quantitative proteomic data. The covariance matrix used in the test statistic is constructed by the confidence scores retrieved from the STRING database or the HitPredict database. We also design an integrating procedure to retain pathways of sufficient evidence as a pathway group. The performance of the proposed T2-statistic is demonstrated using five published experimental datasets: the T-cell activation, the cAMP/PKA signaling, the myoblast differentiation, and the effect of dasatinib on the BCR-ABL pathway are proteomic datasets produced by mass spectrometry; and the protective effect of myocilin via the MAPK signaling pathway is a gene expression dataset of limited sample size. Compared with other popular statistics, the proposed T2-statistic yields more accurate descriptions in agreement with the discussion of the original publication. We implemented the T2-statistic in an R package, T2GA, which is available at https

  12. Statistical Analysis for High-Dimensional Data : The Abel Symposium 2014

    CERN Document Server

    Bühlmann, Peter; Glad, Ingrid; Langaas, Mette; Richardson, Sylvia; Vannucci, Marina

    2016-01-01

    This book features research contributions from The Abel Symposium on Statistical Analysis for High Dimensional Data, held in Nyvågar, Lofoten, Norway, in May 2014. The focus of the symposium was on statistical and machine learning methodologies specifically developed for inference in “big data” situations, with particular reference to genomic applications. The contributors, who are among the most prominent researchers on the theory of statistics for high dimensional inference, present new theories and methods, as well as challenging applications and computational solutions. Specific themes include, among others, variable selection and screening, penalised regression, sparsity, thresholding, low dimensional structures, computational challenges, non-convex situations, learning graphical models, sparse covariance and precision matrices, semi- and non-parametric formulations, multiple testing, classification, factor models, clustering, and preselection. Highlighting cutting-edge research and casting light on...

  13. Review of Naked Statistics: Stripping the Dread from Data by Charles Wheelan

    Directory of Open Access Journals (Sweden)

    Michael T. Catalano

    2015-01-01

    Wheelan, Charles. Naked Statistics: Stripping the Dread from Data (New York, NY, W. W. Norton & Company, 2014). 282 pp. ISBN 978-0-393-07195-5. In his review of What Numbers Say and The Numbers Game, Rob Root (Numeracy 3(1): 9) writes “Popular books on quantitative literacy need to be easy to read, reasonably comprehensive in scope, and include examples that are thought-provoking and memorable.” Wheelan’s book certainly meets this description, and should be of interest to both the general public and those with a professional interest in numeracy. A moderately diligent learner can get a decent understanding of basic statistics from the book. Teachers of statistics and quantitative literacy will find a wealth of well-related examples and stories to use in their classes.

  14. Precipitate statistics in an Al-Mg-Si-Cu alloy from scanning precession electron diffraction data

    Science.gov (United States)

    Sunde, J. K.; Paulsen, Ø.; Wenner, S.; Holmestad, R.

    2017-09-01

    The key microstructural feature providing strength to age-hardenable Al alloys is nanoscale precipitates. Alloy development requires a reliable statistical assessment of these precipitates, in order to link the microstructure with material properties. Here, it is demonstrated that scanning precession electron diffraction combined with computational analysis enable the semi-automated extraction of precipitate statistics in an Al-Mg-Si-Cu alloy. Among the main findings is the precipitate number density, which agrees well with a conventional method based on manual counting and measurements. By virtue of its data analysis objectivity, our methodology is therefore seen as an advantageous alternative to existing routines, offering reproducibility and efficiency in alloy statistics. Additional results include improved qualitative information on phase distributions. The developed procedure is generic and applicable to any material containing nanoscale precipitates.

  15. Mourning dove hunting regulation strategy based on annual harvest statistics and banding data

    Science.gov (United States)

    Otis, D.L.

    2006-01-01

    Although managers should strive to base game bird harvest management strategies on mechanistic population models, monitoring programs required to build and continuously update these models may not be in place. Alternatively, if estimates of total harvest and harvest rates are available, then population estimates derived from these harvest data can serve as the basis for making hunting regulation decisions based on population growth rates derived from these estimates. I present a statistically rigorous approach for regulation decision-making using a hypothesis-testing framework and an assumed framework of 3 hunting regulation alternatives. I illustrate and evaluate the technique with historical data on the mid-continent mallard (Anas platyrhynchos) population. I evaluate the statistical properties of the hypothesis-testing framework using the best available data on mourning doves (Zenaida macroura). I use these results to discuss practical implementation of the technique as an interim harvest strategy for mourning doves until reliable mechanistic population models and associated monitoring programs are developed.

  16. Statistical analysis of solid waste composition data: Arithmetic mean, standard deviation and correlation coefficients

    DEFF Research Database (Denmark)

    Edjabou, Maklawe Essonanawe; Martín-Fernández, Josep Antoni; Scheutz, Charlotte

    2017-01-01

    -derived food waste amounted to 2.21 ± 3.12% with a confidence interval of (−4.03; 8.45), which highlights the problem of the biased negative proportions. A Pearson’s correlation test, applied to waste fraction generation (kg mass), indicated a positive correlation between avoidable vegetable food waste...... and plastic packaging. However, correlation tests applied to waste fraction compositions (percentage values) showed a negative association in this regard, thus demonstrating that statistical analyses applied to compositional waste fraction data, without addressing the closed characteristics of these data......, have the potential to generate spurious or misleading results. Therefore, compositional data should be transformed adequately prior to any statistical analysis, such as computing mean, standard deviation and correlation coefficients....
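
    The abstract above does not say which transformation the authors applied. One common choice for closed (percentage) data is the centered log-ratio (clr) transform, sketched below in Python as an assumption rather than as the authors' method; it requires strictly positive parts, and the function names are hypothetical.

```python
import math


def closure(parts):
    """Rescale strictly positive parts so that they sum to 1."""
    total = sum(parts)
    return [p / total for p in parts]


def clr(parts):
    """Centered log-ratio transform: log of each part minus the mean log."""
    logs = [math.log(p) for p in closure(parts)]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]
```

    Means, standard deviations and correlations are then computed on the transformed values, which are unbounded and therefore cannot produce intervals such as the negative percentage quoted above.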

  17. Statistical analysis of proteomics, metabolomics, and lipidomics data using mass spectrometry

    CERN Document Server

    Mertens, Bart

    2017-01-01

    This book presents an overview of computational and statistical design and analysis of mass spectrometry-based proteomics, metabolomics, and lipidomics data. This contributed volume provides an introduction to the special aspects of statistical design and analysis with mass spectrometry data for the new omic sciences. The text discusses common aspects of design and analysis between and across all (or most) forms of mass spectrometry, while also providing special examples of application with the most common forms of mass spectrometry. Also covered are applications of computational mass spectrometry not only in clinical study but also in the interpretation of omics data in plant biology studies. Omics research fields are expected to revolutionize biomolecular research by the ability to simultaneously profile many compounds within either patient blood, urine, tissue, or other biological samples. Mass spectrometry is one of the key analytical techniques used in these new omic sciences. Liquid chromatography mass ...

  18. Equipment Maintenance management support system based on statistical analysis of maintenance history data

    International Nuclear Information System (INIS)

    Shimizu, S.; Ando, Y.; Morioka, T.

    1990-01-01

    Plant maintenance has recently become more important with the increase in the number of nuclear power stations and in plant operating time. Various kinds of requirements for plant maintenance, such as countermeasures for equipment degradation and saving maintenance costs while keeping up plant reliability and productivity, are proposed. For this purpose, plant maintenance programs should be improved based on equipment reliability estimated from field data. In order to meet these requirements, it is planned to develop an equipment maintenance management support system for nuclear power plants based on statistical analysis of equipment maintenance history data. The main difference between this proposed new method and current similar methods is that it evaluates not only failure data but also maintenance data, which include normal termination data and data on some degree of degradation or functional disorder of equipment and parts. It is therefore possible to use these field data to improve maintenance schedules and to evaluate actual equipment and parts reliability under the current maintenance schedule. In the present paper, the authors show the objectives of this system, an outline of this system and its functions, and the basic technique for collecting and managing maintenance history data for statistical analysis. It is shown, from the results of feasibility tests using simulation data of maintenance history, that this system has the ability to provide useful information for maintenance and design enhancement.

  19. Steam Generator Group Project. Progress report on data acquisition/statistical analysis

    International Nuclear Information System (INIS)

    Doctor, P.G.; Buchanan, J.A.; McIntyre, J.M.; Hof, P.J.; Ercanbrack, S.S.

    1984-01-01

    A major task of the Steam Generator Group Project (SGGP) is to establish the reliability of the eddy current inservice inspections of PWR steam generator tubing, by comparing the eddy current data to the actual physical condition of the tubes via destructive analyses. This report describes the plans for the computer systems needed to acquire, store and analyze the diverse data to be collected during the project. The real-time acquisition of the baseline eddy current inspection data will be handled using a specially designed data acquisition computer system based on a Digital Equipment Corporation (DEC) PDP-11/44. The data will be archived in digital form for use after the project is completed. Data base management and statistical analyses will be done on a DEC VAX-11/780. Color graphics will be heavily used to summarize the data and the results of the analyses. The report describes the data that will be taken during the project and the statistical methods that will be used to analyze the data. 7 figures, 2 tables

  20. A weighted U-statistic for genetic association analyses of sequencing data.

    Science.gov (United States)

    Wei, Changshuai; Li, Ming; He, Zihuai; Vsevolozhskaya, Olga; Schaid, Daniel J; Lu, Qing

    2014-12-01

    With advancements in next-generation sequencing technology, a massive amount of sequencing data is generated, which offers a great opportunity to comprehensively investigate the role of rare variants in the genetic etiology of complex diseases. Nevertheless, the high-dimensional sequencing data poses a great challenge for statistical analysis. The association analyses based on traditional statistical methods suffer substantial power loss because of the low frequency of genetic variants and the extremely high dimensionality of the data. We developed a Weighted U Sequencing test, referred to as WU-SEQ, for the high-dimensional association analysis of sequencing data. Based on a nonparametric U-statistic, WU-SEQ makes no assumption of the underlying disease model and phenotype distribution, and can be applied to a variety of phenotypes. Through simulation studies and an empirical study, we showed that WU-SEQ outperformed a commonly used sequence kernel association test (SKAT) method when the underlying assumptions were violated (e.g., the phenotype followed a heavy-tailed distribution). Even when the assumptions were satisfied, WU-SEQ still attained comparable performance to SKAT. Finally, we applied WU-SEQ to sequencing data from the Dallas Heart Study (DHS), and detected an association between ANGPTL 4 and very low density lipoprotein cholesterol. © 2014 WILEY PERIODICALS, INC.

  1. Robust functional statistics applied to Probability Density Function shape screening of sEMG data.

    Science.gov (United States)

    Boudaoud, S; Rix, H; Al Harrach, M; Marin, F

    2014-01-01

    Recent studies pointed out possible shape modifications of the Probability Density Function (PDF) of surface electromyographical (sEMG) data according to several contexts like fatigue and muscle force increase. Following this idea, criteria have been proposed to monitor these shape modifications mainly using High Order Statistics (HOS) parameters like skewness and kurtosis. In experimental conditions, these parameters are confronted with small sample size in the estimation process. This small sample size induces errors in the estimated HOS parameters restraining real-time and precise sEMG PDF shape monitoring. Recently, a functional formalism, the Core Shape Model (CSM), has been used to analyse shape modifications of PDF curves. In this work, taking inspiration from CSM method, robust functional statistics are proposed to emulate both skewness and kurtosis behaviors. These functional statistics combine both kernel density estimation and PDF shape distances to evaluate shape modifications even in presence of small sample size. Then, the proposed statistics are tested, using Monte Carlo simulations, on both normal and Log-normal PDFs that mimic observed sEMG PDF shape behavior during muscle contraction. According to the obtained results, the functional statistics seem to be more robust than HOS parameters to small sample size effect and more accurate in sEMG PDF shape screening applications.
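
    For orientation, the two High Order Statistics parameters that the abstract above takes as its baseline can be computed from sample central moments as in the following Python sketch. It shows only the conventional moment estimators, not the proposed functional statistics, and the function names are assumptions.

```python
def central_moment(data, order):
    """Sample central moment of the given order (divides by n)."""
    mean = sum(data) / len(data)
    return sum((x - mean) ** order for x in data) / len(data)


def skewness(data):
    """Third standardized moment; 0 for a symmetric sample."""
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5


def excess_kurtosis(data):
    """Fourth standardized moment minus 3; 0 for a normal distribution."""
    return central_moment(data, 4) / central_moment(data, 2) ** 2 - 3.0
```

    Both estimators raise deviations to the third or fourth power, which is why a handful of extreme samples can dominate them when the sample is small.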

  2. Error analysis of terrestrial laser scanning data by means of spherical statistics and 3D graphs.

    Science.gov (United States)

    Cuartero, Aurora; Armesto, Julia; Rodríguez, Pablo G; Arias, Pedro

    2010-01-01

    This paper presents a complete analysis of the positional errors of terrestrial laser scanning (TLS) data based on spherical statistics and 3D graphs. Spherical statistics are preferred because of the 3D vectorial nature of the spatial error. Error vectors have three metric elements (one module and two angles) that were analyzed by spherical statistics. A case study is presented and discussed in detail. Errors were calculated using 53 check points (CP), and CP coordinates were measured by a digitizer with submillimetre accuracy. The positional accuracy was analyzed by both the conventional method (modular errors analysis) and the proposed method (angular errors analysis) by 3D graphics and numerical spherical statistics. Two packages in the R programming language were used to obtain graphics automatically. The results indicated that the proposed method is advantageous as it offers a more complete analysis of the positional accuracy, such as angular error component, uniformity of the vector distribution, error isotropy, and error, in addition to the modular error component by linear statistics.
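
    The decomposition of an error vector into one module and two angles, and a basic spherical summary of a set of such vectors, can be sketched as follows. The angle convention (colatitude measured from the +z axis, longitude in the x-y plane) and the function names are assumptions for this sketch; the abstract does not state which convention the authors used, and the authors worked in R rather than Python.

```python
import math


def to_spherical(dx, dy, dz):
    """Split a non-zero error vector into (module, colatitude, longitude) in radians."""
    module = math.sqrt(dx * dx + dy * dy + dz * dz)
    # Clamp guards against rounding pushing the ratio just outside [-1, 1].
    colatitude = math.acos(max(-1.0, min(1.0, dz / module)))
    longitude = math.atan2(dy, dx)
    return module, colatitude, longitude


def mean_resultant_length(vectors):
    """Length of the mean unit vector: 1 for identical directions, near 0 if scattered."""
    sx = sy = sz = 0.0
    for dx, dy, dz in vectors:
        module = math.sqrt(dx * dx + dy * dy + dz * dz)
        sx += dx / module
        sy += dy / module
        sz += dz / module
    return math.sqrt(sx * sx + sy * sy + sz * sz) / len(vectors)
```

    A mean resultant length close to 1 indicates that the errors share a preferred direction, whereas a value close to 0 is consistent with directions that cancel out.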

  3. BIG-DATA and the Challenges for Statistical Inference and Economics Teaching and Learning

    Directory of Open Access Journals (Sweden)

    J.L. Peñaloza Figueroa

    2017-04-01

    The increasing automation of data collection, in either structured or unstructured formats, together with the development of reading, concatenation and comparison algorithms and the growing analytical skills that characterize the era of Big Data, cannot be considered only a technological achievement; it is also an organizational, methodological and analytical challenge for knowledge, which is necessary to generate opportunities and added value. In fact, exploiting the potential of Big Data involves all fields of community activity; and given its ability to extract behaviour patterns, we are interested in the challenges for the field of teaching and learning, particularly in the field of statistical inference and economic theory. Big Data can improve the understanding of concepts, models and techniques used in both statistical inference and economic theory, and it can also generate reliable and robust short- and long-term predictions. These facts have led to a demand for analytical capabilities, which in turn encourages teachers and students to demand access to the massive information produced by individuals, companies and public and private organizations in their transactions and inter-relationships. Mass data (Big Data) is changing the way people access, understand and organize knowledge, which in turn is causing a shift in the approach to statistics and economics teaching, considering them as a real way of thinking rather than just operational and technical disciplines. Hence, the question is how teachers can use automated collection and analytical skills to their advantage when teaching statistics and economics, and whether this will lead to a change in what is taught and how it is taught.

  4. Pilot points method for conditioning multiple-point statistical facies simulation on flow data

    Science.gov (United States)

    Ma, Wei; Jafarpour, Behnam

    2018-05-01

    We propose a new pilot points method for conditioning discrete multiple-point statistical (MPS) facies simulation on dynamic flow data. While conditioning MPS simulation on static hard data is straightforward, its calibration against nonlinear flow data is nontrivial. The proposed method generates conditional models from a conceptual model of geologic connectivity, known as a training image (TI), by strategically placing and estimating pilot points. To place pilot points, a score map is generated based on three sources of information: (i) the uncertainty in facies distribution, (ii) the model response sensitivity information, and (iii) the observed flow data. Once the pilot points are placed, the facies values at these points are inferred from production data and then are used, along with available hard data at well locations, to simulate a new set of conditional facies realizations. While facies estimation at the pilot points can be performed using different inversion algorithms, in this study the ensemble smoother (ES) is adopted to update permeability maps from production data, which are then used to statistically infer facies types at the pilot point locations. The developed method combines the information in the flow data and the TI by using the former to infer facies values at selected locations away from the wells and the latter to ensure consistent facies structure and connectivity away from measurement locations. Several numerical experiments are used to evaluate the performance of the developed method and to discuss its important properties.

  5. Vector-field statistics for the analysis of time varying clinical gait data.

    Science.gov (United States)

    Donnelly, C J; Alexander, C; Pataky, T C; Stannage, K; Reid, S; Robinson, M A

    2017-01-01

    In clinical settings, the time varying analysis of gait data relies heavily on the experience of the individual(s) assessing these biological signals. Though three dimensional kinematics are recognised as time varying waveforms (1D), exploratory statistical analysis of these data is commonly carried out with multiple discrete or 0D dependent variables. In the absence of an a priori 0D hypothesis, clinicians are at risk of making type I and II errors in their analysis of time varying gait signatures in the event statistics are used in concert with preferred subjective clinical assessment methods. The aim of this communication was to determine if vector field waveform statistics were capable of providing quantitative corroboration to practically significant differences in time varying gait signatures as determined by two clinically trained gait experts. The case study was a left hemiplegic Cerebral Palsy (GMFCS I) gait patient following a botulinum toxin (BoNT-A) injection to their left gastrocnemius muscle. When comparing subjective clinical gait assessments between two testers, they were in agreement with each other for 61% of the joint degrees of freedom and phases of motion analysed. Tester 1 and tester 2 were in agreement with the vector-field analysis for 78% and 53% of the kinematic variables analysed, respectively. When the subjective analyses of tester 1 and tester 2 were pooled together and then compared to the vector-field analysis, they were in agreement for 83% of the time varying kinematic variables analysed. These outcomes demonstrate that, in principle, vector-field statistics corroborate what a team of clinical gait experts would classify as practically meaningful pre- versus post time varying kinematic differences. The potential for vector-field statistics to be used as a useful clinical tool for the objective analysis of time varying clinical gait data is established. Future research is recommended to assess the usefulness of vector-field analyses

  6. Statistical methods to detect novel genetic variants using publicly available GWAS summary data.

    Science.gov (United States)

    Guo, Bin; Wu, Baolin

    2018-03-01

    We propose statistical methods to detect novel genetic variants using only genome-wide association studies (GWAS) summary data without access to raw genotype and phenotype data. With more and more summary data being posted for public access in the post-GWAS era, the proposed methods are practically very useful to identify additional interesting genetic variants and shed light on the underlying disease mechanism. We illustrate the utility of our proposed methods with application to GWAS meta-analysis results of fasting glucose from the international MAGIC consortium. We found several novel genome-wide significant loci that are worth further study. Copyright © 2018 Elsevier Ltd. All rights reserved.

  7. A test statistic in the complex Wishart distribution and its application to change detection in polarimetric SAR data

    DEFF Research Database (Denmark)

    Conradsen, Knut; Nielsen, Allan Aasbjerg; Schou, Jesper

    2003-01-01

    . Based on this distribution, a test statistic for equality of two such matrices and an associated asymptotic probability for obtaining a smaller value of the test statistic are derived and applied successfully to change detection in polarimetric SAR data. In a case study, EMISAR L-band data from April 17...... to HH, VV, or HV data alone, the derived test statistic reduces to the well-known gamma likelihood-ratio test statistic. The derived test statistic and the associated significance value can be applied as a line or edge detector in fully polarimetric SAR data also....

  8. Manipulating measurement scales in medical statistical analysis and data mining: A review of methodologies

    Directory of Open Access Journals (Sweden)

    Hamid Reza Marateb

    2014-01-01

    Background: selecting the correct statistical test and data mining method depends highly on the measurement scale of data, type of variables, and purpose of the analysis. Different measurement scales are studied in detail, and statistical comparison, modeling, and data mining methods are studied using several medical examples. We have presented two ordinal-variables clustering examples, as a more challenging variable type in analysis, using Wisconsin Breast Cancer Data (WBCD). Ordinal-to-Interval scale conversion example: a breast cancer database of nine 10-level ordinal variables for 683 patients was analyzed by two ordinal-scale clustering methods. The performance of the clustering methods was assessed by comparison with the gold standard groups of malignant and benign cases that had been identified by clinical tests. Results: the sensitivity and accuracy of the two clustering methods were 98% and 96%, respectively. Their specificity was comparable. Conclusion: by using an appropriate clustering algorithm based on the measurement scale of the variables in the study, high performance is achieved. Moreover, descriptive and inferential statistics in addition to the modeling approach must be selected based on the scale of the variables.

  9. Manipulating measurement scales in medical statistical analysis and data mining: A review of methodologies

    Science.gov (United States)

    Marateb, Hamid Reza; Mansourian, Marjan; Adibi, Peyman; Farina, Dario

    2014-01-01

    Background: selecting the correct statistical test and data mining method depends highly on the measurement scale of data, type of variables, and purpose of the analysis. Different measurement scales are studied in detail, and statistical comparison, modeling, and data mining methods are studied using several medical examples. We have presented two ordinal-variables clustering examples, as a more challenging variable type in analysis, using Wisconsin Breast Cancer Data (WBCD). Ordinal-to-Interval scale conversion example: a breast cancer database of nine 10-level ordinal variables for 683 patients was analyzed by two ordinal-scale clustering methods. The performance of the clustering methods was assessed by comparison with the gold standard groups of malignant and benign cases that had been identified by clinical tests. Results: the sensitivity and accuracy of the two clustering methods were 98% and 96%, respectively. Their specificity was comparable. Conclusion: by using an appropriate clustering algorithm based on the measurement scale of the variables in the study, high performance is achieved. Moreover, descriptive and inferential statistics in addition to the modeling approach must be selected based on the scale of the variables. PMID:24672565

  10. Visualization of time series statistical data by shape analysis (GDP ratio changes among Asia countries)

    Science.gov (United States)

    Shirota, Yukari; Hashimoto, Takako; Fitri Sari, Riri

    2018-03-01

    Visualizing time series big data has become very important. In the paper we discuss a new analysis method called “statistical shape analysis” or “geometry driven statistics” applied to time series statistical data in economics. We analyse the changes in agriculture, value added and industry, value added (percentage of GDP) from 2000 to 2010 in Asia. We handle the data as a set of landmarks on a two-dimensional image to see the deformation using the principal components. The key point of the analysis method is the principal components of the given formation, which are eigenvectors of its bending energy matrix. The local deformation can be expressed as a set of non-affine transformations. The transformations give us information about the local differences between 2000 and 2010. Because the non-affine transformation can be decomposed into a set of partial warps, we present the partial warps visually. Statistical shape analysis is widely used in biology but, in economics, no application can be found. In the paper, we investigate its potential for analysing economic data.

  11. Statistical comparisons of Savannah River anemometer data applied to quality control of instrument networks

    International Nuclear Information System (INIS)

    Porch, W.M.; Dickerson, M.H.

    1976-08-01

    Continuous monitoring of extensive meteorological instrument arrays is a requirement in the study of important mesoscale atmospheric phenomena. The phenomena include pollution transport prediction from continuous area sources, or one time releases of toxic materials and wind energy prospecting in areas of topographic enhancement of the wind. Quality control techniques that can be applied to these data to determine if the instruments are operating within their prescribed tolerances were investigated. Savannah River Plant data were analyzed with both independent and comparative statistical techniques. The independent techniques calculate the mean, standard deviation, moments about the mean, kurtosis, skewness, probability density distribution, cumulative probability and power spectra. The comparative techniques include covariance, cross-spectral analysis and two dimensional probability density. At present the calculating and plotting routines for these statistical techniques do not reside in a single code so it is difficult to ascribe independent memory size and computation time accurately. However, given the flexibility of a data system which includes simple and fast running statistics at the instrument end of the data network (ASF) and more sophisticated techniques at the computational end (ACF) a proper balance will be attained. These techniques are described in detail and preliminary results are presented

  12. Study design and statistical analysis of data in human population studies with the micronucleus assay.

    Science.gov (United States)

    Ceppi, Marcello; Gallo, Fabio; Bonassi, Stefano

    2011-01-01

    The most common study design in population studies based on the micronucleus (MN) assay is the cross-sectional study, which is largely performed to evaluate the DNA damaging effects of exposure to genotoxic agents in the workplace, in the environment, as well as from diet or lifestyle factors. Sample size is still a critical issue in the design of MN studies, since most recent studies considering gene-environment interaction often require a sample size of several hundred subjects, which is in many cases difficult to achieve. The control of confounding is another major threat to the validity of causal inference. The most popular confounders considered in population studies using MN are age, gender and smoking habit. Extensive attention is given to the assessment of effect modification, given the increasing inclusion of biomarkers of genetic susceptibility in the study design. Selected issues concerning the statistical treatment of data have been addressed in this mini-review, starting from data description, which is a critical step of statistical analysis, since it allows possible errors in the dataset to be detected and the validity of assumptions required for more complex analyses to be checked. Basic issues dealing with statistical analysis of biomarkers are extensively evaluated, including methods to explore the dose-response relationship between two continuous variables and inferential analysis. A critical approach to the use of parametric and non-parametric methods is presented, before addressing the issue of the most suitable multivariate models to fit MN data. In the last decade, the quality of statistical analysis of MN data has certainly evolved, although even nowadays only a small number of studies apply the Poisson model, which is the most suitable method for the analysis of MN data.

  13. Missing data imputation using statistical and machine learning methods in a real breast cancer problem.

    Science.gov (United States)

    Jerez, José M; Molina, Ignacio; García-Laencina, Pedro J; Alba, Emilio; Ribelles, Nuria; Martín, Miguel; Franco, Leonardo

    2010-10-01

    Missing data imputation is an important task in cases where it is crucial to use all available data and not discard records with missing values. This work evaluates the performance of several statistical and machine learning imputation methods that were used to predict recurrence in patients in an extensive real breast cancer data set. Imputation methods based on statistical techniques, e.g., mean, hot-deck and multiple imputation, and machine learning techniques, e.g., multi-layer perceptron (MLP), self-organisation maps (SOM) and k-nearest neighbour (KNN), were applied to data collected through the "El Álamo-I" project, and the results were then compared to those obtained from the listwise deletion (LD) imputation method. The database includes demographic, therapeutic and recurrence-survival information from 3679 women with operable invasive breast cancer diagnosed in 32 different hospitals belonging to the Spanish Breast Cancer Research Group (GEICAM). The accuracies of predictions on early cancer relapse were measured using artificial neural networks (ANNs), in which different ANNs were estimated using the data sets with imputed missing values. The imputation methods based on machine learning algorithms outperformed imputation statistical methods in the prediction of patient outcome. Friedman's test revealed a significant difference (p=0.0091) in the observed area under the ROC curve (AUC) values, and the pairwise comparison test showed that the AUCs for MLP, KNN and SOM were significantly higher (p=0.0053, p=0.0048 and p=0.0071, respectively) than the AUC from the LD-based prognosis model. The methods based on machine learning techniques were the most suited for the imputation of missing values and led to a significant enhancement of prognosis accuracy compared to imputation methods based on statistical procedures. Copyright © 2010 Elsevier B.V. All rights reserved.
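
    The contrast between a simple statistical imputation and a neighbour-based one can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `mean_impute` and `knn_impute`, the toy rows, the use of `None` for a missing value and the distance over jointly observed columns are all assumptions made for illustration.

```python
import math

def mean_impute(rows):
    """Replace each None by the mean of the observed values in its column."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        means.append(sum(observed) / len(observed))
    return [[means[j] if v is None else v for j, v in enumerate(r)] for r in rows]

def knn_impute(rows, k=2):
    """Replace each None by the average of that column over the k nearest
    rows in which the column is observed (hypothetical helper)."""
    def dist(a, b):
        # Root-mean-square difference over columns observed in both rows.
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    out = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, v in enumerate(row):
            if v is None:
                donors = [(dist(row, other), other[j])
                          for m, other in enumerate(rows)
                          if m != i and other[j] is not None]
                donors.sort(key=lambda t: t[0])
                nearest = donors[:k]
                out[i][j] = sum(val for _, val in nearest) / len(nearest)
    return out

# Toy data: the last record is missing its second feature.
rows = [[1.0, 10.0], [2.0, 20.0], [9.0, 90.0], [1.5, None]]
mean_filled = mean_impute(rows)
knn_filled = knn_impute(rows, k=2)
```

    On this toy data the column mean is pulled upward by the distant third record, whereas the neighbour-based value only uses the two records closest to the incomplete one.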

  14. Statistical analysis of accident data associated with sea transport (data from 1994-1997). Annex 1

    International Nuclear Information System (INIS)

    Schneider, T.; Tabarre, M.; Armingaud, F.

    2001-01-01

    This analysis is based on Lloyd's database concerning sea transport accidents for the 1994-1997 period and completes the previous analysis based on 1994 data. It gives an accurate description of the world fleet and the most severe ship accidents (total losses), as well as the frequencies of accident (on average over the 1994-1997 period, the frequency of accident for cargo carrying ships is 2.57 × 10^-3 losses per ship-year). Furthermore, an analysis has been performed on the ship casualties recorded by the Marine Accident Investigation Branch (MAIB) for UK vessels for the 1990-1996 period, this database including all accidents for which a declaration has been made to authorities (for example, the average frequency of fires derived from this analysis is 1.36 × 10^-2 per ship-year, this occurrence corresponding to the occurrence of initiating events of fire). Concerning fire accidents aboard ships supposed to be representative of the radioactive material transporters, a specific analysis was carried out by the French Bureau Veritas on a selection of the world casualties (total losses) for the 1978-1988 period. This analysis of the origin of the fire points out that it originates mainly in the machinery room and quarters. In a few cases the fire duration recorded is more than one day. (author)

  15. Scripts for TRUMP data analyses. Part II (HLA-related data): statistical analyses specific for hematopoietic stem cell transplantation.

    Science.gov (United States)

    Kanda, Junya

    2016-01-01

    The Transplant Registry Unified Management Program (TRUMP) made it possible for members of the Japan Society for Hematopoietic Cell Transplantation (JSHCT) to analyze large sets of national registry data on autologous and allogeneic hematopoietic stem cell transplantation. However, as the processes used to collect transplantation information are complex and differed over time, the background of these processes should be understood when using TRUMP data. Previously, information on the HLA locus of patients and donors had been collected using a questionnaire-based free-description method, resulting in some input errors. To correct minor but significant errors and provide accurate HLA matching data, the use of a Stata or EZR/R script offered by the JSHCT is strongly recommended when analyzing HLA data in the TRUMP dataset. The HLA mismatch direction, mismatch counting method, and different impacts of HLA mismatches by stem cell source are other important factors in the analysis of HLA data. Additionally, researchers should understand the statistical analyses specific for hematopoietic stem cell transplantation, such as competing risk, landmark analysis, and time-dependent analysis, to correctly analyze transplant data. The data center of the JSHCT can be contacted if statistical assistance is required.

  16. Remote sensing of atmospheric water content from Bhaskara SAMIR data. [using statistical linear regression analysis

    Science.gov (United States)

    Gohil, B. S.; Hariharan, T. A.; Sharma, A. K.; Pandey, P. C.

    1982-01-01

    The 19.35 GHz and 22.235 GHz passive microwave radiometers (SAMIR) on board the Indian satellite Bhaskara have provided very useful data. These data have demonstrated the feasibility of deriving atmospheric and ocean surface parameters such as water vapor content, liquid water content, rainfall rate and ocean surface winds. Different approaches have been tried for deriving the atmospheric water content. Statistical and empirical methods have been used by others for the analysis of the Nimbus data. A simulation technique has been attempted for the first time for 19.35 GHz and 22.235 GHz radiometer data. The results obtained from three different methods are compared with radiosonde data. A case study of a tropical depression has been undertaken to demonstrate the capability of Bhaskara SAMIR data to show the variation of total water vapor and liquid water contents.

  17. Statistical Literacy: High School Students in Reading, Interpreting and Presenting Data

    Science.gov (United States)

    Hafiyusholeh, M.; Budayasa, K.; Siswono, T. Y. E.

    2018-01-01

    One of the foundations for high school students in statistics is to be able to read data, present data in the form of tables and diagrams, and interpret them. The purpose of this study is to describe high school students’ competencies in reading, interpreting and presenting data. Subjects consisted of male and female students who had high levels of mathematical ability. Data were collected through task formulation and analyzed by reducing, presenting and verifying data. Results showed that the students read the data based on explicit explanations on the diagram, such as explaining the points in the diagram as the relation between the x and y axis and determining the simple trend of a graph, including the maximum and minimum point. In interpreting and summarizing the data, both subjects pay attention to general data trends and use them to predict increases or decreases in data. The male estimates the (n+1)th value of the weight data by using the mode of the data, while the female estimates the weight by using the average. The male tends not to consider the characteristics of the data, while the female considers the characteristics of the data more carefully.

  18. Statistical corruption in Beijing's air quality data has likely ended in 2012

    Science.gov (United States)

    Stoerk, Thomas

    2016-02-01

    This research documents changes in likely misreporting in official air quality data from Beijing for the years 2008-2013. It is shown that, consistent with prior research, the official Chinese data report suspiciously few observations that exceed the politically important Blue Sky Day threshold, a particular air pollution level used to evaluate local officials, and an excess of observations just below that threshold. Similar data, measured by the US Embassy in Beijing, do not show this irregularity. To document likely misreporting, this analysis proposes a new way of comparing air quality data via Benford's Law, a statistical regularity known to fit air pollution data. Using this method to compare the official data to the US Embassy data for the first time, I find that the Chinese data fit Benford's Law poorly until a change in air quality measurements at the end of 2012. From 2013 onwards, the Chinese data fit Benford's Law closely. The US Embassy data, by contrast, exhibit no variation over time in the fit with Benford's Law, implying that the underlying pollution processes remain unchanged. These findings suggest that misreporting of air quality data for Beijing has likely ended in 2012. Additionally, I use aerosol optical density data to show the general applicability of this method of detecting likely misreporting in air pollution data.
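
    The idea of checking a data series against Benford's Law can be sketched as follows. This is only an illustration, not the paper's procedure: the Pearson chi-square distance, the function names and the toy series are assumptions, and the abstract does not state which goodness-of-fit measure was used.

```python
import math
from collections import Counter

def benford_expected():
    """Benford's Law: P(d) = log10(1 + 1/d) for leading digit d = 1..9."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def leading_digit(x):
    """First significant digit of a non-zero number."""
    s = f"{abs(x):.15e}"  # scientific notation, so the first character is the digit
    return int(s[0])

def benford_chi2(values):
    """Pearson chi-square distance between observed leading digits and
    Benford's Law (assumed fit measure; smaller means a closer fit)."""
    digits = [leading_digit(v) for v in values if v != 0]
    n = len(digits)
    counts = Counter(digits)
    return sum((counts.get(d, 0) - n * p) ** 2 / (n * p)
               for d, p in benford_expected().items())

# Toy series: powers of two follow Benford's Law closely,
# whereas the integers 100..199 all start with the digit 1.
benford_like = [2 ** k for k in range(1, 301)]
irregular = list(range(100, 200))
fit_good = benford_chi2(benford_like)
fit_poor = benford_chi2(irregular)
```

    Comparing two such distances for two measurement series over the same period mirrors, in spirit, the comparison between official and independently measured data described above.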

  19. Understanding spatial organizations of chromosomes via statistical analysis of Hi-C data

    Science.gov (United States)

    Hu, Ming; Deng, Ke; Qin, Zhaohui; Liu, Jun S.

    2015-01-01

    Understanding how chromosomes fold provides insights into the transcription regulation, hence, the functional state of the cell. Using the next generation sequencing technology, the recently developed Hi-C approach enables a global view of spatial chromatin organization in the nucleus, which substantially expands our knowledge about genome organization and function. However, due to multiple layers of biases, noises and uncertainties buried in the protocol of Hi-C experiments, analyzing and interpreting Hi-C data poses great challenges, and requires novel statistical methods to be developed. This article provides an overview of recent Hi-C studies and their impacts on biomedical research, describes major challenges in statistical analysis of Hi-C data, and discusses some perspectives for future research. PMID:26124977

  20. Statistical analysis of fatigue strain-life data for carbon and low-alloy steels

    International Nuclear Information System (INIS)

    Keisler, J.; Chopra, O.K.

    1995-03-01

    The existing fatigue strain vs life (S-N) data, foreign and domestic, for carbon and low-alloy steels used in the construction of nuclear power plant components have been compiled and categorized according to material, loading, and environmental conditions. A statistical model has been developed for estimating the effects of the various test conditions on fatigue life. The results of a rigorous statistical analysis have been used to estimate the probability of initiating a fatigue crack. Data in the literature were reviewed to evaluate the effects of size, geometry, and surface finish of a component on its fatigue life. The fatigue S-N curves for components have been determined by applying design margins for size, geometry, and surface finish to crack initiation curves estimated from the model

  1. Statistical evaluation of internal contamination data in the man following the Chernobyl accident

    International Nuclear Information System (INIS)

    Tarroni, G.; Battisti, P.; Melandri, C.; Castellani, C.M.; Formignani, M.

    1989-01-01

    The main implications of general interest derived from the statistical analysis of the internal human contamination data obtained by ENEA-PAS with Whole Body Counter measurements performed in Bologna in consequence of the Chernobyl accident are presented. In particular, the trend with time of the individual body activity of members of a homogeneous group, the variability of individual contamination in relation to the mean contamination, the statistical distribution of the data, the significance of mean values concerning small, homogeneous groups of subjects, and the difference between subjects of different sex and its trend with time are examined. Finally, the substantial independence of the individual committed dose equivalent evaluation due to the Chernobyl contamination from the hypothesized values of the metabolic parameters is pointed out when the evaluation is performed on the basis of direct measurements with a Whole Body Counter

  2. Statistical analyses of the data on occupational radiation exposure at JPDR

    International Nuclear Information System (INIS)

    Kato, Shohei; Anazawa, Yutaka; Matsuno, Kenji; Furuta, Toshishiro; Akiyama, Isamu

    1980-01-01

    In the statistical analyses of the data on occupational radiation exposure at JPDR, statistical features were obtained as follows. (1) The individual doses followed a log-normal distribution. (2) In the distribution of doses from one job in a controlled area, the logarithm of the mean (μ) depended on the exposure rate r (mR/h), and σ correlated with the nature of the job and was normally distributed. These relations were as follows: μ = 0.48 ln r − 0.24, σ = 1.2 ± 0.58. (3) For the data containing different groups, the distribution of doses showed a polygonal line on log-normal probability paper. (4) Under the dose limitation, the distribution of the doses showed an asymptotic curve along the limit on log-normal probability paper. (author)

  3. Rule-based statistical data mining agents for an e-commerce application

    Science.gov (United States)

    Qin, Yi; Zhang, Yan-Qing; King, K. N.; Sunderraman, Rajshekhar

    2003-03-01

    Intelligent data mining techniques have useful e-Business applications. Because an e-Commerce application is related to multiple domains such as statistical analysis, market competition, price comparison, profit improvement and personal preferences, this paper presents a hybrid knowledge-based e-Commerce system fusing intelligent techniques, statistical data mining, and personal information to enhance QoS (Quality of Service) of e-Commerce. A Web-based e-Commerce application software system, eDVD Web Shopping Center, is successfully implemented using Java servlets and an Oracle8i database server. Simulation results have shown that the hybrid intelligent e-Commerce system is able to make smart decisions for different customers.

  4. Reasoning with data an introduction to traditional and Bayesian statistics using R

    CERN Document Server

    Stanton, Jeffrey M

    2017-01-01

    Engaging and accessible, this book teaches readers how to use inferential statistical thinking to check their assumptions, assess evidence about their beliefs, and avoid overinterpreting results that may look more promising than they really are. It provides step-by-step guidance for using both classical (frequentist) and Bayesian approaches to inference. Statistical techniques covered side by side from both frequentist and Bayesian approaches include hypothesis testing, replication, analysis of variance, calculation of effect sizes, regression, time series analysis, and more. Students also get a complete introduction to the open-source R programming language and its key packages. Throughout the text, simple commands in R demonstrate essential data analysis skills using real-data examples. The companion website provides annotated R code for the book's examples, in-class exercises, supplemental reading lists, and links to online videos, interactive materials, and other resources.

  5. Automatic Derivation of Statistical Data Analysis Algorithms: Planetary Nebulae and Beyond

    OpenAIRE

    Fischer, Bernd; Knuth, Kevin; Hajian, Arsen; Schumann, Johann

    2004-01-01

    AUTOBAYES is a fully automatic program synthesis system for the data analysis domain. Its input is a declarative problem description in form of a statistical model; its output is documented and optimized C/C++ code. The synthesis process relies on the combination of three key techniques. Bayesian networks are used as a compact internal representation mechanism which enables problem decompositions and guides the algorithm derivation. Program schemas are used as independently composable buildin...

  6. Statistical test data selection for reliability evaluation of process computer software

    International Nuclear Information System (INIS)

    Volkmann, K.P.; Hoermann, H.; Ehrenberger, W.

    1976-01-01

    The paper presents a concept for converting knowledge about the characteristics of process states into practicable procedures for the statistical selection of test cases in testing process computer software. Process states are defined as vectors whose components consist of values of input variables lying in discrete positions or within given limits. Two approaches for test data selection, based on knowledge about cases of demand, are outlined referring to a purely probabilistic method and to the mathematics of stratified sampling. (orig.) [de

  7. The bench scientist's guide to statistical analysis of RNA-Seq data

    OpenAIRE

    Yendrek, Craig R.; Ainsworth, Elizabeth A.; Thimmapuram, Jyothi

    2012-01-01

    Background: RNA sequencing (RNA-Seq) is emerging as a highly accurate method to quantify transcript abundance. However, analyses of the large data sets obtained by sequencing the entire transcriptome of organisms have generally been performed by bioinformatics specialists. Here we provide a step-by-step guide and outline a strategy using currently available statistical tools that results in a conservative list of differentially expressed genes. We also discuss potential sources of err...

  8. THE ANALYSIS OF STATISTICAL DATA ON MALIGNANT NEOPLASMS ASSOCIATED WITH HUMAN PAPILLOMAVIRUS

    Directory of Open Access Journals (Sweden)

    A. A. Kostin

    2016-01-01

    In this study of statistical data, the morbidity and mortality of patients with malignant neoplasms that may be associated with human papillomavirus (HPV) are analysed for the first time in Russia: cervical cancer, cancer of the vulva and vagina, cancer of the penis, cancer of the rectum, anal canal and rectosigmoid junction, and cancer of the pharynx and larynx.

  9. Production-distribution of electric power in France: 1997-98 statistical data

    International Nuclear Information System (INIS)

    1999-01-01

    This document has been produced using the annual survey carried out by the French direction of gas, electricity and coal (Digec). It brings together the main statistical data about the production, transport and consumption of electric power in France: 1997 and 1998 balance sheets, foreign exchanges, long-term evolutions, production with respect to the different energy sources, consumption in the different departments and regions. (J.S.)

  10. Imputing historical statistics, soils information, and other land-use data to crop area

    Science.gov (United States)

    Perry, C. R., Jr.; Willis, R. W.; Lautenschlager, L.

    1982-01-01

    In foreign crop condition monitoring, satellite acquired imagery is routinely used. To facilitate interpretation of this imagery, it is advantageous to have estimates of the crop types and their extent for small area units, i.e., grid cells on a map represent, at 60 deg latitude, an area nominally 25 by 25 nautical miles in size. The feasibility of imputing historical crop statistics, soils information, and other ancillary data to crop area for a province in Argentina is studied.

  11. Selecting the right statistical model for analysis of insect count data by using information theoretic measures.

    Science.gov (United States)

    Sileshi, G

    2006-10-01

    Researchers and regulatory agencies often make statistical inferences from insect count data using modelling approaches that assume homogeneous variance. Such models do not allow for formal appraisal of variability which in its different forms is the subject of interest in ecology. Therefore, the objectives of this paper were to (i) compare models suitable for handling variance heterogeneity and (ii) select optimal models to ensure valid statistical inferences from insect count data. The log-normal, standard Poisson, Poisson corrected for overdispersion, zero-inflated Poisson, the negative binomial distribution and zero-inflated negative binomial models were compared using six count datasets on foliage-dwelling insects and five families of soil-dwelling insects. Akaike's and Schwarz Bayesian information criteria were used for comparing the various models. Over 50% of the counts were zeros even in locally abundant species such as Ootheca bennigseni Weise, Mesoplatys ochroptera Stål and Diaecoderus spp. The Poisson model after correction for overdispersion and the standard negative binomial distribution model provided better description of the probability distribution of seven out of the 11 insects than the log-normal, standard Poisson, zero-inflated Poisson or zero-inflated negative binomial models. It is concluded that excess zeros and variance heterogeneity are common data phenomena in insect counts. If not properly modelled, these properties can invalidate the normal distribution assumptions resulting in biased estimation of ecological effects and jeopardizing the integrity of the scientific inferences. Therefore, it is recommended that statistical models appropriate for handling these data properties be selected using objective criteria to ensure efficient statistical inference.
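
    Model comparison by information criteria can be sketched for two of the candidate distributions, the standard Poisson and the negative binomial. This is a simplified illustration, not the paper's analysis: the counts are invented, the function names are hypothetical, and the negative binomial dispersion is found by a crude grid search rather than a proper optimiser.

```python
import math

def poisson_loglik(counts, mu):
    """Log-likelihood of i.i.d. Poisson counts with mean mu."""
    return sum(y * math.log(mu) - mu - math.lgamma(y + 1) for y in counts)

def negbin_loglik(counts, mu, k):
    """Log-likelihood of negative binomial counts with mean mu and
    dispersion k, so that variance = mu + mu**2 / k."""
    return sum(
        math.lgamma(y + k) - math.lgamma(k) - math.lgamma(y + 1)
        + k * math.log(k / (k + mu)) + y * math.log(mu / (k + mu))
        for y in counts
    )

def aic(loglik, n_params):
    """Akaike information criterion."""
    return 2 * n_params - 2 * loglik

def bic(loglik, n_params, n_obs):
    """Schwarz Bayesian information criterion."""
    return n_params * math.log(n_obs) - 2 * loglik

def compare_models(counts):
    n = len(counts)
    mu = sum(counts) / n  # maximum-likelihood mean for both models
    ll_pois = poisson_loglik(counts, mu)
    # Assumed shortcut: grid search for k on a log scale from 0.01 to 100.
    grid = [10 ** (e / 20) for e in range(-40, 41)]
    k_hat = max(grid, key=lambda k: negbin_loglik(counts, mu, k))
    ll_nb = negbin_loglik(counts, mu, k_hat)
    return {
        "poisson": {"aic": aic(ll_pois, 1), "bic": bic(ll_pois, 1, n)},
        "negbin": {"aic": aic(ll_nb, 2), "bic": bic(ll_nb, 2, n), "k": k_hat},
    }

# Invented counts with many zeros and a long right tail.
counts = [0] * 12 + [1, 1, 2, 3, 5, 8, 13, 21]
result = compare_models(counts)
```

    The model with the lower criterion value is preferred; zero-inflated variants could be added to the comparison in the same way by supplying their log-likelihoods and parameter counts.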

  12. Statistical distribution of time to crack initiation and initial crack size using service data

    Science.gov (United States)

    Heller, R. A.; Yang, J. N.

    1977-01-01

    Crack growth inspection data gathered during the service life of the C-130 Hercules airplane were used in conjunction with a crack propagation rule to estimate the distribution of crack initiation times and of initial crack sizes. A Bayesian statistical approach was used to calculate the fraction of undetected initiation times as a function of the inspection time and the reliability of the inspection procedure used.

  13. Incorporating big data into treatment plan evaluation: Development of statistical DVH metrics and visualization dashboards

    Directory of Open Access Journals (Sweden)

    Charles S. Mayo, PhD

    2017-07-01

    Conclusions: Statistical DVH offers an easy-to-read, detailed, and comprehensive way to visualize the quantitative comparison with historical experiences and among institutions. WES and GEM metrics offer a flexible means of incorporating discrete threshold-prioritizations and historic context into a set of standardized scoring metrics. Together, they provide a practical approach for incorporating big data into clinical practice for treatment plan evaluations.

  14. A Guideline to Univariate Statistical Analysis for LC/MS-Based Untargeted Metabolomics-Derived Data

    Directory of Open Access Journals (Sweden)

    Maria Vinaixa

    2012-10-01

    Several metabolomic software programs provide methods for peak picking, retention time alignment and quantification of metabolite features in LC/MS-based metabolomics. Statistical analysis, however, is needed in order to discover those features significantly altered between samples. By comparing the retention time and MS/MS data of a model compound to that from the altered feature of interest in the research sample, metabolites can then be unequivocally identified. This paper reports on a comprehensive overview of a workflow for statistical analysis to rank relevant metabolite features that will be selected for further MS/MS experiments. We focus on univariate data analysis applied in parallel on all detected features. Characteristics and challenges of this analysis are discussed and illustrated using four different real LC/MS untargeted metabolomic datasets. We demonstrate the influence of considering or violating the mathematical assumptions on which univariate statistical tests rely, using high-dimensional LC/MS datasets. Issues in data analysis such as determination of sample size, analytical variation, assumption of normality and homoscedasticity, or correction for multiple testing are discussed and illustrated in the context of our four untargeted LC/MS working examples.
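
    Correction for multiple testing, mentioned above, can be illustrated with a short sketch. The abstract does not say which correction the authors use; the Benjamini-Hochberg false discovery rate adjustment below is one common choice, picked here purely for illustration. The p-values are made up and the function name is hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical p-values, one per metabolite feature tested in parallel.
raw_p = [0.01, 0.04, 0.03, 0.005]
adj_p = benjamini_hochberg(raw_p)
significant = [i for i, p in enumerate(adj_p) if p < 0.05]
print(adj_p, significant)
```

    With thousands of features tested in parallel, an adjustment of this kind limits the expected share of false positives among the features carried forward to MS/MS experiments.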

  15. Landslide susceptibility mapping using GIS-based statistical models and Remote sensing data in tropical environment.

    Science.gov (United States)

    Shahabi, Himan; Hashim, Mazlan

    2015-04-22

    This research presents the results of GIS-based statistical models for landslide susceptibility mapping using geographic information system (GIS) and remote-sensing data for the Cameron Highlands area in Malaysia. Ten factors including slope, aspect, soil, lithology, NDVI, land cover, distance to drainage, precipitation, distance to fault, and distance to road were extracted from SAR data, SPOT 5 and WorldView-1 images. The relationships between the detected landslide locations and these ten related factors were identified by using GIS-based statistical models including analytical hierarchy process (AHP), weighted linear combination (WLC) and spatial multi-criteria evaluation (SMCE) models. The landslide inventory map, which has a total of 92 landslide locations, was created from numerous sources such as digital aerial photographs, AIRSAR data, WorldView-1 images, and field surveys. Then, 80% of the landslide inventory was used for training the statistical models and the remaining 20% was used for validation. The validation results using the Relative landslide density index (R-index) and Receiver operating characteristic (ROC) demonstrated that the SMCE model (accuracy 96%) predicts better than the AHP (accuracy 91%) and WLC (accuracy 89%) models. These landslide susceptibility maps would be useful for hazard mitigation and regional planning.

  16. Quantile regression for the statistical analysis of immunological data with many non-detects.

    Science.gov (United States)

    Eilers, Paul H C; Röder, Esther; Savelkoul, Huub F J; van Wijk, Roy Gerth

    2012-07-07

    Immunological parameters are hard to measure. A well-known problem is the occurrence of values below the detection limit, the non-detects. Non-detects are a nuisance, because classical statistical analyses, like ANOVA and regression, cannot be applied. The more advanced statistical techniques currently available for the analysis of datasets with non-detects can only be used if a small percentage of the data are non-detects. Quantile regression, a generalization of percentiles to regression models, models the median or higher percentiles and tolerates very high numbers of non-detects. We present a non-technical introduction and illustrate it with an application to real data from a clinical trial. We show that by using quantile regression, groups can be compared and that meaningful linear trends can be computed, even if more than half of the data consist of non-detects. Quantile regression is a valuable addition to the statistical methods that can be used for the analysis of immunological datasets with non-detects.
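
    Why higher percentiles tolerate non-detects can be seen in a small sketch. This is not quantile regression itself, only the underlying idea on a single sample: as long as the chosen percentile lies above the detection limit, its value does not depend on what is substituted for the non-detects, whereas the mean does. The data, the detection limit and the helper names are all hypothetical.

```python
import math

def nearest_rank_quantile(values, q):
    """Sample quantile by the nearest-rank method (0 < q <= 1)."""
    s = sorted(values)
    rank = max(1, math.ceil(q * len(s)))
    return s[rank - 1]

def substitute(data, fill):
    """Replace non-detects (None) with a fixed fill value."""
    return [fill if v is None else v for v in data]

LOD = 1.0  # assumed detection limit
# Hypothetical measurements; None marks a non-detect (60% of the sample).
measured = [None, None, None, None, None, None, 1.4, 2.1, 3.0, 5.2]

low = substitute(measured, 0.0)   # non-detects set to zero
high = substitute(measured, LOD)  # non-detects set to the detection limit

q75_low = nearest_rank_quantile(low, 0.75)
q75_high = nearest_rank_quantile(high, 0.75)
mean_low = sum(low) / len(low)
mean_high = sum(high) / len(high)

print(q75_low, q75_high)    # identical: the 75th percentile lies above the LOD
print(mean_low, mean_high)  # differ: the mean depends on the substituted value
```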

  17. Multivariate two-part statistics for analysis of correlated mass spectrometry data from multiple biological specimens.

    Science.gov (United States)

    Taylor, Sandra L; Ruhaak, L Renee; Weiss, Robert H; Kelly, Karen; Kim, Kyoungmi

    2017-01-01

    High-throughput mass spectrometry (MS) is now being used to profile small molecular compounds across multiple biological sample types from the same subjects with the goal of leveraging information across biospecimens. Multivariate statistical methods that combine information from all biospecimens could be more powerful than the usual univariate analyses. However, missing values are common in MS data and imputation can impact between-biospecimen correlation and multivariate analysis results. We propose two multivariate two-part statistics that accommodate missing values and combine data from all biospecimens to identify differentially regulated compounds. Statistical significance is determined using a multivariate permutation null distribution. Relative to univariate tests, the multivariate procedures detected more significant compounds in three biological datasets. In a simulation study, we showed that multi-biospecimen testing procedures were more powerful than single-biospecimen methods when compounds are differentially regulated in multiple biospecimens, but univariate methods can be more powerful if compounds are differentially regulated in only one biospecimen. We provide R functions to implement and illustrate our method as supplementary information. Contact: sltaylor@ucdavis.edu. Supplementary information: Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press.
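
    The idea of a permutation null distribution can be sketched on a much simpler case than the authors' multivariate two-part statistics, which are not reproduced here. The example below assumes a univariate difference in group means as the test statistic and enumerates every relabeling of two small groups; the data and function names are hypothetical.

```python
from itertools import combinations

def mean(xs):
    return sum(xs) / len(xs)

def exact_permutation_pvalue(group_a, group_b):
    """Two-sided exact permutation p-value for the difference in means."""
    pooled = group_a + group_b
    n_a = len(group_a)
    observed = abs(mean(group_a) - mean(group_b))
    extreme = 0
    total = 0
    # Every way of choosing which pooled observations are labeled "A".
    for idx in combinations(range(len(pooled)), n_a):
        chosen = set(idx)
        a = [pooled[i] for i in chosen]
        b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        # Small tolerance guards against floating-point ties.
        if abs(mean(a) - mean(b)) >= observed - 1e-12:
            extreme += 1
    return extreme / total

# Hypothetical compound intensities in two groups of subjects.
cases = [5.1, 4.8, 6.0, 5.5]
controls = [3.9, 4.1, 3.5, 4.4]
p_value = exact_permutation_pvalue(cases, controls)
print(p_value)
```

    For larger samples, full enumeration becomes infeasible and a random subset of relabelings is typically drawn instead.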

  18. Statistical Analysis of Reactor Pressure Vessel Fluence Calculation Benchmark Data Using Multiple Regression Techniques

    International Nuclear Information System (INIS)

    Carew, John F.; Finch, Stephen J.; Lois, Lambros

    2003-01-01

    The calculated >1-MeV pressure vessel fluence is used to determine the fracture toughness and integrity of the reactor pressure vessel. It is therefore of the utmost importance to ensure that the fluence prediction is accurate and unbiased. In practice, this assurance is provided by comparing the predictions of the calculational methodology with an extensive set of accurate benchmarks. A benchmarking database is used to provide an estimate of the overall average measurement-to-calculation (M/C) bias in the calculations ( ). This average is used as an ad-hoc multiplicative adjustment to the calculations to correct for the observed calculational bias. However, this average only provides a well-defined and valid adjustment of the fluence if the M/C data are homogeneous; i.e., the data are statistically independent and there is no correlation between subsets of M/C data. Typically, the identification of correlations between the errors in the database M/C values is difficult because the correlation is of the same magnitude as the random errors in the M/C data and varies substantially over the database. In this paper, an evaluation of a reactor dosimetry benchmark database is performed to determine the statistical validity of the adjustment to the calculated pressure vessel fluence. Physical mechanisms that could potentially introduce a correlation between the subsets of M/C ratios are identified and included in a multiple regression analysis of the M/C data. Rigorous statistical criteria are used to evaluate the homogeneity of the M/C data and determine the validity of the adjustment. For the database evaluated, the M/C data are found to be strongly correlated with dosimeter response threshold energy and dosimeter location (e.g., cavity versus in-vessel). It is shown that because of the inhomogeneity in the M/C data, for this database, the benchmark data do not provide a valid basis for adjusting the pressure vessel fluence. The statistical criteria and methods employed in

  19. Derivation from first principles of the statistical distribution of the mass peak intensities of MS data.

    Science.gov (United States)

    Ipsen, Andreas

    2015-02-03

    Despite the widespread use of mass spectrometry (MS) in a broad range of disciplines, the nature of MS data remains very poorly understood, and this places important constraints on the quality of MS data analysis as well as on the effectiveness of MS instrument design. In the following, a procedure for calculating the statistical distribution of the mass peak intensity for MS instruments that use analog-to-digital converters (ADCs) and electron multipliers is presented. It is demonstrated that the physical processes underlying the data-generation process, from the generation of the ions to the signal induced at the detector, and on to the digitization of the resulting voltage pulse, result in data that can be well-approximated by a Gaussian distribution whose mean and variance are determined by physically meaningful instrumental parameters. This allows for a very precise understanding of the signal-to-noise ratio of mass peak intensities and suggests novel ways of improving it. Moreover, it is a prerequisite for being able to address virtually all data analytical problems in downstream analyses in a statistically rigorous manner. The model is validated with experimental data.

  20. Architecture of a spatial data service system for statistical analysis and visualization of regional climate changes

    Science.gov (United States)

    Titov, A. G.; Okladnikov, I. G.; Gordov, E. P.

    2017-11-01

    The use of large geospatial datasets in climate change studies requires the development of a set of Spatial Data Infrastructure (SDI) elements, including geoprocessing and cartographical visualization web services. This paper presents the architecture of a geospatial OGC web service system as an integral part of a virtual research environment (VRE) general architecture for statistical processing and visualization of meteorological and climatic data. The architecture is a set of interconnected standalone SDI nodes with corresponding data storage systems. Each node runs specialized software, such as a geoportal, cartographical web services (WMS/WFS), a metadata catalog, and a MySQL database of technical metadata describing the geospatial datasets available for the node. It also contains geospatial data processing services (WPS) based on a modular computing backend that implements the statistical processing functionality and thus provides analysis of large datasets, with visualization of the results and export to files in standard formats (XML, binary, etc.). Some cartographical web services have been developed in a prototype of the system to provide capabilities to work with raster and vector geospatial data based on OGC web services. The distributed architecture presented allows easy addition of new nodes, computing and data storage systems, and provides a solid computational infrastructure for regional climate change studies based on modern Web and GIS technologies.