Bianca N.I. Eskelson; Hailemariam Temesgen; Tara M. Barrett
2009-01-01
Cavity tree and snag abundance data are highly variable and contain many zero observations. We predict cavity tree and snag abundance from variables that are readily available from forest cover maps or remotely sensed data using negative binomial (NB), zero-inflated NB, and zero-altered NB (ZANB) regression models as well as nearest neighbor (NN) imputation methods....
Eekhout, I.; Wiel, M.A. van de; Heymans, M.W.
2017-01-01
Background. Multiple imputation is a recommended method to handle missing data. For significance testing after multiple imputation, Rubin’s Rules (RR) are easily applied to pool parameter estimates. In a logistic regression model, to consider whether a categorical covariate with more than two levels
Chen, Qingxia; Ibrahim, Joseph G
2014-07-01
Multiple Imputation, Maximum Likelihood and Fully Bayesian methods are the three most commonly used model-based approaches in missing data problems. Although it is easy to show that when the responses are missing at random (MAR), the complete case analysis is unbiased and efficient, the aforementioned methods are still commonly used in practice for this setting. To examine the performance of and relationships between these three methods in this setting, we derive and investigate small sample and asymptotic expressions of the estimates and standard errors, and fully examine how these estimates are related for the three approaches in the linear regression model when the responses are MAR. We show that when the responses are MAR in the linear model, the estimates of the regression coefficients using these three methods are asymptotically equivalent to the complete case estimates under general conditions. One simulation and a real data set from a liver cancer clinical trial are given to compare the properties of these methods when the responses are MAR.
R package imputeTestbench to compare imputations methods for univariate time series
Bokde, Neeraj; Kulat, Kishore; Beck, Marcus W; Asencio-Cortés, Gualberto
2016-01-01
This paper describes the R package imputeTestbench that provides a testbench for comparing imputation methods for missing data in univariate time series. The imputeTestbench package can be used to simulate the amount and type of missing data in a complete dataset and compare filled data using different imputation methods. The user has the option to simulate missing data by removing observations completely at random or in blocks of different sizes. Several default imputation methods are includ...
Multiple Imputation of a Randomly Censored Covariate Improves Logistic Regression Analysis.
Atem, Folefac D; Qian, Jing; Maye, Jacqueline E; Johnson, Keith A; Betensky, Rebecca A
2016-01-01
Randomly censored covariates arise frequently in epidemiologic studies. The most commonly used methods, including complete case and single imputation or substitution, suffer from inefficiency and bias. They make strong parametric assumptions or they consider limit of detection censoring only. We employ multiple imputation, in conjunction with semi-parametric modeling of the censored covariate, to overcome these shortcomings and to facilitate robust estimation. We develop a multiple imputation approach for randomly censored covariates within the framework of a logistic regression model. We use the non-parametric estimate of the covariate distribution or the semiparametric Cox model estimate in the presence of additional covariates in the model. We evaluate this procedure in simulations, and compare its operating characteristics to those from the complete case analysis and a survival regression approach. We apply the procedures to an Alzheimer's study of the association between amyloid positivity and maternal age of onset of dementia. Multiple imputation achieves lower standard errors and higher power than the complete case approach under heavy and moderate censoring and is comparable under light censoring. The survival regression approach achieves the highest power among all procedures, but does not produce interpretable estimates of association. Multiple imputation offers a favorable alternative to complete case analysis and ad hoc substitution methods in the presence of randomly censored covariates within the framework of logistic regression.
The multiple imputation method: a case study involving secondary data analysis.
Walani, Salimah R; Cleland, Charles M
2015-05-01
To illustrate with the example of a secondary data analysis study the use of the multiple imputation method to replace missing data. Most large public datasets have missing data, which need to be handled by researchers conducting secondary data analysis studies. Multiple imputation is a technique widely used to replace missing values while preserving the sample size and sampling variability of the data. The 2004 National Sample Survey of Registered Nurses. The authors created a model to impute missing values using the chained equation method. They used imputation diagnostics procedures and conducted regression analysis of imputed data to determine the differences between the log hourly wages of internationally educated and US-educated registered nurses. The authors used multiple imputation procedures to replace missing values in a large dataset with 29,059 observations. Five multiple imputed datasets were created. Imputation diagnostics using time series and density plots showed that imputation was successful. The authors also present an example of the use of multiple imputed datasets to conduct regression analysis to answer a substantive research question. Multiple imputation is a powerful technique for imputing missing values in large datasets while preserving the sample size and variance of the data. Even though the chained equation method involves complex statistical computations, recent innovations in software and computation have made it possible for researchers to conduct this technique on large datasets. The authors recommend nurse researchers use multiple imputation methods for handling missing data to improve the statistical power and external validity of their studies.
Moura, Ricardo; Sinha, Bimal; Coelho, Carlos A.
2017-06-01
The recent popularity of the use of synthetic data as a Statistical Disclosure Control technique has enabled the development of several methods of generating and analyzing such data, but almost always relying in asymptotic distributions and in consequence being not adequate for small sample datasets. Thus, a likelihood-based exact inference procedure is derived for the matrix of regression coefficients of the multivariate regression model, for multiply imputed synthetic data generated via Posterior Predictive Sampling. Since it is based in exact distributions this procedure may even be used in small sample datasets. Simulation studies compare the results obtained from the proposed exact inferential procedure with the results obtained from an adaptation of Reiters combination rule to multiply imputed synthetic datasets and an application to the 2000 Current Population Survey is discussed.
Chaurasia, Ashok; Harel, Ofer
2015-02-10
Tests for regression coefficients such as global, local, and partial F-tests are common in applied research. In the framework of multiple imputation, there are several papers addressing tests for regression coefficients. However, for simultaneous hypothesis testing, the existing methods are computationally intensive because they involve calculation with vectors and (inversion of) matrices. In this paper, we propose a simple method based on the scalar entity, coefficient of determination, to perform (global, local, and partial) F-tests with multiply imputed data. The proposed method is evaluated using simulated data and applied to suicide prevention data. Copyright © 2014 John Wiley & Sons, Ltd.
Imputation methods for filling missing data in urban air pollution data for Malaysia
Directory of Open Access Journals (Sweden)
Nur Afiqah Zakaria
2018-06-01
Full Text Available The air quality measurement data obtained from the continuous ambient air quality monitoring (CAAQM station usually contained missing data. The missing observations of the data usually occurred due to machine failure, routine maintenance and human error. In this study, the hourly monitoring data of CO, O3, PM10, SO2, NOx, NO2, ambient temperature and humidity were used to evaluate four imputation methods (Mean Top Bottom, Linear Regression, Multiple Imputation and Nearest Neighbour. The air pollutants observations were simulated into four percentages of simulated missing data i.e. 5%, 10%, 15% and 20%. Performance measures namely the Mean Absolute Error, Root Mean Squared Error, Coefficient of Determination and Index of Agreement were used to describe the goodness of fit of the imputation methods. From the results of the performance measures, Mean Top Bottom method was selected as the most appropriate imputation method for filling in the missing values in air pollutants data.
Directory of Open Access Journals (Sweden)
Hardt Jochen
2012-12-01
Full Text Available Abstract Background Multiple imputation is becoming increasingly popular. Theoretical considerations as well as simulation studies have shown that the inclusion of auxiliary variables is generally of benefit. Methods A simulation study of a linear regression with a response Y and two predictors X1 and X2 was performed on data with n = 50, 100 and 200 using complete cases or multiple imputation with 0, 10, 20, 40 and 80 auxiliary variables. Mechanisms of missingness were either 100% MCAR or 50% MAR + 50% MCAR. Auxiliary variables had low (r=.10 vs. moderate correlations (r=.50 with X’s and Y. Results The inclusion of auxiliary variables can improve a multiple imputation model. However, inclusion of too many variables leads to downward bias of regression coefficients and decreases precision. When the correlations are low, inclusion of auxiliary variables is not useful. Conclusion More research on auxiliary variables in multiple imputation should be performed. A preliminary rule of thumb could be that the ratio of variables to cases with complete data should not go below 1 : 3.
Dealing with missing data in a multi-question depression scale: a comparison of imputation methods
Directory of Open Access Journals (Sweden)
Stuart Heather
2006-12-01
Full Text Available Abstract Background Missing data present a challenge to many research projects. The problem is often pronounced in studies utilizing self-report scales, and literature addressing different strategies for dealing with missing data in such circumstances is scarce. The objective of this study was to compare six different imputation techniques for dealing with missing data in the Zung Self-reported Depression scale (SDS. Methods 1580 participants from a surgical outcomes study completed the SDS. The SDS is a 20 question scale that respondents complete by circling a value of 1 to 4 for each question. The sum of the responses is calculated and respondents are classified as exhibiting depressive symptoms when their total score is over 40. Missing values were simulated by randomly selecting questions whose values were then deleted (a missing completely at random simulation. Additionally, a missing at random and missing not at random simulation were completed. Six imputation methods were then considered; 1 multiple imputation, 2 single regression, 3 individual mean, 4 overall mean, 5 participant's preceding response, and 6 random selection of a value from 1 to 4. For each method, the imputed mean SDS score and standard deviation were compared to the population statistics. The Spearman correlation coefficient, percent misclassified and the Kappa statistic were also calculated. Results When 10% of values are missing, all the imputation methods except random selection produce Kappa statistics greater than 0.80 indicating 'near perfect' agreement. MI produces the most valid imputed values with a high Kappa statistic (0.89, although both single regression and individual mean imputation also produced favorable results. As the percent of missing information increased to 30%, or when unbalanced missing data were introduced, MI maintained a high Kappa statistic. The individual mean and single regression method produced Kappas in the 'substantial agreement' range
Missing data imputation: focusing on single imputation.
Zhang, Zhongheng
2016-01-01
Complete case analysis is widely used for handling missing data, and it is the default method in many statistical packages. However, this method may introduce bias and some useful information will be omitted from analysis. Therefore, many imputation methods are developed to make gap end. The present article focuses on single imputation. Imputations with mean, median and mode are simple but, like complete case analysis, can introduce bias on mean and deviation. Furthermore, they ignore relationship with other variables. Regression imputation can preserve relationship between missing values and other variables. There are many sophisticated methods exist to handle missing values in longitudinal data. This article focuses primarily on how to implement R code to perform single imputation, while avoiding complex mathematical calculations.
Poyatos, Rafael; Sus, Oliver; Badiella, Llorenç; Mencuccini, Maurizio; Martínez-Vilalta, Jordi
2018-05-01
The ubiquity of missing data in plant trait databases may hinder trait-based analyses of ecological patterns and processes. Spatially explicit datasets with information on intraspecific trait variability are rare but offer great promise in improving our understanding of functional biogeography. At the same time, they offer specific challenges in terms of data imputation. Here we compare statistical imputation approaches, using varying levels of environmental information, for five plant traits (leaf biomass to sapwood area ratio, leaf nitrogen content, maximum tree height, leaf mass per area and wood density) in a spatially explicit plant trait dataset of temperate and Mediterranean tree species (Ecological and Forest Inventory of Catalonia, IEFC, dataset for Catalonia, north-east Iberian Peninsula, 31 900 km2). We simulated gaps at different missingness levels (10-80 %) in a complete trait matrix, and we used overall trait means, species means, k nearest neighbours (kNN), ordinary and regression kriging, and multivariate imputation using chained equations (MICE) to impute missing trait values. We assessed these methods in terms of their accuracy and of their ability to preserve trait distributions, multi-trait correlation structure and bivariate trait relationships. The relatively good performance of mean and species mean imputations in terms of accuracy masked a poor representation of trait distributions and multivariate trait structure. Species identity improved MICE imputations for all traits, whereas forest structure and topography improved imputations for some traits. No method performed best consistently for the five studied traits, but, considering all traits and performance metrics, MICE informed by relevant ecological variables gave the best results. However, at higher missingness (> 30 %), species mean imputations and regression kriging tended to outperform MICE for some traits. MICE informed by relevant ecological variables allowed us to fill the gaps in
Directory of Open Access Journals (Sweden)
R. Poyatos
2018-05-01
Full Text Available The ubiquity of missing data in plant trait databases may hinder trait-based analyses of ecological patterns and processes. Spatially explicit datasets with information on intraspecific trait variability are rare but offer great promise in improving our understanding of functional biogeography. At the same time, they offer specific challenges in terms of data imputation. Here we compare statistical imputation approaches, using varying levels of environmental information, for five plant traits (leaf biomass to sapwood area ratio, leaf nitrogen content, maximum tree height, leaf mass per area and wood density in a spatially explicit plant trait dataset of temperate and Mediterranean tree species (Ecological and Forest Inventory of Catalonia, IEFC, dataset for Catalonia, north-east Iberian Peninsula, 31 900 km2. We simulated gaps at different missingness levels (10–80 % in a complete trait matrix, and we used overall trait means, species means, k nearest neighbours (kNN, ordinary and regression kriging, and multivariate imputation using chained equations (MICE to impute missing trait values. We assessed these methods in terms of their accuracy and of their ability to preserve trait distributions, multi-trait correlation structure and bivariate trait relationships. The relatively good performance of mean and species mean imputations in terms of accuracy masked a poor representation of trait distributions and multivariate trait structure. Species identity improved MICE imputations for all traits, whereas forest structure and topography improved imputations for some traits. No method performed best consistently for the five studied traits, but, considering all traits and performance metrics, MICE informed by relevant ecological variables gave the best results. However, at higher missingness (> 30 %, species mean imputations and regression kriging tended to outperform MICE for some traits. MICE informed by relevant ecological variables
Which DTW Method Applied to Marine Univariate Time Series Imputation
Phan , Thi-Thu-Hong; Caillault , Émilie; Lefebvre , Alain; Bigand , André
2017-01-01
International audience; Missing data are ubiquitous in any domains of applied sciences. Processing datasets containing missing values can lead to a loss of efficiency and unreliable results, especially for large missing sub-sequence(s). Therefore, the aim of this paper is to build a framework for filling missing values in univariate time series and to perform a comparison of different similarity metrics used for the imputation task. This allows to suggest the most suitable methods for the imp...
Mikhchi, Abbas; Honarvar, Mahmood; Kashan, Nasser Emam Jomeh; Aminafshar, Mehdi
2016-06-21
Genotype imputation is an important tool for prediction of unknown genotypes for both unrelated individuals and parent-offspring trios. Several imputation methods are available and can either employ universal machine learning methods, or deploy algorithms dedicated to infer missing genotypes. In this research the performance of eight machine learning methods: Support Vector Machine, K-Nearest Neighbors, Extreme Learning Machine, Radial Basis Function, Random Forest, AdaBoost, LogitBoost, and TotalBoost compared in terms of the imputation accuracy, computation time and the factors affecting imputation accuracy. The methods employed using real and simulated datasets to impute the un-typed SNPs in parent-offspring trios. The tested methods show that imputation of parent-offspring trios can be accurate. The Random Forest and Support Vector Machine were more accurate than the other machine learning methods. The TotalBoost performed slightly worse than the other methods.The running times were different between methods. The ELM was always most fast algorithm. In case of increasing the sample size, the RBF requires long imputation time.The tested methods in this research can be an alternative for imputation of un-typed SNPs in low missing rate of data. However, it is recommended that other machine learning methods to be used for imputation. Copyright © 2016 Elsevier Ltd. All rights reserved.
Alternative Methods of Regression
Birkes, David
2011-01-01
Of related interest. Nonlinear Regression Analysis and its Applications Douglas M. Bates and Donald G. Watts ".an extraordinary presentation of concepts and methods concerning the use and analysis of nonlinear regression models.highly recommend[ed].for anyone needing to use and/or understand issues concerning the analysis of nonlinear regression models." --Technometrics This book provides a balance between theory and practice supported by extensive displays of instructive geometrical constructs. Numerous in-depth case studies illustrate the use of nonlinear regression analysis--with all data s
The Ability of Different Imputation Methods to Preserve the Significant Genes and Pathways in Cancer
Directory of Open Access Journals (Sweden)
Rosa Aghdam
2017-12-01
Full Text Available Deciphering important genes and pathways from incomplete gene expression data could facilitate a better understanding of cancer. Different imputation methods can be applied to estimate the missing values. In our study, we evaluated various imputation methods for their performance in preserving significant genes and pathways. In the first step, 5% genes are considered in random for two types of ignorable and non-ignorable missingness mechanisms with various missing rates. Next, 10 well-known imputation methods were applied to the complete datasets. The significance analysis of microarrays (SAM method was applied to detect the significant genes in rectal and lung cancers to showcase the utility of imputation approaches in preserving significant genes. To determine the impact of different imputation methods on the identification of important genes, the chi-squared test was used to compare the proportions of overlaps between significant genes detected from original data and those detected from the imputed datasets. Additionally, the significant genes are tested for their enrichment in important pathways, using the ConsensusPathDB. Our results showed that almost all the significant genes and pathways of the original dataset can be detected in all imputed datasets, indicating that there is no significant difference in the performance of various imputation methods tested. The source code and selected datasets are available on http://profiles.bs.ipm.ir/softwares/imputation_methods/.
Aghdam, Rosa; Baghfalaki, Taban; Khosravi, Pegah; Saberi Ansari, Elnaz
2017-12-01
Deciphering important genes and pathways from incomplete gene expression data could facilitate a better understanding of cancer. Different imputation methods can be applied to estimate the missing values. In our study, we evaluated various imputation methods for their performance in preserving significant genes and pathways. In the first step, 5% genes are considered in random for two types of ignorable and non-ignorable missingness mechanisms with various missing rates. Next, 10 well-known imputation methods were applied to the complete datasets. The significance analysis of microarrays (SAM) method was applied to detect the significant genes in rectal and lung cancers to showcase the utility of imputation approaches in preserving significant genes. To determine the impact of different imputation methods on the identification of important genes, the chi-squared test was used to compare the proportions of overlaps between significant genes detected from original data and those detected from the imputed datasets. Additionally, the significant genes are tested for their enrichment in important pathways, using the ConsensusPathDB. Our results showed that almost all the significant genes and pathways of the original dataset can be detected in all imputed datasets, indicating that there is no significant difference in the performance of various imputation methods tested. The source code and selected datasets are available on http://profiles.bs.ipm.ir/softwares/imputation_methods/. Copyright © 2017. Production and hosting by Elsevier B.V.
Traffic Speed Data Imputation Method Based on Tensor Completion
Directory of Open Access Journals (Sweden)
Bin Ran
2015-01-01
Full Text Available Traffic speed data plays a key role in Intelligent Transportation Systems (ITS; however, missing traffic data would affect the performance of ITS as well as Advanced Traveler Information Systems (ATIS. In this paper, we handle this issue by a novel tensor-based imputation approach. Specifically, tensor pattern is adopted for modeling traffic speed data and then High accurate Low Rank Tensor Completion (HaLRTC, an efficient tensor completion method, is employed to estimate the missing traffic speed data. This proposed method is able to recover missing entries from given entries, which may be noisy, considering severe fluctuation of traffic speed data compared with traffic volume. The proposed method is evaluated on Performance Measurement System (PeMS database, and the experimental results show the superiority of the proposed approach over state-of-the-art baseline approaches.
Traffic speed data imputation method based on tensor completion.
Ran, Bin; Tan, Huachun; Feng, Jianshuai; Liu, Ying; Wang, Wuhong
2015-01-01
Traffic speed data plays a key role in Intelligent Transportation Systems (ITS); however, missing traffic data would affect the performance of ITS as well as Advanced Traveler Information Systems (ATIS). In this paper, we handle this issue by a novel tensor-based imputation approach. Specifically, tensor pattern is adopted for modeling traffic speed data and then High accurate Low Rank Tensor Completion (HaLRTC), an efficient tensor completion method, is employed to estimate the missing traffic speed data. This proposed method is able to recover missing entries from given entries, which may be noisy, considering severe fluctuation of traffic speed data compared with traffic volume. The proposed method is evaluated on Performance Measurement System (PeMS) database, and the experimental results show the superiority of the proposed approach over state-of-the-art baseline approaches.
Jerez, José M; Molina, Ignacio; García-Laencina, Pedro J; Alba, Emilio; Ribelles, Nuria; Martín, Miguel; Franco, Leonardo
2010-10-01
Missing data imputation is an important task in cases where it is crucial to use all available data and not discard records with missing values. This work evaluates the performance of several statistical and machine learning imputation methods that were used to predict recurrence in patients in an extensive real breast cancer data set. Imputation methods based on statistical techniques, e.g., mean, hot-deck and multiple imputation, and machine learning techniques, e.g., multi-layer perceptron (MLP), self-organisation maps (SOM) and k-nearest neighbour (KNN), were applied to data collected through the "El Álamo-I" project, and the results were then compared to those obtained from the listwise deletion (LD) imputation method. The database includes demographic, therapeutic and recurrence-survival information from 3679 women with operable invasive breast cancer diagnosed in 32 different hospitals belonging to the Spanish Breast Cancer Research Group (GEICAM). The accuracies of predictions on early cancer relapse were measured using artificial neural networks (ANNs), in which different ANNs were estimated using the data sets with imputed missing values. The imputation methods based on machine learning algorithms outperformed imputation statistical methods in the prediction of patient outcome. Friedman's test revealed a significant difference (p=0.0091) in the observed area under the ROC curve (AUC) values, and the pairwise comparison test showed that the AUCs for MLP, KNN and SOM were significantly higher (p=0.0053, p=0.0048 and p=0.0071, respectively) than the AUC from the LD-based prognosis model. The methods based on machine learning techniques were the most suited for the imputation of missing values and led to a significant enhancement of prognosis accuracy compared to imputation methods based on statistical procedures. Copyright © 2010 Elsevier B.V. All rights reserved.
An Overview and Evaluation of Recent Machine Learning Imputation Methods Using Cardiac Imaging Data.
Liu, Yuzhe; Gopalakrishnan, Vanathi
2017-03-01
Many clinical research datasets have a large percentage of missing values that directly impacts their usefulness in yielding high accuracy classifiers when used for training in supervised machine learning. While missing value imputation methods have been shown to work well with smaller percentages of missing values, their ability to impute sparse clinical research data can be problem specific. We previously attempted to learn quantitative guidelines for ordering cardiac magnetic resonance imaging during the evaluation for pediatric cardiomyopathy, but missing data significantly reduced our usable sample size. In this work, we sought to determine if increasing the usable sample size through imputation would allow us to learn better guidelines. We first review several machine learning methods for estimating missing data. Then, we apply four popular methods (mean imputation, decision tree, k-nearest neighbors, and self-organizing maps) to a clinical research dataset of pediatric patients undergoing evaluation for cardiomyopathy. Using Bayesian Rule Learning (BRL) to learn ruleset models, we compared the performance of imputation-augmented models versus unaugmented models. We found that all four imputation-augmented models performed similarly to unaugmented models. While imputation did not improve performance, it did provide evidence for the robustness of our learned models.
DEFF Research Database (Denmark)
Fitzenberger, Bernd; Wilke, Ralf Andreas
2015-01-01
if the mean regression model does not. We provide a short informal introduction into the principle of quantile regression which includes an illustrative application from empirical labor market research. This is followed by briefly sketching the underlying statistical model for linear quantile regression based......Quantile regression is emerging as a popular statistical approach, which complements the estimation of conditional mean models. While the latter only focuses on one aspect of the conditional distribution of the dependent variable, the mean, quantile regression provides more detailed insights...... by modeling conditional quantiles. Quantile regression can therefore detect whether the partial effect of a regressor on the conditional quantiles is the same for all quantiles or differs across quantiles. Quantile regression can provide evidence for a statistical relationship between two variables even...
Directory of Open Access Journals (Sweden)
Abbas Mikhchi
2016-01-01
Full Text Available Abstract Background Genotype imputation is an important process of predicting unknown genotypes, which uses reference population with dense genotypes to predict missing genotypes for both human and animal genetic variations at a low cost. Machine learning methods specially boosting methods have been used in genetic studies to explore the underlying genetic profile of disease and build models capable of predicting missing values of a marker. Methods In this study strategies and factors affecting the imputation accuracy of parent-offspring trios compared from lower-density SNP panels (5 K to high density (10 K SNP panel using three different Boosting methods namely TotalBoost (TB, LogitBoost (LB and AdaBoost (AB. The methods employed using simulated data to impute the un-typed SNPs in parent-offspring trios. Four different datasets of G1 (100 trios with 5 k SNPs, G2 (100 trios with 10 k SNPs, G3 (500 trios with 5 k SNPs, and G4 (500 trio with 10 k SNPs were simulated. In four datasets all parents were genotyped completely, and offspring genotyped with a lower density panel. Results Comparison of the three methods for imputation showed that the LB outperformed AB and TB for imputation accuracy. The time of computation were different between methods. The AB was the fastest algorithm. The higher SNP densities resulted the increase of the accuracy of imputation. Larger trios (i.e. 500 was better for performance of LB and TB. Conclusions The conclusion is that the three methods do well in terms of imputation accuracy also the dense chip is recommended for imputation of parent-offspring trios.
Yozgatligil, Ceylan; Aslan, Sipan; Iyigun, Cem; Batmaz, Inci
2013-04-01
This study aims to compare several imputation methods to complete the missing values of spatio-temporal meteorological time series. To this end, six imputation methods are assessed with respect to various criteria including accuracy, robustness, precision, and efficiency for artificially created missing data in monthly total precipitation and mean temperature series obtained from the Turkish State Meteorological Service. Of these methods, simple arithmetic average, normal ratio (NR), and NR weighted with correlations comprise the simple ones, whereas multilayer perceptron type neural network and multiple imputation strategy adopted by Monte Carlo Markov Chain based on expectation-maximization (EM-MCMC) are computationally intensive ones. In addition, we propose a modification on the EM-MCMC method. Besides using a conventional accuracy measure based on squared errors, we also suggest the correlation dimension (CD) technique of nonlinear dynamic time series analysis which takes spatio-temporal dependencies into account for evaluating imputation performances. Depending on the detailed graphical and quantitative analysis, it can be said that although computational methods, particularly EM-MCMC method, are computationally inefficient, they seem favorable for imputation of meteorological time series with respect to different missingness periods considering both measures and both series studied. To conclude, using the EM-MCMC algorithm for imputing missing values before conducting any statistical analyses of meteorological data will definitely decrease the amount of uncertainty and give more robust results. Moreover, the CD measure can be suggested for the performance evaluation of missing data imputation particularly with computational methods since it gives more precise results in meteorological time series.
DEFF Research Database (Denmark)
Ma, Peipei; Brøndum, Rasmus Froberg; Qin, Zahng
2013-01-01
This study investigated the imputation accuracy of different methods, considering both the minor allele frequency and relatedness between individuals in the reference and test data sets. Two data sets from the combined population of Swedish and Finnish Red Cattle were used to test the influence...... coefficient was lower when the minor allele frequency was lower. The results indicate that Beagle and IMPUTE2 provide the most robust and accurate imputation accuracies, but considering computing time and memory usage, FImpute is another alternative method....
Fish, Laurel J.; Halcoussis, Dennis; Phillips, G. Michael
2017-01-01
The Monte Carlo method and related multiple imputation methods are traditionally used in math, physics and science to estimate and analyze data and are now becoming standard tools in analyzing business and financial problems. However, few sources explain the application of the Monte Carlo method for individuals and business professionals who are…
Missing value imputation in DNA microarrays based on conjugate gradient method.
Dorri, Fatemeh; Azmi, Paeiz; Dorri, Faezeh
2012-02-01
Analysis of gene expression profiles needs a complete matrix of gene array values; consequently, imputation methods have been suggested. In this paper, an algorithm that is based on conjugate gradient (CG) method is proposed to estimate missing values. k-nearest neighbors of the missed entry are first selected based on absolute values of their Pearson correlation coefficient. Then a subset of genes among the k-nearest neighbors is labeled as the best similar ones. CG algorithm with this subset as its input is then used to estimate the missing values. Our proposed CG based algorithm (CGimpute) is evaluated on different data sets. The results are compared with sequential local least squares (SLLSimpute), Bayesian principle component analysis (BPCAimpute), local least squares imputation (LLSimpute), iterated local least squares imputation (ILLSimpute) and adaptive k-nearest neighbors imputation (KNNKimpute) methods. The average of normalized root mean squares error (NRMSE) and relative NRMSE in different data sets with various missing rates shows CGimpute outperforms other methods. Copyright © 2011 Elsevier Ltd. All rights reserved.
Ratcliffe, B; El-Dien, O G; Klápště, J; Porth, I; Chen, C; Jaquish, B; El-Kassaby, Y A
2015-12-01
Genomic selection (GS) potentially offers an unparalleled advantage over traditional pedigree-based selection (TS) methods by reducing the time commitment required to carry out a single cycle of tree improvement. This quality is particularly appealing to tree breeders, where lengthy improvement cycles are the norm. We explored the prospect of implementing GS for interior spruce (Picea engelmannii × glauca) utilizing a genotyped population of 769 trees belonging to 25 open-pollinated families. A series of repeated tree height measurements through ages 3-40 years permitted the testing of GS methods temporally. The genotyping-by-sequencing (GBS) platform was used for single nucleotide polymorphism (SNP) discovery in conjunction with three unordered imputation methods applied to a data set with 60% missing information. Further, three diverse GS models were evaluated based on predictive accuracy (PA), and their marker effects. Moderate levels of PA (0.31-0.55) were observed and were of sufficient capacity to deliver improved selection response over TS. Additionally, PA varied substantially through time accordingly with spatial competition among trees. As expected, temporal PA was well correlated with age-age genetic correlation (r=0.99), and decreased substantially with increasing difference in age between the training and validation populations (0.04-0.47). Moreover, our imputation comparisons indicate that k-nearest neighbor and singular value decomposition yielded a greater number of SNPs and gave higher predictive accuracies than imputing with the mean. Furthermore, the ridge regression (rrBLUP) and BayesCπ (BCπ) models both yielded equal, and better PA than the generalized ridge regression heteroscedastic effect model for the traits evaluated.
Regression methods for medical research
Tai, Bee Choo
2013-01-01
Regression Methods for Medical Research provides medical researchers with the skills they need to critically read and interpret research using more advanced statistical methods. The statistical requirements of interpreting and publishing in medical journals, together with rapid changes in science and technology, increasingly demands an understanding of more complex and sophisticated analytic procedures.The text explains the application of statistical models to a wide variety of practical medical investigative studies and clinical trials. Regression methods are used to appropriately answer the
Poyatos, Rafael; Sus, Oliver; Vilà-Cabrera, Albert; Vayreda, Jordi; Badiella, Llorenç; Mencuccini, Maurizio; Martínez-Vilalta, Jordi
2016-04-01
Plant functional traits are increasingly being used in ecosystem ecology thanks to the growing availability of large ecological databases. However, these databases usually contain a large fraction of missing data because measuring plant functional traits systematically is labour-intensive and because most databases are compilations of datasets with different sampling designs. As a result, within a given database, there is an inevitable variability in the number of traits available for each data entry and/or the species coverage in a given geographical area. The presence of missing data may severely bias trait-based analyses, such as the quantification of trait covariation or trait-environment relationships and may hamper efforts towards trait-based modelling of ecosystem biogeochemical cycles. Several data imputation (i.e. gap-filling) methods have been recently tested on compiled functional trait databases, but the performance of imputation methods applied to a functional trait database with a regular spatial sampling has not been thoroughly studied. Here, we assess the effects of data imputation on five tree functional traits (leaf biomass to sapwood area ratio, foliar nitrogen, maximum height, specific leaf area and wood density) in the Ecological and Forest Inventory of Catalonia, an extensive spatial database (covering 31900 km2). We tested the performance of species mean imputation, single imputation by the k-nearest neighbors algorithm (kNN) and a multiple imputation method, Multivariate Imputation with Chained Equations (MICE) at different levels of missing data (10%, 30%, 50%, and 80%). We also assessed the changes in imputation performance when additional predictors (species identity, climate, forest structure, spatial structure) were added in kNN and MICE imputations. We evaluated the imputed datasets using a battery of indexes describing departure from the complete dataset in trait distribution, in the mean prediction error, in the correlation matrix
Comparison of different Methods for Univariate Time Series Imputation in R
Moritz, Steffen; Sardá, Alexis; Bartz-Beielstein, Thomas; Zaefferer, Martin; Stork, Jörg
2015-01-01
Missing values in datasets are a well-known problem and there are quite a lot of R packages offering imputation functions. But while imputation in general is well covered within R, it is hard to find functions for imputation of univariate time series. The problem is, most standard imputation techniques can not be applied directly. Most algorithms rely on inter-attribute correlations, while univariate time series imputation needs to employ time dependencies. This paper provides an overview of ...
Multiple Imputation of Predictor Variables Using Generalized Additive Models
de Jong, Roel; van Buuren, Stef; Spiess, Martin
2016-01-01
The sensitivity of multiple imputation methods to deviations from their distributional assumptions is investigated using simulations, where the parameters of scientific interest are the coefficients of a linear regression model, and values in predictor variables are missing at random. The
Peterson, Josh F.; Eden, Svetlana K.; Moons, Karel G.; Ikizler, T. Alp; Matheny, Michael E.
2013-01-01
Summary Background and objectives Baseline creatinine (BCr) is frequently missing in AKI studies. Common surrogate estimates can misclassify AKI and adversely affect the study of related outcomes. This study examined whether multiple imputation improved accuracy of estimating missing BCr beyond current recommendations to apply assumed estimated GFR (eGFR) of 75 ml/min per 1.73 m2 (eGFR 75). Design, setting, participants, & measurements From 41,114 unique adult admissions (13,003 with and 28,111 without BCr data) at Vanderbilt University Hospital between 2006 and 2008, a propensity score model was developed to predict likelihood of missing BCr. Propensity scoring identified 6502 patients with highest likelihood of missing BCr among 13,003 patients with known BCr to simulate a “missing” data scenario while preserving actual reference BCr. Within this cohort (n=6502), the ability of various multiple-imputation approaches to estimate BCr and classify AKI were compared with that of eGFR 75. Results All multiple-imputation methods except the basic one more closely approximated actual BCr than did eGFR 75. Total AKI misclassification was lower with multiple imputation (full multiple imputation + serum creatinine) (9.0%) than with eGFR 75 (12.3%; Pcreatinine) (15.3%) versus eGFR 75 (40.5%; P<0.001). Multiple imputation improved specificity and positive predictive value for detecting AKI at the expense of modestly decreasing sensitivity relative to eGFR 75. Conclusions Multiple imputation can improve accuracy in estimating missing BCr and reduce misclassification of AKI beyond currently proposed methods. PMID:23037980
Directory of Open Access Journals (Sweden)
Nawar Shara
Full Text Available Kidney and cardiovascular disease are widespread among populations with high prevalence of diabetes, such as American Indians participating in the Strong Heart Study (SHS. Studying these conditions simultaneously in longitudinal studies is challenging, because the morbidity and mortality associated with these diseases result in missing data, and these data are likely not missing at random. When such data are merely excluded, study findings may be compromised. In this article, a subset of 2264 participants with complete renal function data from Strong Heart Exams 1 (1989-1991, 2 (1993-1995, and 3 (1998-1999 was used to examine the performance of five methods used to impute missing data: listwise deletion, mean of serial measures, adjacent value, multiple imputation, and pattern-mixture. Three missing at random models and one non-missing at random model were used to compare the performance of the imputation techniques on randomly and non-randomly missing data. The pattern-mixture method was found to perform best for imputing renal function data that were not missing at random. Determining whether data are missing at random or not can help in choosing the imputation method that will provide the most accurate results.
Directory of Open Access Journals (Sweden)
Lotz Meredith J
2008-01-01
Full Text Available Abstract Background Gene expression data frequently contain missing values, however, most down-stream analyses for microarray experiments require complete data. In the literature many methods have been proposed to estimate missing values via information of the correlation patterns within the gene expression matrix. Each method has its own advantages, but the specific conditions for which each method is preferred remains largely unclear. In this report we describe an extensive evaluation of eight current imputation methods on multiple types of microarray experiments, including time series, multiple exposures, and multiple exposures × time series data. We then introduce two complementary selection schemes for determining the most appropriate imputation method for any given data set. Results We found that the optimal imputation algorithms (LSA, LLS, and BPCA are all highly competitive with each other, and that no method is uniformly superior in all the data sets we examined. The success of each method can also depend on the underlying "complexity" of the expression data, where we take complexity to indicate the difficulty in mapping the gene expression matrix to a lower-dimensional subspace. We developed an entropy measure to quantify the complexity of expression matrixes and found that, by incorporating this information, the entropy-based selection (EBS scheme is useful for selecting an appropriate imputation algorithm. We further propose a simulation-based self-training selection (STS scheme. This technique has been used previously for microarray data imputation, but for different purposes. The scheme selects the optimal or near-optimal method with high accuracy but at an increased computational cost. Conclusion Our findings provide insight into the problem of which imputation method is optimal for a given data set. Three top-performing methods (LSA, LLS and BPCA are competitive with each other. Global-based imputation methods (PLS, SVD, BPCA
Brock, Guy N; Shaffer, John R; Blakesley, Richard E; Lotz, Meredith J; Tseng, George C
2008-01-10
Gene expression data frequently contain missing values, however, most down-stream analyses for microarray experiments require complete data. In the literature many methods have been proposed to estimate missing values via information of the correlation patterns within the gene expression matrix. Each method has its own advantages, but the specific conditions for which each method is preferred remains largely unclear. In this report we describe an extensive evaluation of eight current imputation methods on multiple types of microarray experiments, including time series, multiple exposures, and multiple exposures x time series data. We then introduce two complementary selection schemes for determining the most appropriate imputation method for any given data set. We found that the optimal imputation algorithms (LSA, LLS, and BPCA) are all highly competitive with each other, and that no method is uniformly superior in all the data sets we examined. The success of each method can also depend on the underlying "complexity" of the expression data, where we take complexity to indicate the difficulty in mapping the gene expression matrix to a lower-dimensional subspace. We developed an entropy measure to quantify the complexity of expression matrixes and found that, by incorporating this information, the entropy-based selection (EBS) scheme is useful for selecting an appropriate imputation algorithm. We further propose a simulation-based self-training selection (STS) scheme. This technique has been used previously for microarray data imputation, but for different purposes. The scheme selects the optimal or near-optimal method with high accuracy but at an increased computational cost. Our findings provide insight into the problem of which imputation method is optimal for a given data set. Three top-performing methods (LSA, LLS and BPCA) are competitive with each other. Global-based imputation methods (PLS, SVD, BPCA) performed better on mcroarray data with lower complexity
A Nonparametric, Multiple Imputation-Based Method for the Retrospective Integration of Data Sets
Carrig, Madeline M.; Manrique-Vallier, Daniel; Ranby, Krista W.; Reiter, Jerome P.; Hoyle, Rick H.
2015-01-01
Complex research questions often cannot be addressed adequately with a single data set. One sensible alternative to the high cost and effort associated with the creation of large new data sets is to combine existing data sets containing variables related to the constructs of interest. The goal of the present research was to develop a flexible, broadly applicable approach to the integration of disparate data sets that is based on nonparametric multiple imputation and the collection of data from a convenient, de novo calibration sample. We demonstrate proof of concept for the approach by integrating three existing data sets containing items related to the extent of problematic alcohol use and associations with deviant peers. We discuss both necessary conditions for the approach to work well and potential strengths and weaknesses of the method compared to other data set integration approaches. PMID:26257437
A Time-Series Water Level Forecasting Model Based on Imputation and Variable Selection Method.
Yang, Jun-He; Cheng, Ching-Hsue; Chan, Chia-Pan
2017-01-01
Reservoirs are important for households and impact the national economy. This paper proposed a time-series forecasting model based on estimating a missing value followed by variable selection to forecast the reservoir's water level. This study collected data from the Taiwan Shimen Reservoir as well as daily atmospheric data from 2008 to 2015. The two datasets are concatenated into an integrated dataset based on ordering of the data as a research dataset. The proposed time-series forecasting model summarily has three foci. First, this study uses five imputation methods to directly delete the missing value. Second, we identified the key variable via factor analysis and then deleted the unimportant variables sequentially via the variable selection method. Finally, the proposed model uses a Random Forest to build the forecasting model of the reservoir's water level. This was done to compare with the listing method under the forecasting error. These experimental results indicate that the Random Forest forecasting model when applied to variable selection with full variables has better forecasting performance than the listing model. In addition, this experiment shows that the proposed variable selection can help determine five forecast methods used here to improve the forecasting capability.
A Time-Series Water Level Forecasting Model Based on Imputation and Variable Selection Method
Directory of Open Access Journals (Sweden)
Jun-He Yang
2017-01-01
Full Text Available Reservoirs are important for households and impact the national economy. This paper proposed a time-series forecasting model based on estimating a missing value followed by variable selection to forecast the reservoir’s water level. This study collected data from the Taiwan Shimen Reservoir as well as daily atmospheric data from 2008 to 2015. The two datasets are concatenated into an integrated dataset based on ordering of the data as a research dataset. The proposed time-series forecasting model summarily has three foci. First, this study uses five imputation methods to directly delete the missing value. Second, we identified the key variable via factor analysis and then deleted the unimportant variables sequentially via the variable selection method. Finally, the proposed model uses a Random Forest to build the forecasting model of the reservoir’s water level. This was done to compare with the listing method under the forecasting error. These experimental results indicate that the Random Forest forecasting model when applied to variable selection with full variables has better forecasting performance than the listing model. In addition, this experiment shows that the proposed variable selection can help determine five forecast methods used here to improve the forecasting capability.
Baker, Jannah; White, Nicole; Mengersen, Kerrie
2014-11-20
Spatial analysis is increasingly important for identifying modifiable geographic risk factors for disease. However, spatial health data from surveys are often incomplete, ranging from missing data for only a few variables, to missing data for many variables. For spatial analyses of health outcomes, selection of an appropriate imputation method is critical in order to produce the most accurate inferences. We present a cross-validation approach to select between three imputation methods for health survey data with correlated lifestyle covariates, using as a case study, type II diabetes mellitus (DM II) risk across 71 Queensland Local Government Areas (LGAs). We compare the accuracy of mean imputation to imputation using multivariate normal and conditional autoregressive prior distributions. Choice of imputation method depends upon the application and is not necessarily the most complex method. Mean imputation was selected as the most accurate method in this application. Selecting an appropriate imputation method for health survey data, after accounting for spatial correlation and correlation between covariates, allows more complete analysis of geographic risk factors for disease with more confidence in the results to inform public policy decision-making.
BRITS: Bidirectional Recurrent Imputation for Time Series
Cao, Wei; Wang, Dong; Li, Jian; Zhou, Hao; Li, Lei; Li, Yitan
2018-01-01
Time series are widely used as signals in many classification/regression tasks. It is ubiquitous that time series contains many missing values. Given multiple correlated time series data, how to fill in missing values and to predict their class labels? Existing imputation methods often impose strong assumptions of the underlying data generating process, such as linear dynamics in the state space. In this paper, we propose BRITS, a novel method based on recurrent neural networks for missing va...
Energy Technology Data Exchange (ETDEWEB)
Riggi, S., E-mail: sriggi@oact.inaf.it [INAF - Osservatorio Astrofisico di Catania (Italy); Riggi, D. [Keras Strategy - Milano (Italy); Riggi, F. [Dipartimento di Fisica e Astronomia - Università di Catania (Italy); INFN, Sezione di Catania (Italy)
2015-04-21
Identification of charged particles in a multilayer detector by the energy loss technique may also be achieved by the use of a neural network. The performance of the network becomes worse when a large fraction of information is missing, for instance due to detector inefficiencies. Algorithms which provide a way to impute missing information have been developed over the past years. Among the various approaches, we focused on normal mixtures’ models in comparison with standard mean imputation and multiple imputation methods. Further, to account for the intrinsic asymmetry of the energy loss data, we considered skew-normal mixture models and provided a closed form implementation in the Expectation-Maximization (EM) algorithm framework to handle missing patterns. The method has been applied to a test case where the energy losses of pions, kaons and protons in a six-layers’ Silicon detector are considered as input neurons to a neural network. Results are given in terms of reconstruction efficiency and purity of the various species in different momentum bins.
MacNeil Vroomen, Janet; Eekhout, Iris; Dijkgraaf, Marcel G; van Hout, Hein; de Rooij, Sophia E; Heymans, Martijn W; Bosmans, Judith E
2016-01-01
Cost and effect data often have missing data because economic evaluations are frequently added onto clinical studies where cost data are rarely the primary outcome. The objective of this article was to investigate which multiple imputation strategy is most appropriate to use for missing
Regression modeling methods, theory, and computation with SAS
Panik, Michael
2009-01-01
Regression Modeling: Methods, Theory, and Computation with SAS provides an introduction to a diverse assortment of regression techniques using SAS to solve a wide variety of regression problems. The author fully documents the SAS programs and thoroughly explains the output produced by the programs.The text presents the popular ordinary least squares (OLS) approach before introducing many alternative regression methods. It covers nonparametric regression, logistic regression (including Poisson regression), Bayesian regression, robust regression, fuzzy regression, random coefficients regression,
Evaluation and application of summary statistic imputation to discover new height-associated loci.
Rüeger, Sina; McDaid, Aaron; Kutalik, Zoltán
2018-05-01
As most of the heritability of complex traits is attributed to common and low frequency genetic variants, imputing them by combining genotyping chips and large sequenced reference panels is the most cost-effective approach to discover the genetic basis of these traits. Association summary statistics from genome-wide meta-analyses are available for hundreds of traits. Updating these to ever-increasing reference panels is very cumbersome as it requires reimputation of the genetic data, rerunning the association scan, and meta-analysing the results. A much more efficient method is to directly impute the summary statistics, termed as summary statistics imputation, which we improved to accommodate variable sample size across SNVs. Its performance relative to genotype imputation and practical utility has not yet been fully investigated. To this end, we compared the two approaches on real (genotyped and imputed) data from 120K samples from the UK Biobank and show that, genotype imputation boasts a 3- to 5-fold lower root-mean-square error, and better distinguishes true associations from null ones: We observed the largest differences in power for variants with low minor allele frequency and low imputation quality. For fixed false positive rates of 0.001, 0.01, 0.05, using summary statistics imputation yielded a decrease in statistical power by 9, 43 and 35%, respectively. To test its capacity to discover novel associations, we applied summary statistics imputation to the GIANT height meta-analysis summary statistics covering HapMap variants, and identified 34 novel loci, 19 of which replicated using data in the UK Biobank. Additionally, we successfully replicated 55 out of the 111 variants published in an exome chip study. Our study demonstrates that summary statistics imputation is a very efficient and cost-effective way to identify and fine-map trait-associated loci. Moreover, the ability to impute summary statistics is important for follow-up analyses, such as Mendelian
Guo, Wei-Li; Huang, De-Shuang
2017-08-22
Transcription factors (TFs) are DNA-binding proteins that have a central role in regulating gene expression. Identification of DNA-binding sites of TFs is a key task in understanding transcriptional regulation, cellular processes and disease. Chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq) enables genome-wide identification of in vivo TF binding sites. However, it is still difficult to map every TF in every cell line owing to cost and biological material availability, which poses an enormous obstacle for integrated analysis of gene regulation. To address this problem, we propose a novel computational approach, TFBSImpute, for predicting additional TF binding profiles by leveraging information from available ChIP-seq TF binding data. TFBSImpute fuses the dataset to a 3-mode tensor and imputes missing TF binding signals via simultaneous completion of multiple TF binding matrices with positional consistency. We show that signals predicted by our method achieve overall similarity with experimental data and that TFBSImpute significantly outperforms baseline approaches, by assessing the performance of imputation methods against observed ChIP-seq TF binding profiles. Besides, motif analysis shows that TFBSImpute preforms better in capturing binding motifs enriched in observed data compared with baselines, indicating that the higher performance of TFBSImpute is not simply due to averaging related samples. We anticipate that our approach will constitute a useful complement to experimental mapping of TF binding, which is beneficial for further study of regulation mechanisms and disease.
Stochastic development regression using method of moments
DEFF Research Database (Denmark)
Kühnel, Line; Sommer, Stefan Horst
2017-01-01
This paper considers the estimation problem arising when inferring parameters in the stochastic development regression model for manifold valued non-linear data. Stochastic development regression captures the relation between manifold-valued response and Euclidean covariate variables using...... the stochastic development construction. It is thereby able to incorporate several covariate variables and random effects. The model is intrinsically defined using the connection of the manifold, and the use of stochastic development avoids linearizing the geometry. We propose to infer parameters using...... the Method of Moments procedure that matches known constraints on moments of the observations conditional on the latent variables. The performance of the model is investigated in a simulation example using data on finite dimensional landmark manifolds....
Evaluating Imputation Algorithms for Low-Depth Genotyping-By-Sequencing (GBS Data.
Directory of Open Access Journals (Sweden)
Ariel W Chan
Full Text Available Well-powered genomic studies require genome-wide marker coverage across many individuals. For non-model species with few genomic resources, high-throughput sequencing (HTS methods, such as Genotyping-By-Sequencing (GBS, offer an inexpensive alternative to array-based genotyping. Although affordable, datasets derived from HTS methods suffer from sequencing error, alignment errors, and missing data, all of which introduce noise and uncertainty to variant discovery and genotype calling. Under such circumstances, meaningful analysis of the data is difficult. Our primary interest lies in the issue of how one can accurately infer or impute missing genotypes in HTS-derived datasets. Many of the existing genotype imputation algorithms and software packages were primarily developed by and optimized for the human genetics community, a field where a complete and accurate reference genome has been constructed and SNP arrays have, in large part, been the common genotyping platform. We set out to answer two questions: 1 can we use existing imputation methods developed by the human genetics community to impute missing genotypes in datasets derived from non-human species and 2 are these methods, which were developed and optimized to impute ascertained variants, amenable for imputation of missing genotypes at HTS-derived variants? We selected Beagle v.4, a widely used algorithm within the human genetics community with reportedly high accuracy, to serve as our imputation contender. We performed a series of cross-validation experiments, using GBS data collected from the species Manihot esculenta by the Next Generation (NEXTGEN Cassava Breeding Project. NEXTGEN currently imputes missing genotypes in their datasets using a LASSO-penalized, linear regression method (denoted 'glmnet'. We selected glmnet to serve as a benchmark imputation method for this reason. We obtained estimates of imputation accuracy by masking a subset of observed genotypes, imputing, and
Evaluating Imputation Algorithms for Low-Depth Genotyping-By-Sequencing (GBS) Data.
Chan, Ariel W; Hamblin, Martha T; Jannink, Jean-Luc
2016-01-01
Well-powered genomic studies require genome-wide marker coverage across many individuals. For non-model species with few genomic resources, high-throughput sequencing (HTS) methods, such as Genotyping-By-Sequencing (GBS), offer an inexpensive alternative to array-based genotyping. Although affordable, datasets derived from HTS methods suffer from sequencing error, alignment errors, and missing data, all of which introduce noise and uncertainty to variant discovery and genotype calling. Under such circumstances, meaningful analysis of the data is difficult. Our primary interest lies in the issue of how one can accurately infer or impute missing genotypes in HTS-derived datasets. Many of the existing genotype imputation algorithms and software packages were primarily developed by and optimized for the human genetics community, a field where a complete and accurate reference genome has been constructed and SNP arrays have, in large part, been the common genotyping platform. We set out to answer two questions: 1) can we use existing imputation methods developed by the human genetics community to impute missing genotypes in datasets derived from non-human species and 2) are these methods, which were developed and optimized to impute ascertained variants, amenable for imputation of missing genotypes at HTS-derived variants? We selected Beagle v.4, a widely used algorithm within the human genetics community with reportedly high accuracy, to serve as our imputation contender. We performed a series of cross-validation experiments, using GBS data collected from the species Manihot esculenta by the Next Generation (NEXTGEN) Cassava Breeding Project. NEXTGEN currently imputes missing genotypes in their datasets using a LASSO-penalized, linear regression method (denoted 'glmnet'). We selected glmnet to serve as a benchmark imputation method for this reason. We obtained estimates of imputation accuracy by masking a subset of observed genotypes, imputing, and calculating the
Directory of Open Access Journals (Sweden)
Katya L Masconi
Full Text Available Imputation techniques used to handle missing data are based on the principle of replacement. It is widely advocated that multiple imputation is superior to other imputation methods, however studies have suggested that simple methods for filling missing data can be just as accurate as complex methods. The objective of this study was to implement a number of simple and more complex imputation methods, and assess the effect of these techniques on the performance of undiagnosed diabetes risk prediction models during external validation.Data from the Cape Town Bellville-South cohort served as the basis for this study. Imputation methods and models were identified via recent systematic reviews. Models' discrimination was assessed and compared using C-statistic and non-parametric methods, before and after recalibration through simple intercept adjustment.The study sample consisted of 1256 individuals, of whom 173 were excluded due to previously diagnosed diabetes. Of the final 1083 individuals, 329 (30.4% had missing data. Family history had the highest proportion of missing data (25%. Imputation of the outcome, undiagnosed diabetes, was highest in stochastic regression imputation (163 individuals. Overall, deletion resulted in the lowest model performances while simple imputation yielded the highest C-statistic for the Cambridge Diabetes Risk model, Kuwaiti Risk model, Omani Diabetes Risk model and Rotterdam Predictive model. Multiple imputation only yielded the highest C-statistic for the Rotterdam Predictive model, which were matched by simpler imputation methods.Deletion was confirmed as a poor technique for handling missing data. However, despite the emphasized disadvantages of simpler imputation methods, this study showed that implementing these methods results in similar predictive utility for undiagnosed diabetes when compared to multiple imputation.
Directory of Open Access Journals (Sweden)
Torres Munguía, Juan Armando
2014-06-01
Full Text Available This paper examines the sample proportions estimates in the presence of univariate missing categorical data. A database about smoking habits (2011 National Addiction Survey of Mexico was used to create simulated yet realistic datasets at rates 5% and 15% of missingness, each for MCAR, MAR and MNAR mechanisms. Then the performance of six methods for addressing missingness is evaluated: listwise, mode imputation, random imputation, hot-deck, imputation by polytomous regression and random forests. Results showed that the most effective methods for dealing with missing categorical data in most of the scenarios assessed in this paper were hot-deck and polytomous regression approaches. || El presente estudio examina la estimación de proporciones muestrales en la presencia de valores faltantes en una variable categórica. Se utiliza una encuesta de consumo de tabaco (Encuesta Nacional de Adicciones de México 2011 para crear bases de datos simuladas pero reales con 5% y 15% de valores perdidos para cada mecanismo de no respuesta MCAR, MAR y MNAR. Se evalúa el desempeño de seis métodos para tratar la falta de respuesta: listwise, imputación de moda, imputación aleatoria, hot-deck, imputación por regresión politómica y árboles de clasificación. Los resultados de las simulaciones indican que los métodos más efectivos para el tratamiento de la no respuesta en variables categóricas, bajo los escenarios simulados, son hot-deck y la regresión politómica.
Multiple imputation and its application
Carpenter, James
2013-01-01
A practical guide to analysing partially observed data. Collecting, analysing and drawing inferences from data is central to research in the medical and social sciences. Unfortunately, it is rarely possible to collect all the intended data. The literature on inference from the resulting incomplete data is now huge, and continues to grow both as methods are developed for large and complex data structures, and as increasing computer power and suitable software enable researchers to apply these methods. This book focuses on a particular statistical method for analysing and drawing inferences from incomplete data, called Multiple Imputation (MI). MI is attractive because it is both practical and widely applicable. The authors aim is to clarify the issues raised by missing data, describing the rationale for MI, the relationship between the various imputation models and associated algorithms and its application to increasingly complex data structures. Multiple Imputation and its Application: Discusses the issues ...
Method for nonlinear exponential regression analysis
Junkin, B. G.
1972-01-01
Two computer programs developed according to two general types of exponential models for conducting nonlinear exponential regression analysis are described. Least squares procedure is used in which the nonlinear problem is linearized by expanding in a Taylor series. Program is written in FORTRAN 5 for the Univac 1108 computer.
A method for nonlinear exponential regression analysis
Junkin, B. G.
1971-01-01
A computer-oriented technique is presented for performing a nonlinear exponential regression analysis on decay-type experimental data. The technique involves the least squares procedure wherein the nonlinear problem is linearized by expansion in a Taylor series. A linear curve fitting procedure for determining the initial nominal estimates for the unknown exponential model parameters is included as an integral part of the technique. A correction matrix was derived and then applied to the nominal estimate to produce an improved set of model parameters. The solution cycle is repeated until some predetermined criterion is satisfied.
Kenneth B. Pierce; Janet L. Ohmann; Michael C. Wimberly; Matthew J. Gregory; Jeremy S. Fried
2009-01-01
Land managers need consistent information about the geographic distribution of wildland fuels and forest structure over large areas to evaluate fire risk and plan fuel treatments. We compared spatial predictions for 12 fuel and forest structure variables across three regions in the western United States using gradient nearest neighbor (GNN) imputation, linear models (...
Thomas, A M; Cook, L J; Dean, J M; Olson, L M
2014-01-01
To compare results from high probability matched sets versus imputed matched sets across differing levels of linkage information. A series of linkages with varying amounts of available information were performed on two simulated datasets derived from multiyear motor vehicle crash (MVC) and hospital databases, where true matches were known. Distributions of high probability and imputed matched sets were compared against the true match population for occupant age, MVC county, and MVC hour. Regression models were fit to simulated log hospital charges and hospitalization status. High probability and imputed matched sets were not significantly different from occupant age, MVC county, and MVC hour in high information settings (p > 0.999). In low information settings, high probability matched sets were significantly different from occupant age and MVC county (p sets were not (p > 0.493). High information settings saw no significant differences in inference of simulated log hospital charges and hospitalization status between the two methods. High probability and imputed matched sets were significantly different from the outcomes in low information settings; however, imputed matched sets were more robust. The level of information available to a linkage is an important consideration. High probability matched sets are suitable for high to moderate information settings and for situations involving case-specific analysis. Conversely, imputed matched sets are preferable for low information settings when conducting population-based analyses.
Directory of Open Access Journals (Sweden)
Duygu KOÇAK
2017-11-01
Full Text Available The study aims to identify the effects of iteration numbers used in multiple iteration method, one of the methods used to cope with missing values, on the results of factor analysis. With this aim, artificial datasets of different sample sizes were created. Missing values at random and missing values at complete random were created in various ratios by deleting data. For the data in random missing values, a second variable was iterated at ordinal scale level and datasets with different ratios of missing values were obtained based on the levels of this variable. The data were generated using “psych” program in R software, while “dplyr” program was used to create codes that would delete values according to predetermined conditions of missing value mechanism. Different datasets were generated by applying different iteration numbers. Explanatory factor analysis was conducted on the datasets completed and the factors and total explained variances are presented. These values were first evaluated based on the number of factors and total variance explained of the complete datasets. The results indicate that multiple iteration method yields a better performance in cases of missing values at random compared to datasets with missing values at complete random. Also, it was found that increasing the number of iterations in both missing value datasets decreases the difference in the results obtained from complete datasets.
A nonparametric multiple imputation approach for missing categorical data
Directory of Open Access Journals (Sweden)
Muhan Zhou
2017-06-01
Full Text Available Abstract Background Incomplete categorical variables with more than two categories are common in public health data. However, most of the existing missing-data methods do not use the information from nonresponse (missingness probabilities. Methods We propose a nearest-neighbour multiple imputation approach to impute a missing at random categorical outcome and to estimate the proportion of each category. The donor set for imputation is formed by measuring distances between each missing value with other non-missing values. The distance function is calculated based on a predictive score, which is derived from two working models: one fits a multinomial logistic regression for predicting the missing categorical outcome (the outcome model and the other fits a logistic regression for predicting missingness probabilities (the missingness model. A weighting scheme is used to accommodate contributions from two working models when generating the predictive score. A missing value is imputed by randomly selecting one of the non-missing values with the smallest distances. We conduct a simulation to evaluate the performance of the proposed method and compare it with several alternative methods. A real-data application is also presented. Results The simulation study suggests that the proposed method performs well when missingness probabilities are not extreme under some misspecifications of the working models. However, the calibration estimator, which is also based on two working models, can be highly unstable when missingness probabilities for some observations are extremely high. In this scenario, the proposed method produces more stable and better estimates. In addition, proper weights need to be chosen to balance the contributions from the two working models and achieve optimal results for the proposed method. Conclusions We conclude that the proposed multiple imputation method is a reasonable approach to dealing with missing categorical outcome data with
Mistler, Stephen A.; Enders, Craig K.
2017-01-01
Multiple imputation methods can generally be divided into two broad frameworks: joint model (JM) imputation and fully conditional specification (FCS) imputation. JM draws missing values simultaneously for all incomplete variables using a multivariate distribution, whereas FCS imputes variables one at a time from a series of univariate conditional…
Energy Technology Data Exchange (ETDEWEB)
Ferreira, A. M.
2004-07-01
This study concerns about different methods for analysing and imputation missing values in wind speed series. The algorithm EM and a methodology derivated from the sequential hot deck have been utilized. Series with missing values imputed are compared with original and complete series, using several criteria, such the wind potential; and appears to exist a significant goodness of fit between the estimates and real values. (Author)
A New Missing Data Imputation Algorithm Applied to Electrical Data Loggers
Directory of Open Access Journals (Sweden)
Concepción Crespo Turrado
2015-12-01
Full Text Available Nowadays, data collection is a key process in the study of electrical power networks when searching for harmonics and a lack of balance among phases. In this context, the lack of data of any of the main electrical variables (phase-to-neutral voltage, phase-to-phase voltage, and current in each phase and power factor adversely affects any time series study performed. When this occurs, a data imputation process must be accomplished in order to substitute the data that is missing for estimated values. This paper presents a novel missing data imputation method based on multivariate adaptive regression splines (MARS and compares it with the well-known technique called multivariate imputation by chained equations (MICE. The results obtained demonstrate how the proposed method outperforms the MICE algorithm.
Gaussian mixture clustering and imputation of microarray data.
Ouyang, Ming; Welsh, William J; Georgopoulos, Panos
2004-04-12
In microarray experiments, missing entries arise from blemishes on the chips. In large-scale studies, virtually every chip contains some missing entries and more than 90% of the genes are affected. Many analysis methods require a full set of data. Either those genes with missing entries are excluded, or the missing entries are filled with estimates prior to the analyses. This study compares methods of missing value estimation. Two evaluation metrics of imputation accuracy are employed. First, the root mean squared error measures the difference between the true values and the imputed values. Second, the number of mis-clustered genes measures the difference between clustering with true values and that with imputed values; it examines the bias introduced by imputation to clustering. The Gaussian mixture clustering with model averaging imputation is superior to all other imputation methods, according to both evaluation metrics, on both time-series (correlated) and non-time series (uncorrelated) data sets.
Ridge regression estimator: combining unbiased and ordinary ridge regression methods of estimation
Directory of Open Access Journals (Sweden)
Sharad Damodar Gore
2009-10-01
Full Text Available Statistical literature has several methods for coping with multicollinearity. This paper introduces a new shrinkage estimator, called modified unbiased ridge (MUR. This estimator is obtained from unbiased ridge regression (URR in the same way that ordinary ridge regression (ORR is obtained from ordinary least squares (OLS. Properties of MUR are derived. Results on its matrix mean squared error (MMSE are obtained. MUR is compared with ORR and URR in terms of MMSE. These results are illustrated with an example based on data generated by Hoerl and Kennard (1975.
A multiple regression method for genomewide association studies ...
Indian Academy of Sciences (India)
Bujun Mei
2018-06-07
Jun 7, 2018 ... Similar to the typical genomewide association tests using LD ... new approach performed validly when the multiple regression based on linkage method was employed. .... the model, two groups of scenarios were simulated.
BOX-COX REGRESSION METHOD IN TIME SCALING
Directory of Open Access Journals (Sweden)
ATİLLA GÖKTAŞ
2013-06-01
Full Text Available Box-Cox regression method with λj, for j = 1, 2, ..., k, power transformation can be used when dependent variable and error term of the linear regression model do not satisfy the continuity and normality assumptions. The situation obtaining the smallest mean square error when optimum power λj, transformation for j = 1, 2, ..., k, of Y has been discussed. Box-Cox regression method is especially appropriate to adjust existence skewness or heteroscedasticity of error terms for a nonlinear functional relationship between dependent and explanatory variables. In this study, the advantage and disadvantage use of Box-Cox regression method have been discussed in differentiation and differantial analysis of time scale concept.
Välikangas, Tommi; Suomi, Tomi; Elo, Laura L
2017-05-31
Label-free mass spectrometry (MS) has developed into an important tool applied in various fields of biological and life sciences. Several software exist to process the raw MS data into quantified protein abundances, including open source and commercial solutions. Each software includes a set of unique algorithms for different tasks of the MS data processing workflow. While many of these algorithms have been compared separately, a thorough and systematic evaluation of their overall performance is missing. Moreover, systematic information is lacking about the amount of missing values produced by the different proteomics software and the capabilities of different data imputation methods to account for them.In this study, we evaluated the performance of five popular quantitative label-free proteomics software workflows using four different spike-in data sets. Our extensive testing included the number of proteins quantified and the number of missing values produced by each workflow, the accuracy of detecting differential expression and logarithmic fold change and the effect of different imputation and filtering methods on the differential expression results. We found that the Progenesis software performed consistently well in the differential expression analysis and produced few missing values. The missing values produced by the other software decreased their performance, but this difference could be mitigated using proper data filtering or imputation methods. Among the imputation methods, we found that the local least squares (lls) regression imputation consistently increased the performance of the software in the differential expression analysis, and a combination of both data filtering and local least squares imputation increased performance the most in the tested data sets. © The Author 2017. Published by Oxford University Press.
On two flexible methods of 2-dimensional regression analysis
Czech Academy of Sciences Publication Activity Database
Volf, Petr
2012-01-01
Roč. 18, č. 4 (2012), s. 154-164 ISSN 1803-9782 Grant - others:GA ČR(CZ) GAP209/10/2045 Institutional support: RVO:67985556 Keywords : regression analysis * Gordon surface * prediction error * projection pursuit Subject RIV: BB - Applied Statistics, Operational Research http://library.utia.cas.cz/separaty/2013/SI/volf-on two flexible methods of 2-dimensional regression analysis.pdf
Golino, Hudson F.; Gomes, Cristiano M. A.
2016-01-01
This paper presents a non-parametric imputation technique, named random forest, from the machine learning field. The random forest procedure has two main tuning parameters: the number of trees grown in the prediction and the number of predictors used. Fifty experimental conditions were created in the imputation procedure, with different…
Thermal Efficiency Degradation Diagnosis Method Using Regression Model
International Nuclear Information System (INIS)
Jee, Chang Hyun; Heo, Gyun Young; Jang, Seok Won; Lee, In Cheol
2011-01-01
This paper proposes an idea for thermal efficiency degradation diagnosis in turbine cycles, which is based on turbine cycle simulation under abnormal conditions and a linear regression model. The correlation between the inputs for representing degradation conditions (normally unmeasured but intrinsic states) and the simulation outputs (normally measured but superficial states) was analyzed with the linear regression model. The regression models can inversely response an associated intrinsic state for a superficial state observed from a power plant. The diagnosis method proposed herein is classified into three processes, 1) simulations for degradation conditions to get measured states (referred as what-if method), 2) development of the linear model correlating intrinsic and superficial states, and 3) determination of an intrinsic state using the superficial states of current plant and the linear regression model (referred as inverse what-if method). The what-if method is to generate the outputs for the inputs including various root causes and/or boundary conditions whereas the inverse what-if method is the process of calculating the inverse matrix with the given superficial states, that is, component degradation modes. The method suggested in this paper was validated using the turbine cycle model for an operating power plant
Penalized regression procedures for variable selection in the potential outcomes framework.
Ghosh, Debashis; Zhu, Yeying; Coffman, Donna L
2015-05-10
A recent topic of much interest in causal inference is model selection. In this article, we describe a framework in which to consider penalized regression approaches to variable selection for causal effects. The framework leads to a simple 'impute, then select' class of procedures that is agnostic to the type of imputation algorithm as well as penalized regression used. It also clarifies how model selection involves a multivariate regression model for causal inference problems and that these methods can be applied for identifying subgroups in which treatment effects are homogeneous. Analogies and links with the literature on machine learning methods, missing data, and imputation are drawn. A difference least absolute shrinkage and selection operator algorithm is defined, along with its multiple imputation analogs. The procedures are illustrated using a well-known right-heart catheterization dataset. Copyright © 2015 John Wiley & Sons, Ltd.
Directory of Open Access Journals (Sweden)
Rui Martiniano
2017-07-01
Full Text Available We analyse new genomic data (0.05-2.95x from 14 ancient individuals from Portugal distributed from the Middle Neolithic (4200-3500 BC to the Middle Bronze Age (1740-1430 BC and impute genomewide diploid genotypes in these together with published ancient Eurasians. While discontinuity is evident in the transition to agriculture across the region, sensitive haplotype-based analyses suggest a significant degree of local hunter-gatherer contribution to later Iberian Neolithic populations. A more subtle genetic influx is also apparent in the Bronze Age, detectable from analyses including haplotype sharing with both ancient and modern genomes, D-statistics and Y-chromosome lineages. However, the limited nature of this introgression contrasts with the major Steppe migration turnovers within third Millennium northern Europe and echoes the survival of non-Indo-European language in Iberia. Changes in genomic estimates of individual height across Europe are also associated with these major cultural transitions, and ancestral components continue to correlate with modern differences in stature.
Missing value imputation for epistatic MAPs
LENUS (Irish Health Repository)
Ryan, Colm
2010-04-20
Abstract Background Epistatic miniarray profiling (E-MAPs) is a high-throughput approach capable of quantifying aggravating or alleviating genetic interactions between gene pairs. The datasets resulting from E-MAP experiments typically take the form of a symmetric pairwise matrix of interaction scores. These datasets have a significant number of missing values - up to 35% - that can reduce the effectiveness of some data analysis techniques and prevent the use of others. An effective method for imputing interactions would therefore increase the types of possible analysis, as well as increase the potential to identify novel functional interactions between gene pairs. Several methods have been developed to handle missing values in microarray data, but it is unclear how applicable these methods are to E-MAP data because of their pairwise nature and the significantly larger number of missing values. Here we evaluate four alternative imputation strategies, three local (Nearest neighbor-based) and one global (PCA-based), that have been modified to work with symmetric pairwise data. Results We identify different categories for the missing data based on their underlying cause, and show that values from the largest category can be imputed effectively. We compare local and global imputation approaches across a variety of distinct E-MAP datasets, showing that both are competitive and preferable to filling in with zeros. In addition we show that these methods are effective in an E-MAP from a different species, suggesting that pairwise imputation techniques will be increasingly useful as analogous epistasis mapping techniques are developed in different species. We show that strongly alleviating interactions are significantly more difficult to predict than strongly aggravating interactions. Finally we show that imputed interactions, generated using nearest neighbor methods, are enriched for annotations in the same manner as measured interactions. Therefore our method potentially
Public Undertakings and Imputability
DEFF Research Database (Denmark)
Ølykke, Grith Skovgaard
2013-01-01
In this article, the issue of impuability to the State of public undertakings’ decision-making is analysed and discussed in the context of the DSBFirst case. DSBFirst is owned by the independent public undertaking DSB and the private undertaking FirstGroup plc and won the contracts in the 2008...... Oeresund tender for the provision of passenger transport by railway. From the start, the services were provided at a loss, and in the end a part of DSBFirst was wound up. In order to frame the problems illustrated by this case, the jurisprudence-based imputability requirement in the definition of State aid...... in Article 107(1) TFEU is analysed. It is concluded that where the public undertaking transgresses the control system put in place by the State, conditions for imputability are not fulfilled, and it is argued that in the current state of law, there is no conditional link between the level of control...
Estimating the accuracy of geographical imputation
Directory of Open Access Journals (Sweden)
Boscoe Francis P
2008-01-01
Full Text Available Abstract Background To reduce the number of non-geocoded cases researchers and organizations sometimes include cases geocoded to postal code centroids along with cases geocoded with the greater precision of a full street address. Some analysts then use the postal code to assign information to the cases from finer-level geographies such as a census tract. Assignment is commonly completed using either a postal centroid or by a geographical imputation method which assigns a location by using both the demographic characteristics of the case and the population characteristics of the postal delivery area. To date no systematic evaluation of geographical imputation methods ("geo-imputation" has been completed. The objective of this study was to determine the accuracy of census tract assignment using geo-imputation. Methods Using a large dataset of breast, prostate and colorectal cancer cases reported to the New Jersey Cancer Registry, we determined how often cases were assigned to the correct census tract using alternate strategies of demographic based geo-imputation, and using assignments obtained from postal code centroids. Assignment accuracy was measured by comparing the tract assigned with the tract originally identified from the full street address. Results Assigning cases to census tracts using the race/ethnicity population distribution within a postal code resulted in more correctly assigned cases than when using postal code centroids. The addition of age characteristics increased the match rates even further. Match rates were highly dependent on both the geographic distribution of race/ethnicity groups and population density. Conclusion Geo-imputation appears to offer some advantages and no serious drawbacks as compared with the alternative of assigning cases to census tracts based on postal code centroids. For a specific analysis, researchers will still need to consider the potential impact of geocoding quality on their results and evaluate
Linear regression methods a ccording to objective functions
Yasemin Sisman; Sebahattin Bektas
2012-01-01
The aim of the study is to explain the parameter estimation methods and the regression analysis. The simple linear regressionmethods grouped according to the objective function are introduced. The numerical solution is achieved for the simple linear regressionmethods according to objective function of Least Squares and theLeast Absolute Value adjustment methods. The success of the appliedmethods is analyzed using their objective function values.
LinkImputeR: user-guided genotype calling and imputation for non-model organisms.
Money, Daniel; Migicovsky, Zoë; Gardner, Kyle; Myles, Sean
2017-07-10
Genomic studies such as genome-wide association and genomic selection require genome-wide genotype data. All existing technologies used to create these data result in missing genotypes, which are often then inferred using genotype imputation software. However, existing imputation methods most often make use only of genotypes that are successfully inferred after having passed a certain read depth threshold. Because of this, any read information for genotypes that did not pass the threshold, and were thus set to missing, is ignored. Most genomic studies also choose read depth thresholds and quality filters without investigating their effects on the size and quality of the resulting genotype data. Moreover, almost all genotype imputation methods require ordered markers and are therefore of limited utility in non-model organisms. Here we introduce LinkImputeR, a software program that exploits the read count information that is normally ignored, and makes use of all available DNA sequence information for the purposes of genotype calling and imputation. It is specifically designed for non-model organisms since it requires neither ordered markers nor a reference panel of genotypes. Using next-generation DNA sequence (NGS) data from apple, cannabis and grape, we quantify the effect of varying read count and missingness thresholds on the quantity and quality of genotypes generated from LinkImputeR. We demonstrate that LinkImputeR can increase the number of genotype calls by more than an order of magnitude, can improve genotyping accuracy by several percent and can thus improve the power of downstream analyses. Moreover, we show that the effects of quality and read depth filters can differ substantially between data sets and should therefore be investigated on a per-study basis. By exploiting DNA sequence data that is normally ignored during genotype calling and imputation, LinkImputeR can significantly improve both the quantity and quality of genotype data generated from
Comparing parametric and nonparametric regression methods for panel data
DEFF Research Database (Denmark)
Czekaj, Tomasz Gerard; Henningsen, Arne
We investigate and compare the suitability of parametric and non-parametric stochastic regression methods for analysing production technologies and the optimal firm size. Our theoretical analysis shows that the most commonly used functional forms in empirical production analysis, Cobb......-Douglas and Translog, are unsuitable for analysing the optimal firm size. We show that the Translog functional form implies an implausible linear relationship between the (logarithmic) firm size and the elasticity of scale, where the slope is artificially related to the substitutability between the inputs....... The practical applicability of the parametric and non-parametric regression methods is scrutinised and compared by an empirical example: we analyse the production technology and investigate the optimal size of Polish crop farms based on a firm-level balanced panel data set. A nonparametric specification test...
Cost reduction for web-based data imputation
Li, Zhixu
2014-01-01
Web-based Data Imputation enables the completion of incomplete data sets by retrieving absent field values from the Web. In particular, complete fields can be used as keywords in imputation queries for absent fields. However, due to the ambiguity of these keywords and the data complexity on the Web, different queries may retrieve different answers to the same absent field value. To decide the most probable right answer to each absent filed value, existing method issues quite a few available imputation queries for each absent value, and then vote on deciding the most probable right answer. As a result, we have to issue a large number of imputation queries for filling all absent values in an incomplete data set, which brings a large overhead. In this paper, we work on reducing the cost of Web-based Data Imputation in two aspects: First, we propose a query execution scheme which can secure the most probable right answer to an absent field value by issuing as few imputation queries as possible. Second, we recognize and prune queries that probably will fail to return any answers a priori. Our extensive experimental evaluation shows that our proposed techniques substantially reduce the cost of Web-based Imputation without hurting its high imputation accuracy. © 2014 Springer International Publishing Switzerland.
Fully conditional specification in multivariate imputation
van Buuren, S.; Brand, J. P.L.; Groothuis-Oudshoorn, C. G.M.; Rubin, D. B.
2006-01-01
The use of the Gibbs sampler with fully conditionally specified models, where the distribution of each variable given the other variables is the starting point, has become a popular method to create imputations in incomplete multivariate data. The theoretical weakness of this approach is that the
Sehgal, Muhammad Shoaib B; Gondal, Iqbal; Dooley, Laurence S
2005-05-15
Microarray data are used in a range of application areas in biology, although often it contains considerable numbers of missing values. These missing values can significantly affect subsequent statistical analysis and machine learning algorithms so there is a strong motivation to estimate these values as accurately as possible before using these algorithms. While many imputation algorithms have been proposed, more robust techniques need to be developed so that further analysis of biological data can be accurately undertaken. In this paper, an innovative missing value imputation algorithm called collateral missing value estimation (CMVE) is presented which uses multiple covariance-based imputation matrices for the final prediction of missing values. The matrices are computed and optimized using least square regression and linear programming methods. The new CMVE algorithm has been compared with existing estimation techniques including Bayesian principal component analysis imputation (BPCA), least square impute (LSImpute) and K-nearest neighbour (KNN). All these methods were rigorously tested to estimate missing values in three separate non-time series (ovarian cancer based) and one time series (yeast sporulation) dataset. Each method was quantitatively analyzed using the normalized root mean square (NRMS) error measure, covering a wide range of randomly introduced missing value probabilities from 0.01 to 0.2. Experiments were also undertaken on the yeast dataset, which comprised 1.7% actual missing values, to test the hypothesis that CMVE performed better not only for randomly occurring but also for a real distribution of missing values. The results confirmed that CMVE consistently demonstrated superior and robust estimation capability of missing values compared with other methods for both series types of data, for the same order of computational complexity. A concise theoretical framework has also been formulated to validate the improved performance of the CMVE
FATAL, General Experiment Fitting Program by Nonlinear Regression Method
International Nuclear Information System (INIS)
Salmon, L.; Budd, T.; Marshall, M.
1982-01-01
1 - Description of problem or function: A generalized fitting program with a free-format keyword interface to the user. It permits experimental data to be fitted by non-linear regression methods to any function describable by the user. The user requires the minimum of computer experience but needs to provide a subroutine to define his function. Some statistical output is included as well as 'best' estimates of the function's parameters. 2 - Method of solution: The regression method used is based on a minimization technique devised by Powell (Harwell Subroutine Library VA05A, 1972) which does not require the use of analytical derivatives. The method employs a quasi-Newton procedure balanced with a steepest descent correction. Experience shows this to be efficient for a very wide range of application. 3 - Restrictions on the complexity of the problem: The current version of the program permits functions to be defined with up to 20 parameters. The function may be fitted to a maximum of 400 points, preferably with estimated values of weight given
Factors associated with low birth weight in Nepal using multiple imputation
Directory of Open Access Journals (Sweden)
Usha Singh
2017-02-01
Full Text Available Abstract Background Survey data from low income countries on birth weight usually pose a persistent problem. The studies conducted on birth weight have acknowledged missing data on birth weight, but they are not included in the analysis. Furthermore, other missing data presented on determinants of birth weight are not addressed. Thus, this study tries to identify determinants that are associated with low birth weight (LBW using multiple imputation to handle missing data on birth weight and its determinants. Methods The child dataset from Nepal Demographic and Health Survey (NDHS, 2011 was utilized in this study. A total of 5,240 children were born between 2006 and 2011, out of which 87% had at least one measured variable missing and 21% had no recorded birth weight. All the analyses were carried out in R version 3.1.3. Transform-then impute method was applied to check for interaction between explanatory variables and imputed missing data. Survey package was applied to each imputed dataset to account for survey design and sampling method. Survey logistic regression was applied to identify the determinants associated with LBW. Results The prevalence of LBW was 15.4% after imputation. Women with the highest autonomy on their own health compared to those with health decisions involving husband or others (adjusted odds ratio (OR 1.87, 95% confidence interval (95% CI = 1.31, 2.67, and husband and women together (adjusted OR 1.57, 95% CI = 1.05, 2.35 were less likely to give birth to LBW infants. Mothers using highly polluting cooking fuels (adjusted OR 1.49, 95% CI = 1.03, 2.22 were more likely to give birth to LBW infants than mothers using non-polluting cooking fuels. Conclusion The findings of this study suggested that obtaining the prevalence of LBW from only the sample of measured birth weight and ignoring missing data results in underestimation.
Mapping urban environmental noise: a land use regression method.
Xie, Dan; Liu, Yi; Chen, Jining
2011-09-01
Forecasting and preventing urban noise pollution are major challenges in urban environmental management. Most existing efforts, including experiment-based models, statistical models, and noise mapping, however, have limited capacity to explain the association between urban growth and corresponding noise change. Therefore, these conventional methods can hardly forecast urban noise at a given outlook of development layout. This paper, for the first time, introduces a land use regression method, which has been applied for simulating urban air quality for a decade, to construct an urban noise model (LUNOS) in Dalian Municipality, Northwest China. The LUNOS model describes noise as a dependent variable of surrounding various land areas via a regressive function. The results suggest that a linear model performs better in fitting monitoring data, and there is no significant difference of the LUNOS's outputs when applied to different spatial scales. As the LUNOS facilitates a better understanding of the association between land use and urban environmental noise in comparison to conventional methods, it can be regarded as a promising tool for noise prediction for planning purposes and aid smart decision-making.
Dimension Reduction and Discretization in Stochastic Problems by Regression Method
DEFF Research Database (Denmark)
Ditlevsen, Ove Dalager
1996-01-01
The chapter mainly deals with dimension reduction and field discretizations based directly on the concept of linear regression. Several examples of interesting applications in stochastic mechanics are also given.Keywords: Random fields discretization, Linear regression, Stochastic interpolation, ...
Analyzing Big Data with the Hybrid Interval Regression Methods
Directory of Open Access Journals (Sweden)
Chia-Hui Huang
2014-01-01
Full Text Available Big data is a new trend at present, forcing the significant impacts on information technologies. In big data applications, one of the most concerned issues is dealing with large-scale data sets that often require computation resources provided by public cloud services. How to analyze big data efficiently becomes a big challenge. In this paper, we collaborate interval regression with the smooth support vector machine (SSVM to analyze big data. Recently, the smooth support vector machine (SSVM was proposed as an alternative of the standard SVM that has been proved more efficient than the traditional SVM in processing large-scale data. In addition the soft margin method is proposed to modify the excursion of separation margin and to be effective in the gray zone that the distribution of data becomes hard to be described and the separation margin between classes.
DEFF Research Database (Denmark)
Sharifzadeh, Sara; Skytte, Jacob Lercke; Nielsen, Otto Højager Attermann
2012-01-01
Statistical solutions find wide spread use in food and medicine quality control. We investigate the effect of different regression and sparse regression methods for a viscosity estimation problem using the spectro-temporal features from new Sub-Surface Laser Scattering (SLS) vision system. From...... with sparse LAR, lasso and Elastic Net (EN) sparse regression methods. Due to the inconsistent measurement condition, Locally Weighted Scatter plot Smoothing (Loess) has been employed to alleviate the undesired variation in the estimated viscosity. The experimental results of applying different methods show...
Donald Gagliasso; Susan Hummel; Hailemariam. Temesgen
2014-01-01
Various methods have been used to estimate the amount of above ground forest biomass across landscapes and to create biomass maps for specific stands or pixels across ownership or project areas. Without an accurate estimation method, land managers might end up with incorrect biomass estimate maps, which could lead them to make poorer decisions in their future...
Song, Minsun; Wheeler, William; Caporaso, Neil E; Landi, Maria Teresa; Chatterjee, Nilanjan
2018-03-01
Genome-wide association studies (GWAS) are now routinely imputed for untyped single nucleotide polymorphisms (SNPs) based on various powerful statistical algorithms for imputation trained on reference datasets. The use of predicted allele counts for imputed SNPs as the dosage variable is known to produce valid score test for genetic association. In this paper, we investigate how to best handle imputed SNPs in various modern complex tests for genetic associations incorporating gene-environment interactions. We focus on case-control association studies where inference for an underlying logistic regression model can be performed using alternative methods that rely on varying degree on an assumption of gene-environment independence in the underlying population. As increasingly large-scale GWAS are being performed through consortia effort where it is preferable to share only summary-level information across studies, we also describe simple mechanisms for implementing score tests based on standard meta-analysis of "one-step" maximum-likelihood estimates across studies. Applications of the methods in simulation studies and a dataset from GWAS of lung cancer illustrate ability of the proposed methods to maintain type-I error rates for the underlying testing procedures. For analysis of imputed SNPs, similar to typed SNPs, the retrospective methods can lead to considerable efficiency gain for modeling of gene-environment interactions under the assumption of gene-environment independence. Methods are made available for public use through CGEN R software package. © 2017 WILEY PERIODICALS, INC.
Clustering with Missing Values: No Imputation Required
Wagstaff, Kiri
2004-01-01
Clustering algorithms can identify groups in large data sets, such as star catalogs and hyperspectral images. In general, clustering methods cannot analyze items that have missing data values. Common solutions either fill in the missing values (imputation) or ignore the missing data (marginalization). Imputed values are treated as just as reliable as the truly observed data, but they are only as good as the assumptions used to create them. In contrast, we present a method for encoding partially observed features as a set of supplemental soft constraints and introduce the KSC algorithm, which incorporates constraints into the clustering process. In experiments on artificial data and data from the Sloan Digital Sky Survey, we show that soft constraints are an effective way to enable clustering with missing values.
Multiple Improvements of Multiple Imputation Likelihood Ratio Tests
Chan, Kin Wai; Meng, Xiao-Li
2017-01-01
Multiple imputation (MI) inference handles missing data by first properly imputing the missing values $m$ times, and then combining the $m$ analysis results from applying a complete-data procedure to each of the completed datasets. However, the existing method for combining likelihood ratio tests has multiple defects: (i) the combined test statistic can be negative in practice when the reference null distribution is a standard $F$ distribution; (ii) it is not invariant to re-parametrization; ...
Bootstrap inference when using multiple imputation.
Schomaker, Michael; Heumann, Christian
2018-04-16
Many modern estimators require bootstrapping to calculate confidence intervals because either no analytic standard error is available or the distribution of the parameter of interest is nonsymmetric. It remains however unclear how to obtain valid bootstrap inference when dealing with multiple imputation to address missing data. We present 4 methods that are intuitively appealing, easy to implement, and combine bootstrap estimation with multiple imputation. We show that 3 of the 4 approaches yield valid inference, but that the performance of the methods varies with respect to the number of imputed data sets and the extent of missingness. Simulation studies reveal the behavior of our approaches in finite samples. A topical analysis from HIV treatment research, which determines the optimal timing of antiretroviral treatment initiation in young children, demonstrates the practical implications of the 4 methods in a sophisticated and realistic setting. This analysis suffers from missing data and uses the g-formula for inference, a method for which no standard errors are available. Copyright © 2018 John Wiley & Sons, Ltd.
Missing Data Imputation of Solar Radiation Data under Different Atmospheric Conditions
Turrado, Concepción Crespo; López, María del Carmen Meizoso; Lasheras, Fernando Sánchez; Gómez, Benigno Antonio Rodríguez; Rollé, José Luis Calvo; de Cos Juez, Francisco Javier
2014-01-01
Global solar broadband irradiance on a planar surface is measured at weather stations by pyranometers. In the case of the present research, solar radiation values from nine meteorological stations of the MeteoGalicia real-time observational network, captured and stored every ten minutes, are considered. In this kind of record, the lack of data and/or the presence of wrong values adversely affects any time series study. Consequently, when this occurs, a data imputation process must be performed in order to replace missing data with estimated values. This paper aims to evaluate the multivariate imputation of ten-minute scale data by means of the chained equations method (MICE). This method allows the network itself to impute the missing or wrong data of a solar radiation sensor, by using either all or just a group of the measurements of the remaining sensors. Very good results have been obtained with the MICE method in comparison with other methods employed in this field such as Inverse Distance Weighting (IDW) and Multiple Linear Regression (MLR). The average RMSE value of the predictions for the MICE algorithm was 13.37% while that for the MLR it was 28.19%, and 31.68% for the IDW. PMID:25356644
Missing Data Imputation of Solar Radiation Data under Different Atmospheric Conditions
Directory of Open Access Journals (Sweden)
Concepción Crespo Turrado
2014-10-01
Full Text Available Global solar broadband irradiance on a planar surface is measured at weather stations by pyranometers. In the case of the present research, solar radiation values from nine meteorological stations of the MeteoGalicia real-time observational network, captured and stored every ten minutes, are considered. In this kind of record, the lack of data and/or the presence of wrong values adversely affects any time series study. Consequently, when this occurs, a data imputation process must be performed in order to replace missing data with estimated values. This paper aims to evaluate the multivariate imputation of ten-minute scale data by means of the chained equations method (MICE. This method allows the network itself to impute the missing or wrong data of a solar radiation sensor, by using either all or just a group of the measurements of the remaining sensors. Very good results have been obtained with the MICE method in comparison with other methods employed in this field such as Inverse Distance Weighting (IDW and Multiple Linear Regression (MLR. The average RMSE value of the predictions for the MICE algorithm was 13.37% while that for the MLR it was 28.19%, and 31.68% for the IDW.
Missing data imputation of solar radiation data under different atmospheric conditions.
Turrado, Concepción Crespo; López, María Del Carmen Meizoso; Lasheras, Fernando Sánchez; Gómez, Benigno Antonio Rodríguez; Rollé, José Luis Calvo; Juez, Francisco Javier de Cos
2014-10-29
Global solar broadband irradiance on a planar surface is measured at weather stations by pyranometers. In the case of the present research, solar radiation values from nine meteorological stations of the MeteoGalicia real-time observational network, captured and stored every ten minutes, are considered. In this kind of record, the lack of data and/or the presence of wrong values adversely affects any time series study. Consequently, when this occurs, a data imputation process must be performed in order to replace missing data with estimated values. This paper aims to evaluate the multivariate imputation of ten-minute scale data by means of the chained equations method (MICE). This method allows the network itself to impute the missing or wrong data of a solar radiation sensor, by using either all or just a group of the measurements of the remaining sensors. Very good results have been obtained with the MICE method in comparison with other methods employed in this field such as Inverse Distance Weighting (IDW) and Multiple Linear Regression (MLR). The average RMSE value of the predictions for the MICE algorithm was 13.37% while that for the MLR it was 28.19%, and 31.68% for the IDW.
Methods of Detecting Outliers in A Regression Analysis Model ...
African Journals Online (AJOL)
PROF. O. E. OSUAGWU
2013-06-01
Jun 1, 2013 ... especially true in observational studies .... Simple linear regression and multiple ... The simple linear ..... Grubbs,F.E (1950): Sample Criteria for Testing Outlying observations: Annals of ... In experimental design, the Relative.
Multiply-Imputed Synthetic Data: Advice to the Imputer
Directory of Open Access Journals (Sweden)
Loong Bronwyn
2017-12-01
Full Text Available Several statistical agencies have started to use multiply-imputed synthetic microdata to create public-use data in major surveys. The purpose of doing this is to protect the confidentiality of respondents’ identities and sensitive attributes, while allowing standard complete-data analyses of microdata. A key challenge, faced by advocates of synthetic data, is demonstrating that valid statistical inferences can be obtained from such synthetic data for non-confidential questions. Large discrepancies between observed-data and synthetic-data analytic results for such questions may arise because of uncongeniality; that is, differences in the types of inputs available to the imputer, who has access to the actual data, and to the analyst, who has access only to the synthetic data. Here, we discuss a simple, but possibly canonical, example of uncongeniality when using multiple imputation to create synthetic data, which specifically addresses the choices made by the imputer. An initial, unanticipated but not surprising, conclusion is that non-confidential design information used to impute synthetic data should be released with the confidential synthetic data to allow users of synthetic data to avoid possible grossly conservative inferences.
Analysis of some methods for reduced rank Gaussian process regression
DEFF Research Database (Denmark)
Quinonero-Candela, J.; Rasmussen, Carl Edward
2005-01-01
While there is strong motivation for using Gaussian Processes (GPs) due to their excellent performance in regression and classification problems, their computational complexity makes them impractical when the size of the training set exceeds a few thousand cases. This has motivated the recent...... proliferation of a number of cost-effective approximations to GPs, both for classification and for regression. In this paper we analyze one popular approximation to GPs for regression: the reduced rank approximation. While generally GPs are equivalent to infinite linear models, we show that Reduced Rank...... Gaussian Processes (RRGPs) are equivalent to finite sparse linear models. We also introduce the concept of degenerate GPs and show that they correspond to inappropriate priors. We show how to modify the RRGP to prevent it from being degenerate at test time. Training RRGPs consists both in learning...
A web-based approach to data imputation
Li, Zhixu
2013-10-24
In this paper, we present WebPut, a prototype system that adopts a novel web-based approach to the data imputation problem. Towards this, Webput utilizes the available information in an incomplete database in conjunction with the data consistency principle. Moreover, WebPut extends effective Information Extraction (IE) methods for the purpose of formulating web search queries that are capable of effectively retrieving missing values with high accuracy. WebPut employs a confidence-based scheme that efficiently leverages our suite of data imputation queries to automatically select the most effective imputation query for each missing value. A greedy iterative algorithm is proposed to schedule the imputation order of the different missing values in a database, and in turn the issuing of their corresponding imputation queries, for improving the accuracy and efficiency of WebPut. Moreover, several optimization techniques are also proposed to reduce the cost of estimating the confidence of imputation queries at both the tuple-level and the database-level. Experiments based on several real-world data collections demonstrate not only the effectiveness of WebPut compared to existing approaches, but also the efficiency of our proposed algorithms and optimization techniques. © 2013 Springer Science+Business Media New York.
Chen, Carla Chia-Ming; Schwender, Holger; Keith, Jonathan; Nunkesser, Robin; Mengersen, Kerrie; Macrossan, Paula
2011-01-01
Due to advancements in computational ability, enhanced technology and a reduction in the price of genotyping, more data are being generated for understanding genetic associations with diseases and disorders. However, with the availability of large data sets comes the inherent challenges of new methods of statistical analysis and modeling. Considering a complex phenotype may be the effect of a combination of multiple loci, various statistical methods have been developed for identifying genetic epistasis effects. Among these methods, logic regression (LR) is an intriguing approach incorporating tree-like structures. Various methods have built on the original LR to improve different aspects of the model. In this study, we review four variations of LR, namely Logic Feature Selection, Monte Carlo Logic Regression, Genetic Programming for Association Studies, and Modified Logic Regression-Gene Expression Programming, and investigate the performance of each method using simulated and real genotype data. We contrast these with another tree-like approach, namely Random Forests, and a Bayesian logistic regression with stochastic search variable selection.
DEFF Research Database (Denmark)
Kirkeby, Carsten Thure; Hisham Beshara Halasa, Tariq; Gussmann, Maya Katrin
2017-01-01
the transmission rate. We use data from the two simulation models and vary the sampling intervals and the size of the population sampled. We devise two new methods to determine transmission rate, and compare these to the frequently used Poisson regression method in both epidemic and endemic situations. For most...... tested scenarios these new methods perform similar or better than Poisson regression, especially in the case of long sampling intervals. We conclude that transmission rate estimates are easily biased, which is important to take into account when using these rates in simulation models....
Helmreich, James E.; Krog, K. Peter
2018-01-01
We present a short, inquiry-based learning course on concepts and methods underlying ordinary least squares (OLS), least absolute deviation (LAD), and quantile regression (QR). Students investigate squared, absolute, and weighted absolute distance functions (metrics) as location measures. Using differential calculus and properties of convex…
Flexible Imputation of Missing Data
van Buuren, Stef
2012-01-01
Missing data form a problem in every scientific discipline, yet the techniques required to handle them are complicated and often lacking. One of the great ideas in statistical science--multiple imputation--fills gaps in the data with plausible values, the uncertainty of which is coded in the data itself. It also solves other problems, many of which are missing data problems in disguise. Flexible Imputation of Missing Data is supported by many examples using real data taken from the author's vast experience of collaborative research, and presents a practical guide for handling missing data unde
Seaman, Shaun R; Hughes, Rachael A
2018-06-01
Estimating the parameters of a regression model of interest is complicated by missing data on the variables in that model. Multiple imputation is commonly used to handle these missing data. Joint model multiple imputation and full-conditional specification multiple imputation are known to yield imputed data with the same asymptotic distribution when the conditional models of full-conditional specification are compatible with that joint model. We show that this asymptotic equivalence of imputation distributions does not imply that joint model multiple imputation and full-conditional specification multiple imputation will also yield asymptotically equally efficient inference about the parameters of the model of interest, nor that they will be equally robust to misspecification of the joint model. When the conditional models used by full-conditional specification multiple imputation are linear, logistic and multinomial regressions, these are compatible with a restricted general location joint model. We show that multiple imputation using the restricted general location joint model can be substantially more asymptotically efficient than full-conditional specification multiple imputation, but this typically requires very strong associations between variables. When associations are weaker, the efficiency gain is small. Moreover, full-conditional specification multiple imputation is shown to be potentially much more robust than joint model multiple imputation using the restricted general location model to mispecification of that model when there is substantial missingness in the outcome variable.
Differential network analysis with multiply imputed lipidomic data.
Directory of Open Access Journals (Sweden)
Maiju Kujala
Full Text Available The importance of lipids for cell function and health has been widely recognized, e.g., a disorder in the lipid composition of cells has been related to atherosclerosis caused cardiovascular disease (CVD. Lipidomics analyses are characterized by large yet not a huge number of mutually correlated variables measured and their associations to outcomes are potentially of a complex nature. Differential network analysis provides a formal statistical method capable of inferential analysis to examine differences in network structures of the lipids under two biological conditions. It also guides us to identify potential relationships requiring further biological investigation. We provide a recipe to conduct permutation test on association scores resulted from partial least square regression with multiple imputed lipidomic data from the LUdwigshafen RIsk and Cardiovascular Health (LURIC study, particularly paying attention to the left-censored missing values typical for a wide range of data sets in life sciences. Left-censored missing values are low-level concentrations that are known to exist somewhere between zero and a lower limit of quantification. To make full use of the LURIC data with the missing values, we utilize state of the art multiple imputation techniques and propose solutions to the challenges that incomplete data sets bring to differential network analysis. The customized network analysis helps us to understand the complexities of the underlying biological processes by identifying lipids and lipid classes that interact with each other, and by recognizing the most important differentially expressed lipids between two subgroups of coronary artery disease (CAD patients, the patients that had a fatal CVD event and the ones who remained stable during two year follow-up.
Finding-equal regression method and its application in predication of U resources
International Nuclear Information System (INIS)
Cao Huimo
1995-03-01
The commonly adopted deposit model method in mineral resources predication has two main part: one is model data that show up geological mineralization law for deposit, the other is statistics predication method that accords with characters of the data namely pretty regression method. This kind of regression method may be called finding-equal regression, which is made of the linear regression and distribution finding-equal method. Because distribution finding-equal method is a data pretreatment which accords with advanced mathematical precondition for the linear regression namely equal distribution theory, and this kind of data pretreatment is possible of realization. Therefore finding-equal regression not only can overcome nonlinear limitations, that are commonly occurred in traditional linear regression or other regression and always have no solution, but also can distinguish outliers and eliminate its weak influence, which would usually appeared when Robust regression possesses outlier in independent variables. Thus this newly finding-equal regression stands the best status in all kind of regression methods. Finally, two good examples of U resource quantitative predication are provided
Data imputation analysis for Cosmic Rays time series
Fernandes, R. C.; Lucio, P. S.; Fernandez, J. H.
2017-05-01
The occurrence of missing data concerning Galactic Cosmic Rays time series (GCR) is inevitable since loss of data is due to mechanical and human failure or technical problems and different periods of operation of GCR stations. The aim of this study was to perform multiple dataset imputation in order to depict the observational dataset. The study has used the monthly time series of GCR Climax (CLMX) and Roma (ROME) from 1960 to 2004 to simulate scenarios of 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80% and 90% of missing data compared to observed ROME series, with 50 replicates. Then, the CLMX station as a proxy for allocation of these scenarios was used. Three different methods for monthly dataset imputation were selected: AMÉLIA II - runs the bootstrap Expectation Maximization algorithm, MICE - runs an algorithm via Multivariate Imputation by Chained Equations and MTSDI - an Expectation Maximization algorithm-based method for imputation of missing values in multivariate normal time series. The synthetic time series compared with the observed ROME series has also been evaluated using several skill measures as such as RMSE, NRMSE, Agreement Index, R, R2, F-test and t-test. The results showed that for CLMX and ROME, the R2 and R statistics were equal to 0.98 and 0.96, respectively. It was observed that increases in the number of gaps generate loss of quality of the time series. Data imputation was more efficient with MTSDI method, with negligible errors and best skill coefficients. The results suggest a limit of about 60% of missing data for imputation, for monthly averages, no more than this. It is noteworthy that CLMX, ROME and KIEL stations present no missing data in the target period. This methodology allowed reconstructing 43 time series.
Missing value imputation: with application to handwriting data
Xu, Zhen; Srihari, Sargur N.
2015-01-01
Missing values make pattern analysis difficult, particularly with limited available data. In longitudinal research, missing values accumulate, thereby aggravating the problem. Here we consider how to deal with temporal data with missing values in handwriting analysis. In the task of studying development of individuality of handwriting, we encountered the fact that feature values are missing for several individuals at several time instances. Six algorithms, i.e., random imputation, mean imputation, most likely independent value imputation, and three methods based on Bayesian network (static Bayesian network, parameter EM, and structural EM), are compared with children's handwriting data. We evaluate the accuracy and robustness of the algorithms under different ratios of missing data and missing values, and useful conclusions are given. Specifically, static Bayesian network is used for our data which contain around 5% missing data to provide adequate accuracy and low computational cost.
Ondeck, Nathaniel T; Fu, Michael C; Skrip, Laura A; McLynn, Ryan P; Su, Edwin P; Grauer, Jonathan N
2018-03-01
Despite the advantages of large, national datasets, one continuing concern is missing data values. Complete case analysis, where only cases with complete data are analyzed, is commonly used rather than more statistically rigorous approaches such as multiple imputation. This study characterizes the potential selection bias introduced using complete case analysis and compares the results of common regressions using both techniques following unicompartmental knee arthroplasty. Patients undergoing unicompartmental knee arthroplasty were extracted from the 2005 to 2015 National Surgical Quality Improvement Program. As examples, the demographics of patients with and without missing preoperative albumin and hematocrit values were compared. Missing data were then treated with both complete case analysis and multiple imputation (an approach that reproduces the variation and associations that would have been present in a full dataset) and the conclusions of common regressions for adverse outcomes were compared. A total of 6117 patients were included, of which 56.7% were missing at least one value. Younger, female, and healthier patients were more likely to have missing preoperative albumin and hematocrit values. The use of complete case analysis removed 3467 patients from the study in comparison with multiple imputation which included all 6117 patients. The 2 methods of handling missing values led to differing associations of low preoperative laboratory values with commonly studied adverse outcomes. The use of complete case analysis can introduce selection bias and may lead to different conclusions in comparison with the statistically rigorous multiple imputation approach. Joint surgeons should consider the methods of handling missing values when interpreting arthroplasty research. Copyright © 2017 Elsevier Inc. All rights reserved.
Luo, Yuan; Szolovits, Peter; Dighe, Anand S; Baron, Jason M
2018-06-01
A key challenge in clinical data mining is that most clinical datasets contain missing data. Since many commonly used machine learning algorithms require complete datasets (no missing data), clinical analytic approaches often entail an imputation procedure to "fill in" missing data. However, although most clinical datasets contain a temporal component, most commonly used imputation methods do not adequately accommodate longitudinal time-based data. We sought to develop a new imputation algorithm, 3-dimensional multiple imputation with chained equations (3D-MICE), that can perform accurate imputation of missing clinical time series data. We extracted clinical laboratory test results for 13 commonly measured analytes (clinical laboratory tests). We imputed missing test results for the 13 analytes using 3 imputation methods: multiple imputation with chained equations (MICE), Gaussian process (GP), and 3D-MICE. 3D-MICE utilizes both MICE and GP imputation to integrate cross-sectional and longitudinal information. To evaluate imputation method performance, we randomly masked selected test results and imputed these masked results alongside results missing from our original data. We compared predicted results to measured results for masked data points. 3D-MICE performed significantly better than MICE and GP-based imputation in a composite of all 13 analytes, predicting missing results with a normalized root-mean-square error of 0.342, compared to 0.373 for MICE alone and 0.358 for GP alone. 3D-MICE offers a novel and practical approach to imputing clinical laboratory time series data. 3D-MICE may provide an additional tool for use as a foundation in clinical predictive analytics and intelligent clinical decision support.
Using imputation to provide location information for nongeocoded addresses.
Directory of Open Access Journals (Sweden)
Frank C Curriero
2010-02-01
Full Text Available The importance of geography as a source of variation in health research continues to receive sustained attention in the literature. The inclusion of geographic information in such research often begins by adding data to a map which is predicated by some knowledge of location. A precise level of spatial information is conventionally achieved through geocoding, the geographic information system (GIS process of translating mailing address information to coordinates on a map. The geocoding process is not without its limitations, though, since there is always a percentage of addresses which cannot be converted successfully (nongeocodable. This raises concerns regarding bias since traditionally the practice has been to exclude nongeocoded data records from analysis.In this manuscript we develop and evaluate a set of imputation strategies for dealing with missing spatial information from nongeocoded addresses. The strategies are developed assuming a known zip code with increasing use of collateral information, namely the spatial distribution of the population at risk. Strategies are evaluated using prostate cancer data obtained from the Maryland Cancer Registry. We consider total case enumerations at the Census county, tract, and block group level as the outcome of interest when applying and evaluating the methods. Multiple imputation is used to provide estimated total case counts based on complete data (geocodes plus imputed nongeocodes with a measure of uncertainty. Results indicate that the imputation strategy based on using available population-based age, gender, and race information performed the best overall at the county, tract, and block group levels.The procedure allows for the potentially biased and likely under reported outcome, case enumerations based on only the geocoded records, to be presented with a statistically adjusted count (imputed count with a measure of uncertainty that are based on all the case data, the geocodes and imputed
Directory of Open Access Journals (Sweden)
Hailun Wang
2017-01-01
Full Text Available Support vector regression algorithm is widely used in fault diagnosis of rolling bearing. A new model parameter selection method for support vector regression based on adaptive fusion of the mixed kernel function is proposed in this paper. We choose the mixed kernel function as the kernel function of support vector regression. The mixed kernel function of the fusion coefficients, kernel function parameters, and regression parameters are combined together as the parameters of the state vector. Thus, the model selection problem is transformed into a nonlinear system state estimation problem. We use a 5th-degree cubature Kalman filter to estimate the parameters. In this way, we realize the adaptive selection of mixed kernel function weighted coefficients and the kernel parameters, the regression parameters. Compared with a single kernel function, unscented Kalman filter (UKF support vector regression algorithms, and genetic algorithms, the decision regression function obtained by the proposed method has better generalization ability and higher prediction accuracy.
Directory of Open Access Journals (Sweden)
Luciana Neves Nunes
2010-12-01
Full Text Available INTRODUÇÃO: A perda de informações é um problema frequente em estudos realizados na área da Saúde. Na literatura essa perda é chamada de missing data ou dados faltantes. Através da imputação dos dados faltantes são criados conjuntos de dados artificialmente completos que podem ser analisados por técnicas estatísticas tradicionais. O objetivo desse artigo foi comparar, em um exemplo baseado em dados reais, a utilização de três técnicas de imputações diferentes. MÉTODO: Os dados utilizados referem-se a um estudo de desenvolvimento de modelo de risco cirúrgico, sendo que o tamanho da amostra foi de 450 pacientes. Os métodos de imputação empregados foram duas imputações únicas e uma imputação múltipla (IM, e a suposição sobre o mecanismo de não-resposta foi MAR (Missing at Random. RESULTADOS: A variável com dados faltantes foi a albumina sérica, com 27,1% de perda. Os modelos obtidos pelas imputações únicas foram semelhantes entre si, mas diferentes dos obtidos com os dados imputados pela IM quanto à inclusão de variáveis nos modelos. CONCLUSÕES: Os resultados indicam que faz diferença levar em conta a relação da albumina com outras variáveis observadas, pois foram obtidos modelos diferentes nas imputações única e múltipla. A imputação única subestima a variabilidade, gerando intervalos de confiança mais estreitos. É importante se considerar o uso de métodos de imputação quando há dados faltantes, especialmente a IM que leva em conta a variabilidade entre imputações para as estimativas do modelo.INTRODUCTION: It is common for studies in health to face problems with missing data. Through imputation, complete data sets are built artificially and can be analyzed by traditional statistical analysis. The objective of this paper is to compare three types of imputation based on real data. METHODS: The data used came from a study on the development of risk models for surgical mortality. The
Outlier Removal in Model-Based Missing Value Imputation for Medical Datasets
Directory of Open Access Journals (Sweden)
Min-Wei Huang
2018-01-01
Full Text Available Many real-world medical datasets contain some proportion of missing (attribute values. In general, missing value imputation can be performed to solve this problem, which is to provide estimations for the missing values by a reasoning process based on the (complete observed data. However, if the observed data contain some noisy information or outliers, the estimations of the missing values may not be reliable or may even be quite different from the real values. The aim of this paper is to examine whether a combination of instance selection from the observed data and missing value imputation offers better performance than performing missing value imputation alone. In particular, three instance selection algorithms, DROP3, GA, and IB3, and three imputation algorithms, KNNI, MLP, and SVM, are used in order to find out the best combination. The experimental results show that that performing instance selection can have a positive impact on missing value imputation over the numerical data type of medical datasets, and specific combinations of instance selection and imputation methods can improve the imputation results over the mixed data type of medical datasets. However, instance selection does not have a definitely positive impact on the imputation result for categorical medical datasets.
Synthetic Multiple-Imputation Procedure for Multistage Complex Samples
Directory of Open Access Journals (Sweden)
Zhou Hanzhi
2016-03-01
Full Text Available Multiple imputation (MI is commonly used when item-level missing data are present. However, MI requires that survey design information be built into the imputation models. For multistage stratified clustered designs, this requires dummy variables to represent strata as well as primary sampling units (PSUs nested within each stratum in the imputation model. Such a modeling strategy is not only operationally burdensome but also inferentially inefficient when there are many strata in the sample design. Complexity only increases when sampling weights need to be modeled. This article develops a generalpurpose analytic strategy for population inference from complex sample designs with item-level missingness. In a simulation study, the proposed procedures demonstrate efficient estimation and good coverage properties. We also consider an application to accommodate missing body mass index (BMI data in the analysis of BMI percentiles using National Health and Nutrition Examination Survey (NHANES III data. We argue that the proposed methods offer an easy-to-implement solution to problems that are not well-handled by current MI techniques. Note that, while the proposed method borrows from the MI framework to develop its inferential methods, it is not designed as an alternative strategy to release multiply imputed datasets for complex sample design data, but rather as an analytic strategy in and of itself.
The Use of Nonparametric Kernel Regression Methods in Econometric Production Analysis
DEFF Research Database (Denmark)
Czekaj, Tomasz Gerard
and nonparametric estimations of production functions in order to evaluate the optimal firm size. The second paper discusses the use of parametric and nonparametric regression methods to estimate panel data regression models. The third paper analyses production risk, price uncertainty, and farmers' risk preferences...... within a nonparametric panel data regression framework. The fourth paper analyses the technical efficiency of dairy farms with environmental output using nonparametric kernel regression in a semiparametric stochastic frontier analysis. The results provided in this PhD thesis show that nonparametric......This PhD thesis addresses one of the fundamental problems in applied econometric analysis, namely the econometric estimation of regression functions. The conventional approach to regression analysis is the parametric approach, which requires the researcher to specify the form of the regression...
Imputation of the rare HOXB13 G84E mutation and cancer risk in a large population-based cohort.
Directory of Open Access Journals (Sweden)
Thomas J Hoffmann
2015-01-01
Full Text Available An efficient approach to characterizing the disease burden of rare genetic variants is to impute them into large well-phenotyped cohorts with existing genome-wide genotype data using large sequenced referenced panels. The success of this approach hinges on the accuracy of rare variant imputation, which remains controversial. For example, a recent study suggested that one cannot adequately impute the HOXB13 G84E mutation associated with prostate cancer risk (carrier frequency of 0.0034 in European ancestry participants in the 1000 Genomes Project. We show that by utilizing the 1000 Genomes Project data plus an enriched reference panel of mutation carriers we were able to accurately impute the G84E mutation into a large cohort of 83,285 non-Hispanic White participants from the Kaiser Permanente Research Program on Genes, Environment and Health Genetic Epidemiology Research on Adult Health and Aging cohort. Imputation authenticity was confirmed via a novel classification and regression tree method, and then empirically validated analyzing a subset of these subjects plus an additional 1,789 men from Kaiser specifically genotyped for the G84E mutation (r2 = 0.57, 95% CI = 0.37–0.77. We then show the value of this approach by using the imputed data to investigate the impact of the G84E mutation on age-specific prostate cancer risk and on risk of fourteen other cancers in the cohort. The age-specific risk of prostate cancer among G84E mutation carriers was higher than among non-carriers. Risk estimates from Kaplan-Meier curves were 36.7% versus 13.6% by age 72, and 64.2% versus 24.2% by age 80, for G84E mutation carriers and non-carriers, respectively (p = 3.4x10-12. The G84E mutation was also associated with an increase in risk for the fourteen other most common cancers considered collectively (p = 5.8x10-4 and more so in cases diagnosed with multiple cancer types, both those including and not including prostate cancer, strongly suggesting
Easy methods for extracting individual regression slopes: Comparing SPSS, R, and Excel
Directory of Open Access Journals (Sweden)
Roland Pfister
2013-10-01
Full Text Available Three different methods for extracting coefficientsof linear regression analyses are presented. The focus is on automatic and easy-to-use approaches for common statistical packages: SPSS, R, and MS Excel / LibreOffice Calc. Hands-on examples are included for each analysis, followed by a brief description of how a subsequent regression coefficient analysis is performed.
Anderson, Carl A; McRae, Allan F; Visscher, Peter M
2006-07-01
Standard quantitative trait loci (QTL) mapping techniques commonly assume that the trait is both fully observed and normally distributed. When considering survival or age-at-onset traits these assumptions are often incorrect. Methods have been developed to map QTL for survival traits; however, they are both computationally intensive and not available in standard genome analysis software packages. We propose a grouped linear regression method for the analysis of continuous survival data. Using simulation we compare this method to both the Cox and Weibull proportional hazards models and a standard linear regression method that ignores censoring. The grouped linear regression method is of equivalent power to both the Cox and Weibull proportional hazards methods and is significantly better than the standard linear regression method when censored observations are present. The method is also robust to the proportion of censored individuals and the underlying distribution of the trait. On the basis of linear regression methodology, the grouped linear regression model is computationally simple and fast and can be implemented readily in freely available statistical software.
Gusriani, N.; Firdaniza
2018-03-01
The existence of outliers on multiple linear regression analysis causes the Gaussian assumption to be unfulfilled. If the Least Square method is forcedly used on these data, it will produce a model that cannot represent most data. For that, we need a robust regression method against outliers. This paper will compare the Minimum Covariance Determinant (MCD) method and the TELBS method on secondary data on the productivity of phytoplankton, which contains outliers. Based on the robust determinant coefficient value, MCD method produces a better model compared to TELBS method.
Energy Technology Data Exchange (ETDEWEB)
Lopez Fontan, J.L.; Costa, J.; Ruso, J.M.; Prieto, G. [Dept. of Applied Physics, Univ. of Santiago de Compostela, Santiago de Compostela (Spain); Sarmiento, F. [Dept. of Mathematics, Faculty of Informatics, Univ. of A Coruna, A Coruna (Spain)
2004-02-01
The application of a statistical method, the local polynomial regression method, (LPRM), based on a nonparametric estimation of the regression function to determine the critical micelle concentration (cmc) is presented. The method is extremely flexible because it does not impose any parametric model on the subjacent structure of the data but rather allows the data to speak for themselves. Good concordance of cmc values with those obtained by other methods was found for systems in which the variation of a measured physical property with concentration showed an abrupt change. When this variation was slow, discrepancies between the values obtained by LPRM and others methods were found. (orig.)
Linear Regression with a Randomly Censored Covariate: Application to an Alzheimer's Study.
Atem, Folefac D; Qian, Jing; Maye, Jacqueline E; Johnson, Keith A; Betensky, Rebecca A
2017-01-01
The association between maternal age of onset of dementia and amyloid deposition (measured by in vivo positron emission tomography (PET) imaging) in cognitively normal older offspring is of interest. In a regression model for amyloid, special methods are required due to the random right censoring of the covariate of maternal age of onset of dementia. Prior literature has proposed methods to address the problem of censoring due to assay limit of detection, but not random censoring. We propose imputation methods and a survival regression method that do not require parametric assumptions about the distribution of the censored covariate. Existing imputation methods address missing covariates, but not right censored covariates. In simulation studies, we compare these methods to the simple, but inefficient complete case analysis, and to thresholding approaches. We apply the methods to the Alzheimer's study.
Missing value imputation for microarray gene expression data using histone acetylation information
Directory of Open Access Journals (Sweden)
Feng Jihua
2008-05-01
Full Text Available Abstract Background It is an important pre-processing step to accurately estimate missing values in microarray data, because complete datasets are required in numerous expression profile analysis in bioinformatics. Although several methods have been suggested, their performances are not satisfactory for datasets with high missing percentages. Results The paper explores the feasibility of doing missing value imputation with the help of gene regulatory mechanism. An imputation framework called histone acetylation information aided imputation method (HAIimpute method is presented. It incorporates the histone acetylation information into the conventional KNN(k-nearest neighbor and LLS(local least square imputation algorithms for final prediction of the missing values. The experimental results indicated that the use of acetylation information can provide significant improvements in microarray imputation accuracy. The HAIimpute methods consistently improve the widely used methods such as KNN and LLS in terms of normalized root mean squared error (NRMSE. Meanwhile, the genes imputed by HAIimpute methods are more correlated with the original complete genes in terms of Pearson correlation coefficients. Furthermore, the proposed methods also outperform GOimpute, which is one of the existing related methods that use the functional similarity as the external information. Conclusion We demonstrated that the using of histone acetylation information could greatly improve the performance of the imputation especially at high missing percentages. This idea can be generalized to various imputation methods to facilitate the performance. Moreover, with more knowledge accumulated on gene regulatory mechanism in addition to histone acetylation, the performance of our approach can be further improved and verified.
Fuzzy Linear Regression for the Time Series Data which is Fuzzified with SMRGT Method
Directory of Open Access Journals (Sweden)
Seçil YALAZ
2016-10-01
Full Text Available Our work on regression and classification provides a new contribution to the analysis of time series used in many areas for years. Owing to the fact that convergence could not obtained with the methods used in autocorrelation fixing process faced with time series regression application, success is not met or fall into obligation of changing the models’ degree. Changing the models’ degree may not be desirable in every situation. In our study, recommended for these situations, time series data was fuzzified by using the simple membership function and fuzzy rule generation technique (SMRGT and to estimate future an equation has created by applying fuzzy least square regression (FLSR method which is a simple linear regression method to this data. Although SMRGT has success in determining the flow discharge in open channels and can be used confidently for flow discharge modeling in open canals, as well as in pipe flow with some modifications, there is no clue about that this technique is successful in fuzzy linear regression modeling. Therefore, in order to address the luck of such a modeling, a new hybrid model has been described within this study. In conclusion, to demonstrate our methods’ efficiency, classical linear regression for time series data and linear regression for fuzzy time series data were applied to two different data sets, and these two approaches performances were compared by using different measures.
An NCME Instructional Module on Data Mining Methods for Classification and Regression
Sinharay, Sandip
2016-01-01
Data mining methods for classification and regression are becoming increasingly popular in various scientific fields. However, these methods have not been explored much in educational measurement. This module first provides a review, which should be accessible to a wide audience in education measurement, of some of these methods. The module then…
Cohen, Ayala; Nahum-Shani, Inbal; Doveh, Etti
2010-01-01
In their seminal paper, Edwards and Parry (1993) presented the polynomial regression as a better alternative to applying difference score in the study of congruence. Although this method is increasingly applied in congruence research, its complexity relative to other methods for assessing congruence (e.g., difference score methods) was one of the…
Statistical approach for selection of regression model during validation of bioanalytical method
Directory of Open Access Journals (Sweden)
Natalija Nakov
2014-06-01
Full Text Available The selection of an adequate regression model is the basis for obtaining accurate and reproducible results during the bionalytical method validation. Given the wide concentration range, frequently present in bioanalytical assays, heteroscedasticity of the data may be expected. Several weighted linear and quadratic regression models were evaluated during the selection of the adequate curve fit using nonparametric statistical tests: One sample rank test and Wilcoxon signed rank test for two independent groups of samples. The results obtained with One sample rank test could not give statistical justification for the selection of linear vs. quadratic regression models because slight differences between the error (presented through the relative residuals were obtained. Estimation of the significance of the differences in the RR was achieved using Wilcoxon signed rank test, where linear and quadratic regression models were treated as two independent groups. The application of this simple non-parametric statistical test provides statistical confirmation of the choice of an adequate regression model.
Bernhardt, Paul W; Wang, Huixia Judy; Zhang, Daowen
2014-01-01
Models for survival data generally assume that covariates are fully observed. However, in medical studies it is not uncommon for biomarkers to be censored at known detection limits. A computationally-efficient multiple imputation procedure for modeling survival data with covariates subject to detection limits is proposed. This procedure is developed in the context of an accelerated failure time model with a flexible seminonparametric error distribution. The consistency and asymptotic normality of the multiple imputation estimator are established and a consistent variance estimator is provided. An iterative version of the proposed multiple imputation algorithm that approximates the EM algorithm for maximum likelihood is also suggested. Simulation studies demonstrate that the proposed multiple imputation methods work well while alternative methods lead to estimates that are either biased or more variable. The proposed methods are applied to analyze the dataset from a recently-conducted GenIMS study.
Roth, Philip L; Le, Huy; Oh, In-Sue; Van Iddekinge, Chad H; Bobko, Philip
2018-06-01
Meta-analysis has become a well-accepted method for synthesizing empirical research about a given phenomenon. Many meta-analyses focus on synthesizing correlations across primary studies, but some primary studies do not report correlations. Peterson and Brown (2005) suggested that researchers could use standardized regression weights (i.e., beta coefficients) to impute missing correlations. Indeed, their beta estimation procedures (BEPs) have been used in meta-analyses in a wide variety of fields. In this study, the authors evaluated the accuracy of BEPs in meta-analysis. We first examined how use of BEPs might affect results from a published meta-analysis. We then developed a series of Monte Carlo simulations that systematically compared the use of existing correlations (that were not missing) to data sets that incorporated BEPs (that impute missing correlations from corresponding beta coefficients). These simulations estimated ρ̄ (mean population correlation) and SDρ (true standard deviation) across a variety of meta-analytic conditions. Results from both the existing meta-analysis and the Monte Carlo simulations revealed that BEPs were associated with potentially large biases when estimating ρ̄ and even larger biases when estimating SDρ. Using only existing correlations often substantially outperformed use of BEPs and virtually never performed worse than BEPs. Overall, the authors urge a return to the standard practice of using only existing correlations in meta-analysis. (PsycINFO Database Record (c) 2018 APA, all rights reserved).
Directory of Open Access Journals (Sweden)
ELİF BULUT
2013-06-01
Full Text Available Partial Least Squares Regression (PLSR is a multivariate statistical method that consists of partial least squares and multiple linear regression analysis. Explanatory variables, X, having multicollinearity are reduced to components which explain the great amount of covariance between explanatory and response variable. These components are few in number and they don’t have multicollinearity problem. Then multiple linear regression analysis is applied to those components to model the response variable Y. There are various PLSR algorithms. In this study NIPALS and PLS-Kernel algorithms will be studied and illustrated on a real data set.
The Bland-Altman Method Should Not Be Used in Regression Cross-Validation Studies
O'Connor, Daniel P.; Mahar, Matthew T.; Laughlin, Mitzi S.; Jackson, Andrew S.
2011-01-01
The purpose of this study was to demonstrate the bias in the Bland-Altman (BA) limits of agreement method when it is used to validate regression models. Data from 1,158 men were used to develop three regression equations to estimate maximum oxygen uptake (R[superscript 2] = 0.40, 0.61, and 0.82, respectively). The equations were evaluated in a…
Sparling, D.W.; Barzen, J.A.; Lovvorn, J.R.; Serie, J.R.
1992-01-01
Regression equations that use mensural data to estimate body condition have been developed for several water birds. These equations often have been based on data that represent different sexes, age classes, or seasons, without being adequately tested for intergroup differences. We used proximate carcass analysis of 538 adult and juvenile canvasbacks (Aythya valisineria ) collected during fall migration, winter, and spring migrations in 1975-76 and 1982-85 to test regression methods for estimating body condition.
Treating experimental data of inverse kinetic method by unitary linear regression analysis
International Nuclear Information System (INIS)
Zhao Yusen; Chen Xiaoliang
2009-01-01
The theory of treating experimental data of inverse kinetic method by unitary linear regression analysis was described. Not only the reactivity, but also the effective neutron source intensity could be calculated by this method. Computer code was compiled base on the inverse kinetic method and unitary linear regression analysis. The data of zero power facility BFS-1 in Russia were processed and the results were compared. The results show that the reactivity and the effective neutron source intensity can be obtained correctly by treating experimental data of inverse kinetic method using unitary linear regression analysis and the precision of reactivity measurement is improved. The central element efficiency can be calculated by using the reactivity. The result also shows that the effect to reactivity measurement caused by external neutron source should be considered when the reactor power is low and the intensity of external neutron source is strong. (authors)
Regression Methods for Virtual Metrology of Layer Thickness in Chemical Vapor Deposition
DEFF Research Database (Denmark)
Purwins, Hendrik; Barak, Bernd; Nagi, Ahmed
2014-01-01
The quality of wafer production in semiconductor manufacturing cannot always be monitored by a costly physical measurement. Instead of measuring a quantity directly, it can be predicted by a regression method (Virtual Metrology). In this paper, a survey on regression methods is given to predict...... average Silicon Nitride cap layer thickness for the Plasma Enhanced Chemical Vapor Deposition (PECVD) dual-layer metal passivation stack process. Process and production equipment Fault Detection and Classification (FDC) data are used as predictor variables. Various variable sets are compared: one most...... algorithm, and Support Vector Regression (SVR). On a test set, SVR outperforms the other methods by a large margin, being more robust towards changes in the production conditions. The method performs better on high-dimensional multivariate input data than on the most predictive variables alone. Process...
Statistical methods in regression and calibration analysis of chromosome aberration data
International Nuclear Information System (INIS)
Merkle, W.
1983-01-01
The method of iteratively reweighted least squares for the regression analysis of Poisson distributed chromosome aberration data is reviewed in the context of other fit procedures used in the cytogenetic literature. As an application of the resulting regression curves methods for calculating confidence intervals on dose from aberration yield are described and compared, and, for the linear quadratic model a confidence interval is given. Emphasis is placed on the rational interpretation and the limitations of various methods from a statistical point of view. (orig./MG)
Imputation of missing data in time series for air pollutants
Junger, W. L.; Ponce de Leon, A.
2015-02-01
Missing data are major concerns in epidemiological studies of the health effects of environmental air pollutants. This article presents an imputation-based method that is suitable for multivariate time series data, which uses the EM algorithm under the assumption of normal distribution. Different approaches are considered for filtering the temporal component. A simulation study was performed to assess validity and performance of proposed method in comparison with some frequently used methods. Simulations showed that when the amount of missing data was as low as 5%, the complete data analysis yielded satisfactory results regardless of the generating mechanism of the missing data, whereas the validity began to degenerate when the proportion of missing values exceeded 10%. The proposed imputation method exhibited good accuracy and precision in different settings with respect to the patterns of missing observations. Most of the imputations obtained valid results, even under missing not at random. The methods proposed in this study are implemented as a package called mtsdi for the statistical software system R.
Thompson, Russel L.
Homoscedasticity is an important assumption of linear regression. This paper explains what it is and why it is important to the researcher. Graphical and mathematical methods for testing the homoscedasticity assumption are demonstrated. Sources of homoscedasticity and types of homoscedasticity are discussed, and methods for correction are…
Calculation of U, Ra, Th and K contents in uranium ore by multiple linear regression method
International Nuclear Information System (INIS)
Lin Chao; Chen Yingqiang; Zhang Qingwen; Tan Fuwen; Peng Guanghui
1991-01-01
A multiple linear regression method was used to compute γ spectra of uranium ore samples and to calculate contents of U, Ra, Th, and K. In comparison with the inverse matrix method, its advantage is that no standard samples of pure U, Ra, Th and K are needed for obtaining response coefficients
Martens, Edwin P; de Boer, Anthonius; Pestman, Wiebe R; Belitser, Svetlana V; Stricker, Bruno H Ch; Klungel, Olaf H
PURPOSE: To compare adjusted effects of drug treatment for hypertension on the risk of stroke from propensity score (PS) methods with a multivariable Cox proportional hazards (Cox PH) regression in an observational study with censored data. METHODS: From two prospective population-based cohort
Whole-Genome Regression and Prediction Methods Applied to Plant and Animal Breeding
de los Campos, Gustavo; Hickey, John M.; Pong-Wong, Ricardo; Daetwyler, Hans D.; Calus, Mario P. L.
2013-01-01
Genomic-enabled prediction is becoming increasingly important in animal and plant breeding and is also receiving attention in human genetics. Deriving accurate predictions of complex traits requires implementing whole-genome regression (WGR) models where phenotypes are regressed on thousands of markers concurrently. Methods exist that allow implementing these large-p with small-n regressions, and genome-enabled selection (GS) is being implemented in several plant and animal breeding programs. The list of available methods is long, and the relationships between them have not been fully addressed. In this article we provide an overview of available methods for implementing parametric WGR models, discuss selected topics that emerge in applications, and present a general discussion of lessons learned from simulation and empirical data analysis in the last decade. PMID:22745228
Lopes, F B; Wu, X-L; Li, H; Xu, J; Perkins, T; Genho, J; Ferretti, R; Tait, R G; Bauck, S; Rosa, G J M
2018-02-01
Reliable genomic prediction of breeding values for quantitative traits requires the availability of sufficient number of animals with genotypes and phenotypes in the training set. As of 31 October 2016, there were 3,797 Brangus animals with genotypes and phenotypes. These Brangus animals were genotyped using different commercial SNP chips. Of them, the largest group consisted of 1,535 animals genotyped by the GGP-LDV4 SNP chip. The remaining 2,262 genotypes were imputed to the SNP content of the GGP-LDV4 chip, so that the number of animals available for training the genomic prediction models was more than doubled. The present study showed that the pooling of animals with both original or imputed 40K SNP genotypes substantially increased genomic prediction accuracies on the ten traits. By supplementing imputed genotypes, the relative gains in genomic prediction accuracies on estimated breeding values (EBV) were from 12.60% to 31.27%, and the relative gain in genomic prediction accuracies on de-regressed EBV was slightly small (i.e. 0.87%-18.75%). The present study also compared the performance of five genomic prediction models and two cross-validation methods. The five genomic models predicted EBV and de-regressed EBV of the ten traits similarly well. Of the two cross-validation methods, leave-one-out cross-validation maximized the number of animals at the stage of training for genomic prediction. Genomic prediction accuracy (GPA) on the ten quantitative traits was validated in 1,106 newly genotyped Brangus animals based on the SNP effects estimated in the previous set of 3,797 Brangus animals, and they were slightly lower than GPA in the original data. The present study was the first to leverage currently available genotype and phenotype resources in order to harness genomic prediction in Brangus beef cattle. © 2018 Blackwell Verlag GmbH.
Towards a more efficient representation of imputation operators in TPOT
Garciarena, Unai; Mendiburu, Alexander; Santana, Roberto
2018-01-01
Automated Machine Learning encompasses a set of meta-algorithms intended to design and apply machine learning techniques (e.g., model selection, hyperparameter tuning, model assessment, etc.). TPOT, a software for optimizing machine learning pipelines based on genetic programming (GP), is a novel example of this kind of applications. Recently we have proposed a way to introduce imputation methods as part of TPOT. While our approach was able to deal with problems with missing data, it can prod...
DTW-APPROACH FOR UNCORRELATED MULTIVARIATE TIME SERIES IMPUTATION
Phan , Thi-Thu-Hong; Poisson Caillault , Emilie; Bigand , André; Lefebvre , Alain
2017-01-01
International audience; Missing data are inevitable in almost domains of applied sciences. Data analysis with missing values can lead to a loss of efficiency and unreliable results, especially for large missing sub-sequence(s). Some well-known methods for multivariate time series imputation require high correlations between series or their features. In this paper , we propose an approach based on the shape-behaviour relation in low/un-correlated multivariate time series under an assumption of...
An improved partial least-squares regression method for Raman spectroscopy
Momenpour Tehran Monfared, Ali; Anis, Hanan
2017-10-01
It is known that the performance of partial least-squares (PLS) regression analysis can be improved using the backward variable selection method (BVSPLS). In this paper, we further improve the BVSPLS based on a novel selection mechanism. The proposed method is based on sorting the weighted regression coefficients, and then the importance of each variable of the sorted list is evaluated using root mean square errors of prediction (RMSEP) criterion in each iteration step. Our Improved BVSPLS (IBVSPLS) method has been applied to leukemia and heparin data sets and led to an improvement in limit of detection of Raman biosensing ranged from 10% to 43% compared to PLS. Our IBVSPLS was also compared to the jack-knifing (simpler) and Genetic Algorithm (more complex) methods. Our method was consistently better than the jack-knifing method and showed either a similar or a better performance compared to the genetic algorithm.
Wang, Jiangbo; Liu, Junhui; Li, Tiantian; Yin, Shuo; He, Xinhui
2018-01-01
The monthly electricity sales forecasting is a basic work to ensure the safety of the power system. This paper presented a monthly electricity sales forecasting method which comprehensively considers the coupled multi-factors of temperature, economic growth, electric power replacement and business expansion. The mathematical model is constructed by using regression method. The simulation results show that the proposed method is accurate and effective.
International Nuclear Information System (INIS)
Shuke, Noriyuki
1991-01-01
In hepatobiliary scintigraphy, kinetic model analysis, which provides kinetic parameters like hepatic extraction or excretion rate, have been done for quantitative evaluation of liver function. In this analysis, unknown model parameters are usually determined using nonlinear least square regression method (NLS method) where iterative calculation and initial estimate for unknown parameters are required. As a simple alternative to NLS method, direct integral linear least square regression method (DILS method), which can determine model parameters by a simple calculation without initial estimate, is proposed, and tested the applicability to analysis of hepatobiliary scintigraphy. In order to see whether DILS method could determine model parameters as good as NLS method, or to determine appropriate weight for DILS method, simulated theoretical data based on prefixed parameters were fitted to 1 compartment model using both DILS method with various weightings and NLS method. The parameter values obtained were then compared with prefixed values which were used for data generation. The effect of various weights on the error of parameter estimate was examined, and inverse of time was found to be the best weight to make the error minimum. When using this weight, DILS method could give parameter values close to those obtained by NLS method and both parameter values were very close to prefixed values. With appropriate weighting, the DILS method could provide reliable parameter estimate which is relatively insensitive to the data noise. In conclusion, the DILS method could be used as a simple alternative to NLS method, providing reliable parameter estimate. (author)
A different approach to estimate nonlinear regression model using numerical methods
Mahaboob, B.; Venkateswarlu, B.; Mokeshrayalu, G.; Balasiddamuni, P.
2017-11-01
This research paper concerns with the computational methods namely the Gauss-Newton method, Gradient algorithm methods (Newton-Raphson method, Steepest Descent or Steepest Ascent algorithm method, the Method of Scoring, the Method of Quadratic Hill-Climbing) based on numerical analysis to estimate parameters of nonlinear regression model in a very different way. Principles of matrix calculus have been used to discuss the Gradient-Algorithm methods. Yonathan Bard [1] discussed a comparison of gradient methods for the solution of nonlinear parameter estimation problems. However this article discusses an analytical approach to the gradient algorithm methods in a different way. This paper describes a new iterative technique namely Gauss-Newton method which differs from the iterative technique proposed by Gorden K. Smyth [2]. Hans Georg Bock et.al [10] proposed numerical methods for parameter estimation in DAE’s (Differential algebraic equation). Isabel Reis Dos Santos et al [11], Introduced weighted least squares procedure for estimating the unknown parameters of a nonlinear regression metamodel. For large-scale non smooth convex minimization the Hager and Zhang (HZ) conjugate gradient Method and the modified HZ (MHZ) method were presented by Gonglin Yuan et al [12].
Regression dilution bias: tools for correction methods and sample size calculation.
Berglund, Lars
2012-08-01
Random errors in measurement of a risk factor will introduce downward bias of an estimated association to a disease or a disease marker. This phenomenon is called regression dilution bias. A bias correction may be made with data from a validity study or a reliability study. In this article we give a non-technical description of designs of reliability studies with emphasis on selection of individuals for a repeated measurement, assumptions of measurement error models, and correction methods for the slope in a simple linear regression model where the dependent variable is a continuous variable. Also, we describe situations where correction for regression dilution bias is not appropriate. The methods are illustrated with the association between insulin sensitivity measured with the euglycaemic insulin clamp technique and fasting insulin, where measurement of the latter variable carries noticeable random error. We provide software tools for estimation of a corrected slope in a simple linear regression model assuming data for a continuous dependent variable and a continuous risk factor from a main study and an additional measurement of the risk factor in a reliability study. Also, we supply programs for estimation of the number of individuals needed in the reliability study and for choice of its design. Our conclusion is that correction for regression dilution bias is seldom applied in epidemiological studies. This may cause important effects of risk factors with large measurement errors to be neglected.
Data driven estimation of imputation error-a strategy for imputation with a reject option
DEFF Research Database (Denmark)
Bak, Nikolaj; Hansen, Lars Kai
2016-01-01
Missing data is a common problem in many research fields and is a challenge that always needs careful considerations. One approach is to impute the missing values, i.e., replace missing values with estimates. When imputation is applied, it is typically applied to all records with missing values i...
Directory of Open Access Journals (Sweden)
Yi-Ming Kuo
2011-06-01
Full Text Available Fine airborne particulate matter (PM2.5 has adverse effects on human health. Assessing the long-term effects of PM2.5 exposure on human health and ecology is often limited by a lack of reliable PM2.5 measurements. In Taipei, PM2.5 levels were not systematically measured until August, 2005. Due to the popularity of geographic information systems (GIS, the landuse regression method has been widely used in the spatial estimation of PM concentrations. This method accounts for the potential contributing factors of the local environment, such as traffic volume. Geostatistical methods, on other hand, account for the spatiotemporal dependence among the observations of ambient pollutants. This study assesses the performance of the landuse regression model for the spatiotemporal estimation of PM2.5 in the Taipei area. Specifically, this study integrates the landuse regression model with the geostatistical approach within the framework of the Bayesian maximum entropy (BME method. The resulting epistemic framework can assimilate knowledge bases including: (a empirical-based spatial trends of PM concentration based on landuse regression, (b the spatio-temporal dependence among PM observation information, and (c site-specific PM observations. The proposed approach performs the spatiotemporal estimation of PM2.5 levels in the Taipei area (Taiwan from 2005–2007.
Yu, Hwa-Lung; Wang, Chih-Hsih; Liu, Ming-Che; Kuo, Yi-Ming
2011-06-01
Fine airborne particulate matter (PM2.5) has adverse effects on human health. Assessing the long-term effects of PM2.5 exposure on human health and ecology is often limited by a lack of reliable PM2.5 measurements. In Taipei, PM2.5 levels were not systematically measured until August, 2005. Due to the popularity of geographic information systems (GIS), the landuse regression method has been widely used in the spatial estimation of PM concentrations. This method accounts for the potential contributing factors of the local environment, such as traffic volume. Geostatistical methods, on other hand, account for the spatiotemporal dependence among the observations of ambient pollutants. This study assesses the performance of the landuse regression model for the spatiotemporal estimation of PM2.5 in the Taipei area. Specifically, this study integrates the landuse regression model with the geostatistical approach within the framework of the Bayesian maximum entropy (BME) method. The resulting epistemic framework can assimilate knowledge bases including: (a) empirical-based spatial trends of PM concentration based on landuse regression, (b) the spatio-temporal dependence among PM observation information, and (c) site-specific PM observations. The proposed approach performs the spatiotemporal estimation of PM2.5 levels in the Taipei area (Taiwan) from 2005-2007.
A Simple and Convenient Method of Multiple Linear Regression to Calculate Iodine Molecular Constants
Cooper, Paul D.
2010-01-01
A new procedure using a student-friendly least-squares multiple linear-regression technique utilizing a function within Microsoft Excel is described that enables students to calculate molecular constants from the vibronic spectrum of iodine. This method is advantageous pedagogically as it calculates molecular constants for ground and excited…
Cox regression with missing covariate data using a modified partial likelihood method
DEFF Research Database (Denmark)
Martinussen, Torben; Holst, Klaus K.; Scheike, Thomas H.
2016-01-01
Missing covariate values is a common problem in survival analysis. In this paper we propose a novel method for the Cox regression model that is close to maximum likelihood but avoids the use of the EM-algorithm. It exploits that the observed hazard function is multiplicative in the baseline hazard...
Convert a low-cost sensor to a colorimeter using an improved regression method
Wu, Yifeng
2008-01-01
Closed loop color calibration is a process to maintain consistent color reproduction for color printers. To perform closed loop color calibration, a pre-designed color target should be printed, and automatically measured by a color measuring instrument. A low cost sensor has been embedded to the printer to perform the color measurement. A series of sensor calibration and color conversion methods have been developed. The purpose is to get accurate colorimetric measurement from the data measured by the low cost sensor. In order to get high accuracy colorimetric measurement, we need carefully calibrate the sensor, and minimize all possible errors during the color conversion. After comparing several classical color conversion methods, a regression based color conversion method has been selected. The regression is a powerful method to estimate the color conversion functions. But the main difficulty to use this method is to find an appropriate function to describe the relationship between the input and the output data. In this paper, we propose to use 1D pre-linearization tables to improve the linearity between the input sensor measuring data and the output colorimetric data. Using this method, we can increase the accuracy of the regression method, so as to improve the accuracy of the color conversion.
Improving accuracy of rare variant imputation with a two-step imputation approach
DEFF Research Database (Denmark)
Kreiner-Møller, Eskil; Medina-Gomez, Carolina; Uitterlinden, André G
2015-01-01
not being comprehensively scrutinized. Next-generation arrays ensuring sufficient coverage together with new reference panels, as the 1000 Genomes panel, are emerging to facilitate imputation of low frequent single-nucleotide polymorphisms (minor allele frequency (MAF) ... reference sample genotyped on a dense array and hereafter to the 1000 Genomes reference panel. We show that mean imputation quality, measured by the r(2) using this approach, increases by 28% for variants with a MAF between 1 and 5% as compared with direct imputation to 1000 Genomes reference. Similarly......Genotype imputation has been the pillar of the success of genome-wide association studies (GWAS) for identifying common variants associated with common diseases. However, most GWAS have been run using only 60 HapMap samples as reference for imputation, meaning less frequent and rare variants...
Sidik, S. M.
1975-01-01
Ridge, Marquardt's generalized inverse, shrunken, and principal components estimators are discussed in terms of the objectives of point estimation of parameters, estimation of the predictive regression function, and hypothesis testing. It is found that as the normal equations approach singularity, more consideration must be given to estimable functions of the parameters as opposed to estimation of the full parameter vector; that biased estimators all introduce constraints on the parameter space; that adoption of mean squared error as a criterion of goodness should be independent of the degree of singularity; and that ordinary least-squares subset regression is the best overall method.
Anderson, Carl A.; McRae, Allan F.; Visscher, Peter M.
2006-01-01
Standard quantitative trait loci (QTL) mapping techniques commonly assume that the trait is both fully observed and normally distributed. When considering survival or age-at-onset traits these assumptions are often incorrect. Methods have been developed to map QTL for survival traits; however, they are both computationally intensive and not available in standard genome analysis software packages. We propose a grouped linear regression method for the analysis of continuous survival data. Using...
A Comparative Study of Pairwise Learning Methods Based on Kernel Ridge Regression.
Stock, Michiel; Pahikkala, Tapio; Airola, Antti; De Baets, Bernard; Waegeman, Willem
2018-06-12
Many machine learning problems can be formulated as predicting labels for a pair of objects. Problems of that kind are often referred to as pairwise learning, dyadic prediction, or network inference problems. During the past decade, kernel methods have played a dominant role in pairwise learning. They still obtain a state-of-the-art predictive performance, but a theoretical analysis of their behavior has been underexplored in the machine learning literature. In this work we review and unify kernel-based algorithms that are commonly used in different pairwise learning settings, ranging from matrix filtering to zero-shot learning. To this end, we focus on closed-form efficient instantiations of Kronecker kernel ridge regression. We show that independent task kernel ridge regression, two-step kernel ridge regression, and a linear matrix filter arise naturally as a special case of Kronecker kernel ridge regression, implying that all these methods implicitly minimize a squared loss. In addition, we analyze universality, consistency, and spectral filtering properties. Our theoretical results provide valuable insights into assessing the advantages and limitations of existing pairwise learning methods.
Estimation Methods for Non-Homogeneous Regression - Minimum CRPS vs Maximum Likelihood
Gebetsberger, Manuel; Messner, Jakob W.; Mayr, Georg J.; Zeileis, Achim
2017-04-01
Non-homogeneous regression models are widely used to statistically post-process numerical weather prediction models. Such regression models correct for errors in mean and variance and are capable to forecast a full probability distribution. In order to estimate the corresponding regression coefficients, CRPS minimization is performed in many meteorological post-processing studies since the last decade. In contrast to maximum likelihood estimation, CRPS minimization is claimed to yield more calibrated forecasts. Theoretically, both scoring rules used as an optimization score should be able to locate a similar and unknown optimum. Discrepancies might result from a wrong distributional assumption of the observed quantity. To address this theoretical concept, this study compares maximum likelihood and minimum CRPS estimation for different distributional assumptions. First, a synthetic case study shows that, for an appropriate distributional assumption, both estimation methods yield to similar regression coefficients. The log-likelihood estimator is slightly more efficient. A real world case study for surface temperature forecasts at different sites in Europe confirms these results but shows that surface temperature does not always follow the classical assumption of a Gaussian distribution. KEYWORDS: ensemble post-processing, maximum likelihood estimation, CRPS minimization, probabilistic temperature forecasting, distributional regression models
Adikaram, K K L B; Hussein, M A; Effenberger, M; Becker, T
2015-01-01
Data processing requires a robust linear fit identification method. In this paper, we introduce a non-parametric robust linear fit identification method for time series. The method uses an indicator 2/n to identify linear fit, where n is number of terms in a series. The ratio Rmax of amax - amin and Sn - amin*n and that of Rmin of amax - amin and amax*n - Sn are always equal to 2/n, where amax is the maximum element, amin is the minimum element and Sn is the sum of all elements. If any series expected to follow y = c consists of data that do not agree with y = c form, Rmax > 2/n and Rmin > 2/n imply that the maximum and minimum elements, respectively, do not agree with linear fit. We define threshold values for outliers and noise detection as 2/n * (1 + k1) and 2/n * (1 + k2), respectively, where k1 > k2 and 0 ≤ k1 ≤ n/2 - 1. Given this relation and transformation technique, which transforms data into the form y = c, we show that removing all data that do not agree with linear fit is possible. Furthermore, the method is independent of the number of data points, missing data, removed data points and nature of distribution (Gaussian or non-Gaussian) of outliers, noise and clean data. These are major advantages over the existing linear fit methods. Since having a perfect linear relation between two variables in the real world is impossible, we used artificial data sets with extreme conditions to verify the method. The method detects the correct linear fit when the percentage of data agreeing with linear fit is less than 50%, and the deviation of data that do not agree with linear fit is very small, of the order of ±10-4%. The method results in incorrect detections only when numerical accuracy is insufficient in the calculation process.
Directory of Open Access Journals (Sweden)
K K L B Adikaram
Full Text Available Data processing requires a robust linear fit identification method. In this paper, we introduce a non-parametric robust linear fit identification method for time series. The method uses an indicator 2/n to identify linear fit, where n is number of terms in a series. The ratio Rmax of amax - amin and Sn - amin*n and that of Rmin of amax - amin and amax*n - Sn are always equal to 2/n, where amax is the maximum element, amin is the minimum element and Sn is the sum of all elements. If any series expected to follow y = c consists of data that do not agree with y = c form, Rmax > 2/n and Rmin > 2/n imply that the maximum and minimum elements, respectively, do not agree with linear fit. We define threshold values for outliers and noise detection as 2/n * (1 + k1 and 2/n * (1 + k2, respectively, where k1 > k2 and 0 ≤ k1 ≤ n/2 - 1. Given this relation and transformation technique, which transforms data into the form y = c, we show that removing all data that do not agree with linear fit is possible. Furthermore, the method is independent of the number of data points, missing data, removed data points and nature of distribution (Gaussian or non-Gaussian of outliers, noise and clean data. These are major advantages over the existing linear fit methods. Since having a perfect linear relation between two variables in the real world is impossible, we used artificial data sets with extreme conditions to verify the method. The method detects the correct linear fit when the percentage of data agreeing with linear fit is less than 50%, and the deviation of data that do not agree with linear fit is very small, of the order of ±10-4%. The method results in incorrect detections only when numerical accuracy is insufficient in the calculation process.
A Fast Gradient Method for Nonnegative Sparse Regression With Self-Dictionary
Gillis, Nicolas; Luce, Robert
2018-01-01
A nonnegative matrix factorization (NMF) can be computed efficiently under the separability assumption, which asserts that all the columns of the given input data matrix belong to the cone generated by a (small) subset of them. The provably most robust methods to identify these conic basis columns are based on nonnegative sparse regression and self dictionaries, and require the solution of large-scale convex optimization problems. In this paper we study a particular nonnegative sparse regression model with self dictionary. As opposed to previously proposed models, this model yields a smooth optimization problem where the sparsity is enforced through linear constraints. We show that the Euclidean projection on the polyhedron defined by these constraints can be computed efficiently, and propose a fast gradient method to solve our model. We compare our algorithm with several state-of-the-art methods on synthetic data sets and real-world hyperspectral images.
Van Belle, Vanya; Pelckmans, Kristiaan; Van Huffel, Sabine; Suykens, Johan A K
2011-10-01
To compare and evaluate ranking, regression and combined machine learning approaches for the analysis of survival data. The literature describes two approaches based on support vector machines to deal with censored observations. In the first approach the key idea is to rephrase the task as a ranking problem via the concordance index, a problem which can be solved efficiently in a context of structural risk minimization and convex optimization techniques. In a second approach, one uses a regression approach, dealing with censoring by means of inequality constraints. The goal of this paper is then twofold: (i) introducing a new model combining the ranking and regression strategy, which retains the link with existing survival models such as the proportional hazards model via transformation models; and (ii) comparison of the three techniques on 6 clinical and 3 high-dimensional datasets and discussing the relevance of these techniques over classical approaches fur survival data. We compare svm-based survival models based on ranking constraints, based on regression constraints and models based on both ranking and regression constraints. The performance of the models is compared by means of three different measures: (i) the concordance index, measuring the model's discriminating ability; (ii) the logrank test statistic, indicating whether patients with a prognostic index lower than the median prognostic index have a significant different survival than patients with a prognostic index higher than the median; and (iii) the hazard ratio after normalization to restrict the prognostic index between 0 and 1. Our results indicate a significantly better performance for models including regression constraints above models only based on ranking constraints. This work gives empirical evidence that svm-based models using regression constraints perform significantly better than svm-based models based on ranking constraints. Our experiments show a comparable performance for methods
Directory of Open Access Journals (Sweden)
Giuliano de Oliveira Freitas
2013-10-01
Full Text Available PURPOSE: To determine linear regression models between Alpins descriptive indices and Thibos astigmatic power vectors (APV, assessing the validity and strength of such correlations. METHODS: This case series prospectively assessed 62 eyes of 31 consecutive cataract patients with preoperative corneal astigmatism between 0.75 and 2.50 diopters in both eyes. Patients were randomly assorted among two phacoemulsification groups: one assigned to receive AcrySof®Toric intraocular lens (IOL in both eyes and another assigned to have AcrySof Natural IOL associated with limbal relaxing incisions, also in both eyes. All patients were reevaluated postoperatively at 6 months, when refractive astigmatism analysis was performed using both Alpins and Thibos methods. The ratio between Thibos postoperative APV and preoperative APV (APVratio and its linear regression to Alpins percentage of success of astigmatic surgery, percentage of astigmatism corrected and percentage of astigmatism reduction at the intended axis were assessed. RESULTS: Significant negative correlation between the ratio of post- and preoperative Thibos APVratio and Alpins percentage of success (%Success was found (Spearman's ρ=-0.93; linear regression is given by the following equation: %Success = (-APVratio + 1.00x100. CONCLUSION: The linear regression we found between APVratio and %Success permits a validated mathematical inference concerning the overall success of astigmatic surgery.
Using the fuzzy linear regression method to benchmark the energy efficiency of commercial buildings
International Nuclear Information System (INIS)
Chung, William
2012-01-01
Highlights: ► Fuzzy linear regression method is used for developing benchmarking systems. ► The systems can be used to benchmark energy efficiency of commercial buildings. ► The resulting benchmarking model can be used by public users. ► The resulting benchmarking model can capture the fuzzy nature of input–output data. -- Abstract: Benchmarking systems from a sample of reference buildings need to be developed to conduct benchmarking processes for the energy efficiency of commercial buildings. However, not all benchmarking systems can be adopted by public users (i.e., other non-reference building owners) because of the different methods in developing such systems. An approach for benchmarking the energy efficiency of commercial buildings using statistical regression analysis to normalize other factors, such as management performance, was developed in a previous work. However, the field data given by experts can be regarded as a distribution of possibility. Thus, the previous work may not be adequate to handle such fuzzy input–output data. Consequently, a number of fuzzy structures cannot be fully captured by statistical regression analysis. This present paper proposes the use of fuzzy linear regression analysis to develop a benchmarking process, the resulting model of which can be used by public users. An illustrative example is given as well.
Adikaram, K. K. L. B.; Becker, T.
2015-01-01
Data processing requires a robust linear fit identification method. In this paper, we introduce a non-parametric robust linear fit identification method for time series. The method uses an indicator 2/n to identify linear fit, where n is number of terms in a series. The ratio R max of a max − a min and S n − a min *n and that of R min of a max − a min and a max *n − S n are always equal to 2/n, where a max is the maximum element, a min is the minimum element and S n is the sum of all elements. If any series expected to follow y = c consists of data that do not agree with y = c form, R max > 2/n and R min > 2/n imply that the maximum and minimum elements, respectively, do not agree with linear fit. We define threshold values for outliers and noise detection as 2/n * (1 + k 1 ) and 2/n * (1 + k 2 ), respectively, where k 1 > k 2 and 0 ≤ k 1 ≤ n/2 − 1. Given this relation and transformation technique, which transforms data into the form y = c, we show that removing all data that do not agree with linear fit is possible. Furthermore, the method is independent of the number of data points, missing data, removed data points and nature of distribution (Gaussian or non-Gaussian) of outliers, noise and clean data. These are major advantages over the existing linear fit methods. Since having a perfect linear relation between two variables in the real world is impossible, we used artificial data sets with extreme conditions to verify the method. The method detects the correct linear fit when the percentage of data agreeing with linear fit is less than 50%, and the deviation of data that do not agree with linear fit is very small, of the order of ±10−4%. The method results in incorrect detections only when numerical accuracy is insufficient in the calculation process. PMID:26571035
Directory of Open Access Journals (Sweden)
Massoud Tabesh
2011-07-01
Full Text Available Optimum operation of water distribution networks is one of the priorities of sustainable development of water resources, considering the issues of increasing efficiency and decreasing the water losses. One of the key subjects in optimum operational management of water distribution systems is preparing rehabilitation and replacement schemes, prediction of pipes break rate and evaluation of their reliability. Several approaches have been presented in recent years regarding prediction of pipe failure rates which each one requires especial data sets. Deterministic models based on age and deterministic multi variables and stochastic group modeling are examples of the solutions which relate pipe break rates to parameters like age, material and diameters. In this paper besides the mentioned parameters, more factors such as pipe depth and hydraulic pressures are considered as well. Then using multi variable regression method, intelligent approaches (Artificial neural network and neuro fuzzy models and Evolutionary polynomial Regression method (EPR pipe burst rate are predicted. To evaluate the results of different approaches, a case study is carried out in a part ofMashhadwater distribution network. The results show the capability and advantages of ANN and EPR methods to predict pipe break rates, in comparison with neuro fuzzy and multi-variable regression methods.
Kaneko, Hiromasa
2018-02-26
To develop a new ensemble learning method and construct highly predictive regression models in chemoinformatics and chemometrics, applicability domains (ADs) are introduced into the ensemble learning process of prediction. When estimating values of an objective variable using subregression models, only the submodels with ADs that cover a query sample, i.e., the sample is inside the model's AD, are used. By constructing submodels and changing a list of selected explanatory variables, the union of the submodels' ADs, which defines the overall AD, becomes large, and the prediction performance is enhanced for diverse compounds. By analyzing a quantitative structure-activity relationship data set and a quantitative structure-property relationship data set, it is confirmed that the ADs can be enlarged and the estimation performance of regression models is improved compared with traditional methods.
Development of Compressive Failure Strength for Composite Laminate Using Regression Analysis Method
Energy Technology Data Exchange (ETDEWEB)
Lee, Myoung Keon [Agency for Defense Development, Daejeon (Korea, Republic of); Lee, Jeong Won; Yoon, Dong Hyun; Kim, Jae Hoon [Chungnam Nat’l Univ., Daejeon (Korea, Republic of)
2016-10-15
This paper provides the compressive failure strength value of composite laminate developed by using regression analysis method. Composite material in this document is a Carbon/Epoxy unidirection(UD) tape prepreg(Cycom G40-800/5276-1) cured at 350°F(177°C). The operating temperature is –60°F~+200°F(-55°C - +95°C). A total of 56 compression tests were conducted on specimens from eight (8) distinct laminates that were laid up by standard angle layers (0°, +45°, –45° and 90°). The ASTM-D-6484 standard was used for test method. The regression analysis was performed with the response variable being the laminate ultimate fracture strength and the regressor variables being two ply orientations (0° and ±45°)
Development of Compressive Failure Strength for Composite Laminate Using Regression Analysis Method
International Nuclear Information System (INIS)
Lee, Myoung Keon; Lee, Jeong Won; Yoon, Dong Hyun; Kim, Jae Hoon
2016-01-01
This paper provides the compressive failure strength value of composite laminate developed by using regression analysis method. Composite material in this document is a Carbon/Epoxy unidirection(UD) tape prepreg(Cycom G40-800/5276-1) cured at 350°F(177°C). The operating temperature is –60°F~+200°F(-55°C - +95°C). A total of 56 compression tests were conducted on specimens from eight (8) distinct laminates that were laid up by standard angle layers (0°, +45°, –45° and 90°). The ASTM-D-6484 standard was used for test method. The regression analysis was performed with the response variable being the laminate ultimate fracture strength and the regressor variables being two ply orientations (0° and ±45°)
James W. Hardin; Henrik Schmeidiche; Raymond J. Carroll
2003-01-01
This paper discusses and illustrates the method of regression calibration. This is a straightforward technique for fitting models with additive measurement error. We present this discussion in terms of generalized linear models (GLMs) following the notation defined in Hardin and Carroll (2003). Discussion will include specified measurement error, measurement error estimated by replicate error-prone proxies, and measurement error estimated by instrumental variables. The discussion focuses on s...
Assessing the performance of variational methods for mixed logistic regression models
Czech Academy of Sciences Publication Activity Database
Rijmen, F.; Vomlel, Jiří
2008-01-01
Roč. 78, č. 8 (2008), s. 765-779 ISSN 0094-9655 R&D Projects: GA MŠk 1M0572 Grant - others:GA MŠk(CZ) 2C06019 Institutional research plan: CEZ:AV0Z10750506 Keywords : Mixed models * Logistic regression * Variational methods * Lower bound approximation Subject RIV: BB - Applied Statistics, Operational Research Impact factor: 0.353, year: 2008
Comparison of Adaline and Multiple Linear Regression Methods for Rainfall Forecasting
Sutawinaya, IP; Astawa, INGA; Hariyanti, NKD
2018-01-01
Heavy rainfall can cause disaster, therefore need a forecast to predict rainfall intensity. Main factor that cause flooding is there is a high rainfall intensity and it makes the river become overcapacity. This will cause flooding around the area. Rainfall factor is a dynamic factor, so rainfall is very interesting to be studied. In order to support the rainfall forecasting, there are methods that can be used from Artificial Intelligence (AI) to statistic. In this research, we used Adaline for AI method and Regression for statistic method. The more accurate forecast result shows the method that used is good for forecasting the rainfall. Through those methods, we expected which is the best method for rainfall forecasting here.
Directory of Open Access Journals (Sweden)
McElwee Joshua
2009-06-01
Full Text Available Abstract Background Although high-throughput genotyping arrays have made whole-genome association studies (WGAS feasible, only a small proportion of SNPs in the human genome are actually surveyed in such studies. In addition, various SNP arrays assay different sets of SNPs, which leads to challenges in comparing results and merging data for meta-analyses. Genome-wide imputation of untyped markers allows us to address these issues in a direct fashion. Methods 384 Caucasian American liver donors were genotyped using Illumina 650Y (Ilmn650Y arrays, from which we also derived genotypes from the Ilmn317K array. On these data, we compared two imputation methods: MACH and BEAGLE. We imputed 2.5 million HapMap Release22 SNPs, and conducted GWAS on ~40,000 liver mRNA expression traits (eQTL analysis. In addition, 200 Caucasian American and 200 African American subjects were genotyped using the Affymetrix 500 K array plus a custom 164 K fill-in chip. We then imputed the HapMap SNPs and quantified the accuracy by randomly masking observed SNPs. Results MACH and BEAGLE perform similarly with respect to imputation accuracy. The Ilmn650Y results in excellent imputation performance, and it outperforms Affx500K or Ilmn317K sets. For Caucasian Americans, 90% of the HapMap SNPs were imputed at 98% accuracy. As expected, imputation of poorly tagged SNPs (untyped SNPs in weak LD with typed markers was not as successful. It was more challenging to impute genotypes in the African American population, given (1 shorter LD blocks and (2 admixture with Caucasian populations in this population. To address issue (2, we pooled HapMap CEU and YRI data as an imputation reference set, which greatly improved overall performance. The approximate 40,000 phenotypes scored in these populations provide a path to determine empirically how the power to detect associations is affected by the imputation procedures. That is, at a fixed false discovery rate, the number of cis
Hassanzadeh, S.; Hosseinibalam, F.; Omidvari, M.
2008-04-01
Data of seven meteorological variables (relative humidity, wet temperature, dry temperature, maximum temperature, minimum temperature, ground temperature and sun radiation time) and ozone values have been used for statistical analysis. Meteorological variables and ozone values were analyzed using both multiple linear regression and principal component methods. Data for the period 1999-2004 are analyzed jointly using both methods. For all periods, temperature dependent variables were highly correlated, but were all negatively correlated with relative humidity. Multiple regression analysis was used to fit the meteorological variables using the meteorological variables as predictors. A variable selection method based on high loading of varimax rotated principal components was used to obtain subsets of the predictor variables to be included in the linear regression model of the meteorological variables. In 1999, 2001 and 2002 one of the meteorological variables was weakly influenced predominantly by the ozone concentrations. However, the model did not predict that the meteorological variables for the year 2000 were not influenced predominantly by the ozone concentrations that point to variation in sun radiation. This could be due to other factors that were not explicitly considered in this study.
Cost reduction for web-based data imputation
Li, Zhixu; Shang, Shuo; Xie, Qing; Zhang, Xiangliang
2014-01-01
Web-based Data Imputation enables the completion of incomplete data sets by retrieving absent field values from the Web. In particular, complete fields can be used as keywords in imputation queries for absent fields. However, due to the ambiguity
Resche-Rigon, Matthieu; White, Ian R
2018-06-01
In multilevel settings such as individual participant data meta-analysis, a variable is 'systematically missing' if it is wholly missing in some clusters and 'sporadically missing' if it is partly missing in some clusters. Previously proposed methods to impute incomplete multilevel data handle either systematically or sporadically missing data, but frequently both patterns are observed. We describe a new multiple imputation by chained equations (MICE) algorithm for multilevel data with arbitrary patterns of systematically and sporadically missing variables. The algorithm is described for multilevel normal data but can easily be extended for other variable types. We first propose two methods for imputing a single incomplete variable: an extension of an existing method and a new two-stage method which conveniently allows for heteroscedastic data. We then discuss the difficulties of imputing missing values in several variables in multilevel data using MICE, and show that even the simplest joint multilevel model implies conditional models which involve cluster means and heteroscedasticity. However, a simulation study finds that the proposed methods can be successfully combined in a multilevel MICE procedure, even when cluster means are not included in the imputation models.
Correcting for cryptic relatedness by a regression-based genomic control method
Directory of Open Access Journals (Sweden)
Yang Yaning
2009-12-01
Full Text Available Abstract Background Genomic control (GC method is a useful tool to correct for the cryptic relatedness in population-based association studies. It was originally proposed for correcting for the variance inflation of Cochran-Armitage's additive trend test by using information from unlinked null markers, and was later generalized to be applicable to other tests with the additional requirement that the null markers are matched with the candidate marker in allele frequencies. However, matching allele frequencies limits the number of available null markers and thus limits the applicability of the GC method. On the other hand, errors in genotype/allele frequencies may cause further bias and variance inflation and thereby aggravate the effect of GC correction. Results In this paper, we propose a regression-based GC method using null markers that are not necessarily matched in allele frequencies with the candidate marker. Variation of allele frequencies of the null markers is adjusted by a regression method. Conclusion The proposed method can be readily applied to the Cochran-Armitage's trend tests other than the additive trend test, the Pearson's chi-square test and other robust efficiency tests. Simulation results show that the proposed method is effective in controlling type I error in the presence of population substructure.
A subagging regression method for estimating the qualitative and quantitative state of groundwater
Jeong, Jina; Park, Eungyu; Han, Weon Shik; Kim, Kue-Young
2017-08-01
A subsample aggregating (subagging) regression (SBR) method for the analysis of groundwater data pertaining to trend-estimation-associated uncertainty is proposed. The SBR method is validated against synthetic data competitively with other conventional robust and non-robust methods. From the results, it is verified that the estimation accuracies of the SBR method are consistent and superior to those of other methods, and the uncertainties are reasonably estimated; the others have no uncertainty analysis option. To validate further, actual groundwater data are employed and analyzed comparatively with Gaussian process regression (GPR). For all cases, the trend and the associated uncertainties are reasonably estimated by both SBR and GPR regardless of Gaussian or non-Gaussian skewed data. However, it is expected that GPR has a limitation in applications to severely corrupted data by outliers owing to its non-robustness. From the implementations, it is determined that the SBR method has the potential to be further developed as an effective tool of anomaly detection or outlier identification in groundwater state data such as the groundwater level and contaminant concentration.
DEFF Research Database (Denmark)
Jørgensen, Anders W.; Lundstrøm, Lars H; Wetterslev, Jørn
2014-01-01
BACKGROUND: In randomised trials of medical interventions, the most reliable analysis follows the intention-to-treat (ITT) principle. However, the ITT analysis requires that missing outcome data have to be imputed. Different imputation techniques may give different results and some may lead to bias...... of handling missing data in a 60-week placebo controlled anti-obesity drug trial on topiramate. METHODS: We compared an analysis of complete cases with datasets where missing body weight measurements had been replaced using three different imputation methods: LOCF, baseline carried forward (BOCF) and MI...
Liu, Siwei; Molenaar, Peter C M
2014-12-01
This article introduces iVAR, an R program for imputing missing data in multivariate time series on the basis of vector autoregressive (VAR) models. We conducted a simulation study to compare iVAR with three methods for handling missing data: listwise deletion, imputation with sample means and variances, and multiple imputation ignoring time dependency. The results showed that iVAR produces better estimates for the cross-lagged coefficients than do the other three methods. We demonstrate the use of iVAR with an empirical example of time series electrodermal activity data and discuss the advantages and limitations of the program.
Impact of regression methods on improved effects of soil structure on soil water retention estimates
Nguyen, Phuong Minh; De Pue, Jan; Le, Khoa Van; Cornelis, Wim
2015-06-01
Increasing the accuracy of pedotransfer functions (PTFs), an indirect method for predicting non-readily available soil features such as soil water retention characteristics (SWRC), is of crucial importance for large scale agro-hydrological modeling. Adding significant predictors (i.e., soil structure), and implementing more flexible regression algorithms are among the main strategies of PTFs improvement. The aim of this study was to investigate whether the improved effect of categorical soil structure information on estimating soil-water content at various matric potentials, which has been reported in literature, could be enduringly captured by regression techniques other than the usually applied linear regression. Two data mining techniques, i.e., Support Vector Machines (SVM), and k-Nearest Neighbors (kNN), which have been recently introduced as promising tools for PTF development, were utilized to test if the incorporation of soil structure will improve PTF's accuracy under a context of rather limited training data. The results show that incorporating descriptive soil structure information, i.e., massive, structured and structureless, as grouping criterion can improve the accuracy of PTFs derived by SVM approach in the range of matric potential of -6 to -33 kPa (average RMSE decreased up to 0.005 m3 m-3 after grouping, depending on matric potentials). The improvement was primarily attributed to the outperformance of SVM-PTFs calibrated on structureless soils. No improvement was obtained with kNN technique, at least not in our study in which the data set became limited in size after grouping. Since there is an impact of regression techniques on the improved effect of incorporating qualitative soil structure information, selecting a proper technique will help to maximize the combined influence of flexible regression algorithms and soil structure information on PTF accuracy.
A method for fitting regression splines with varying polynomial order in the linear mixed model.
Edwards, Lloyd J; Stewart, Paul W; MacDougall, James E; Helms, Ronald W
2006-02-15
The linear mixed model has become a widely used tool for longitudinal analysis of continuous variables. The use of regression splines in these models offers the analyst additional flexibility in the formulation of descriptive analyses, exploratory analyses and hypothesis-driven confirmatory analyses. We propose a method for fitting piecewise polynomial regression splines with varying polynomial order in the fixed effects and/or random effects of the linear mixed model. The polynomial segments are explicitly constrained by side conditions for continuity and some smoothness at the points where they join. By using a reparameterization of this explicitly constrained linear mixed model, an implicitly constrained linear mixed model is constructed that simplifies implementation of fixed-knot regression splines. The proposed approach is relatively simple, handles splines in one variable or multiple variables, and can be easily programmed using existing commercial software such as SAS or S-plus. The method is illustrated using two examples: an analysis of longitudinal viral load data from a study of subjects with acute HIV-1 infection and an analysis of 24-hour ambulatory blood pressure profiles.
Prastuti, M.; Suhartono; Salehah, NA
2018-04-01
The need for energy supply, especially for electricity in Indonesia has been increasing in the last past years. Furthermore, the high electricity usage by people at different times leads to the occurrence of heteroscedasticity issue. Estimate the electricity supply that could fulfilled the community’s need is very important, but the heteroscedasticity issue often made electricity forecasting hard to be done. An accurate forecast of electricity consumptions is one of the key challenges for energy provider to make better resources and service planning and also take control actions in order to balance the electricity supply and demand for community. In this paper, hybrid ARIMAX Quantile Regression (ARIMAX-QR) approach was proposed to predict the short-term electricity consumption in East Java. This method will also be compared to time series regression using RMSE, MAPE, and MdAPE criteria. The data used in this research was the electricity consumption per half-an-hour data during the period of September 2015 to April 2016. The results show that the proposed approach can be a competitive alternative to forecast short-term electricity in East Java. ARIMAX-QR using lag values and dummy variables as predictors yield more accurate prediction in both in-sample and out-sample data. Moreover, both time series regression and ARIMAX-QR methods with addition of lag values as predictor could capture accurately the patterns in the data. Hence, it produces better predictions compared to the models that not use additional lag variables.
A robust and efficient stepwise regression method for building sparse polynomial chaos expansions
Energy Technology Data Exchange (ETDEWEB)
Abraham, Simon, E-mail: Simon.Abraham@ulb.ac.be [Vrije Universiteit Brussel (VUB), Department of Mechanical Engineering, Research Group Fluid Mechanics and Thermodynamics, Pleinlaan 2, 1050 Brussels (Belgium); Raisee, Mehrdad [School of Mechanical Engineering, College of Engineering, University of Tehran, P.O. Box: 11155-4563, Tehran (Iran, Islamic Republic of); Ghorbaniasl, Ghader; Contino, Francesco; Lacor, Chris [Vrije Universiteit Brussel (VUB), Department of Mechanical Engineering, Research Group Fluid Mechanics and Thermodynamics, Pleinlaan 2, 1050 Brussels (Belgium)
2017-03-01
Polynomial Chaos (PC) expansions are widely used in various engineering fields for quantifying uncertainties arising from uncertain parameters. The computational cost of classical PC solution schemes is unaffordable as the number of deterministic simulations to be calculated grows dramatically with the number of stochastic dimension. This considerably restricts the practical use of PC at the industrial level. A common approach to address such problems is to make use of sparse PC expansions. This paper presents a non-intrusive regression-based method for building sparse PC expansions. The most important PC contributions are detected sequentially through an automatic search procedure. The variable selection criterion is based on efficient tools relevant to probabilistic method. Two benchmark analytical functions are used to validate the proposed algorithm. The computational efficiency of the method is then illustrated by a more realistic CFD application, consisting of the non-deterministic flow around a transonic airfoil subject to geometrical uncertainties. To assess the performance of the developed methodology, a detailed comparison is made with the well established LAR-based selection technique. The results show that the developed sparse regression technique is able to identify the most significant PC contributions describing the problem. Moreover, the most important stochastic features are captured at a reduced computational cost compared to the LAR method. The results also demonstrate the superior robustness of the method by repeating the analyses using random experimental designs.
Kim, Yoonsang; Choi, Young-Ku; Emery, Sherry
2013-08-01
Several statistical packages are capable of estimating generalized linear mixed models and these packages provide one or more of three estimation methods: penalized quasi-likelihood, Laplace, and Gauss-Hermite. Many studies have investigated these methods' performance for the mixed-effects logistic regression model. However, the authors focused on models with one or two random effects and assumed a simple covariance structure between them, which may not be realistic. When there are multiple correlated random effects in a model, the computation becomes intensive, and often an algorithm fails to converge. Moreover, in our analysis of smoking status and exposure to anti-tobacco advertisements, we have observed that when a model included multiple random effects, parameter estimates varied considerably from one statistical package to another even when using the same estimation method. This article presents a comprehensive review of the advantages and disadvantages of each estimation method. In addition, we compare the performances of the three methods across statistical packages via simulation, which involves two- and three-level logistic regression models with at least three correlated random effects. We apply our findings to a real dataset. Our results suggest that two packages-SAS GLIMMIX Laplace and SuperMix Gaussian quadrature-perform well in terms of accuracy, precision, convergence rates, and computing speed. We also discuss the strengths and weaknesses of the two packages in regard to sample sizes.
Time Series Imputation via L1 Norm-Based Singular Spectrum Analysis
Kalantari, Mahdi; Yarmohammadi, Masoud; Hassani, Hossein; Silva, Emmanuel Sirimal
Missing values in time series data is a well-known and important problem which many researchers have studied extensively in various fields. In this paper, a new nonparametric approach for missing value imputation in time series is proposed. The main novelty of this research is applying the L1 norm-based version of Singular Spectrum Analysis (SSA), namely L1-SSA which is robust against outliers. The performance of the new imputation method has been compared with many other established methods. The comparison is done by applying them to various real and simulated time series. The obtained results confirm that the SSA-based methods, especially L1-SSA can provide better imputation in comparison to other methods.
Kim, Yoonsang; Emery, Sherry
2013-01-01
Several statistical packages are capable of estimating generalized linear mixed models and these packages provide one or more of three estimation methods: penalized quasi-likelihood, Laplace, and Gauss-Hermite. Many studies have investigated these methods’ performance for the mixed-effects logistic regression model. However, the authors focused on models with one or two random effects and assumed a simple covariance structure between them, which may not be realistic. When there are multiple correlated random effects in a model, the computation becomes intensive, and often an algorithm fails to converge. Moreover, in our analysis of smoking status and exposure to anti-tobacco advertisements, we have observed that when a model included multiple random effects, parameter estimates varied considerably from one statistical package to another even when using the same estimation method. This article presents a comprehensive review of the advantages and disadvantages of each estimation method. In addition, we compare the performances of the three methods across statistical packages via simulation, which involves two- and three-level logistic regression models with at least three correlated random effects. We apply our findings to a real dataset. Our results suggest that two packages—SAS GLIMMIX Laplace and SuperMix Gaussian quadrature—perform well in terms of accuracy, precision, convergence rates, and computing speed. We also discuss the strengths and weaknesses of the two packages in regard to sample sizes. PMID:24288415
Directory of Open Access Journals (Sweden)
Guan Lian
2018-01-01
Full Text Available Accurate prediction of taxi-out time is significant precondition for improving the operationality of the departure process at an airport, as well as reducing the long taxi-out time, congestion, and excessive emission of greenhouse gases. Unfortunately, several of the traditional methods of predicting taxi-out time perform unsatisfactorily at congested airports. This paper describes and tests three of those conventional methods which include Generalized Linear Model, Softmax Regression Model, and Artificial Neural Network method and two improved Support Vector Regression (SVR approaches based on swarm intelligence algorithm optimization, which include Particle Swarm Optimization (PSO and Firefly Algorithm. In order to improve the global searching ability of Firefly Algorithm, adaptive step factor and Lévy flight are implemented simultaneously when updating the location function. Six factors are analysed, of which delay is identified as one significant factor in congested airports. Through a series of specific dynamic analyses, a case study of Beijing International Airport (PEK is tested with historical data. The performance measures show that the proposed two SVR approaches, especially the Improved Firefly Algorithm (IFA optimization-based SVR method, not only perform as the best modelling measures and accuracy rate compared with the representative forecast models, but also can achieve a better predictive performance when dealing with abnormal taxi-out time states.
da Silva, Claudia Pereira; Emídio, Elissandro Soares; de Marchi, Mary Rosa Rodrigues
2015-01-01
This paper describes the validation of a method consisting of solid-phase extraction followed by gas chromatography-tandem mass spectrometry for the analysis of the ultraviolet (UV) filters benzophenone-3, ethylhexyl salicylate, ethylhexyl methoxycinnamate and octocrylene. The method validation criteria included evaluation of selectivity, analytical curve, trueness, precision, limits of detection and limits of quantification. The non-weighted linear regression model has traditionally been used for calibration, but it is not necessarily the optimal model in all cases. Because the assumption of homoscedasticity was not met for the analytical data in this work, a weighted least squares linear regression was used for the calibration method. The evaluated analytical parameters were satisfactory for the analytes and showed recoveries at four fortification levels between 62% and 107%, with relative standard deviations less than 14%. The detection limits ranged from 7.6 to 24.1 ng L(-1). The proposed method was used to determine the amount of UV filters in water samples from water treatment plants in Araraquara and Jau in São Paulo, Brazil. Copyright © 2014 Elsevier B.V. All rights reserved.
Zhu, Xiaofeng; Suk, Heung-Il; Wang, Li; Lee, Seong-Whan; Shen, Dinggang
2017-05-01
In this paper, we focus on joint regression and classification for Alzheimer's disease diagnosis and propose a new feature selection method by embedding the relational information inherent in the observations into a sparse multi-task learning framework. Specifically, the relational information includes three kinds of relationships (such as feature-feature relation, response-response relation, and sample-sample relation), for preserving three kinds of the similarity, such as for the features, the response variables, and the samples, respectively. To conduct feature selection, we first formulate the objective function by imposing these three relational characteristics along with an ℓ 2,1 -norm regularization term, and further propose a computationally efficient algorithm to optimize the proposed objective function. With the dimension-reduced data, we train two support vector regression models to predict the clinical scores of ADAS-Cog and MMSE, respectively, and also a support vector classification model to determine the clinical label. We conducted extensive experiments on the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset to validate the effectiveness of the proposed method. Our experimental results showed the efficacy of the proposed method in enhancing the performances of both clinical scores prediction and disease status identification, compared to the state-of-the-art methods. Copyright © 2015 Elsevier B.V. All rights reserved.
Directory of Open Access Journals (Sweden)
Gholam Reza Sheykhzadeh
2017-02-01
Full Text Available Introduction: Penetration resistance is one of the criteria for evaluating soil compaction. It correlates with several soil properties such as vehicle trafficability, resistance to root penetration, seedling emergence, and soil compaction by farm machinery. Direct measurement of penetration resistance is time consuming and difficult because of high temporal and spatial variability. Therefore, many different regressions and artificial neural network pedotransfer functions have been proposed to estimate penetration resistance from readily available soil variables such as particle size distribution, bulk density (Db and gravimetric water content (θm. The lands of Ardabil Province are one of the main production regions of potato in Iran, thus, obtaining the soil penetration resistance in these regions help with the management of potato production. The objective of this research was to derive pedotransfer functions by using regression and artificial neural network to predict penetration resistance from some soil variations in the agricultural soils of Ardabil plain and to compare the performance of artificial neural network with regression models. Materials and methods: Disturbed and undisturbed soil samples (n= 105 were systematically taken from 0-10 cm soil depth with nearly 3000 m distance in the agricultural lands of the Ardabil plain ((lat 38°15' to 38°40' N, long 48°16' to 48°61' E. The contents of sand, silt and clay (hydrometer method, CaCO3 (titration method, bulk density (cylinder method, particle density (Dp (pychnometer method, organic carbon (wet oxidation method, total porosity(calculating from Db and Dp, saturated (θs and field soil water (θf using the gravimetric method were measured in the laboratory. Mean geometric diameter (dg and standard deviation (σg of soil particles were computed using the percentages of sand, silt and clay. Penetration resistance was measured in situ using cone penetrometer (analog model at 10
Landslide susceptibility mapping on a global scale using the method of logistic regression
Directory of Open Access Journals (Sweden)
L. Lin
2017-08-01
Full Text Available This paper proposes a statistical model for mapping global landslide susceptibility based on logistic regression. After investigating explanatory factors for landslides in the existing literature, five factors were selected for model landslide susceptibility: relative relief, extreme precipitation, lithology, ground motion and soil moisture. When building the model, 70 % of landslide and nonlandslide points were randomly selected for logistic regression, and the others were used for model validation. To evaluate the accuracy of predictive models, this paper adopts several criteria including a receiver operating characteristic (ROC curve method. Logistic regression experiments found all five factors to be significant in explaining landslide occurrence on a global scale. During the modeling process, percentage correct in confusion matrix of landslide classification was approximately 80 % and the area under the curve (AUC was nearly 0.87. During the validation process, the above statistics were about 81 % and 0.88, respectively. Such a result indicates that the model has strong robustness and stable performance. This model found that at a global scale, soil moisture can be dominant in the occurrence of landslides and topographic factor may be secondary.
Liu, Ke; Chen, Xiaojing; Li, Limin; Chen, Huiling; Ruan, Xiukai; Liu, Wenbin
2015-02-09
The successive projections algorithm (SPA) is widely used to select variables for multiple linear regression (MLR) modeling. However, SPA used only once may not obtain all the useful information of the full spectra, because the number of selected variables cannot exceed the number of calibration samples in the SPA algorithm. Therefore, the SPA-MLR method risks the loss of useful information. To make a full use of the useful information in the spectra, a new method named "consensus SPA-MLR" (C-SPA-MLR) is proposed herein. This method is the combination of consensus strategy and SPA-MLR method. In the C-SPA-MLR method, SPA-MLR is used to construct member models with different subsets of variables, which are selected from the remaining variables iteratively. A consensus prediction is obtained by combining the predictions of the member models. The proposed method is evaluated by analyzing the near infrared (NIR) spectra of corn and diesel. The results of C-SPA-MLR method showed a better prediction performance compared with the SPA-MLR and full-spectra PLS methods. Moreover, these results could serve as a reference for combination the consensus strategy and other variable selection methods when analyzing NIR spectra and other spectroscopic techniques. Copyright © 2014 Elsevier B.V. All rights reserved.
Development of K-Nearest Neighbour Regression Method in Forecasting River Stream Flow
Directory of Open Access Journals (Sweden)
Mohammad Azmi
2012-07-01
Full Text Available Different statistical, non-statistical and black-box methods have been used in forecasting processes. Among statistical methods, K-nearest neighbour non-parametric regression method (K-NN due to its natural simplicity and mathematical base is one of the recommended methods for forecasting processes. In this study, K-NN method is explained completely. Besides, development and improvement approaches such as best neighbour estimation, data transformation functions, distance functions and proposed extrapolation method are described. K-NN method in company with its development approaches is used in streamflow forecasting of Zayandeh-Rud Dam upper basin. Comparing between final results of classic K-NN method and modified K-NN (number of neighbour 5, transformation function of Range Scaling, distance function of Mahanalobis and proposed extrapolation method shows that modified K-NN in criteria of goodness of fit, root mean square error, percentage of volume of error and correlation has had performance improvement 45% , 59% and 17% respectively. These results approve necessity of applying mentioned approaches to derive more accurate forecasts.
Comparing the index-flood and multiple-regression methods using L-moments
Malekinezhad, H.; Nachtnebel, H. P.; Klik, A.
In arid and semi-arid regions, the length of records is usually too short to ensure reliable quantile estimates. Comparing index-flood and multiple-regression analyses based on L-moments was the main objective of this study. Factor analysis was applied to determine main influencing variables on flood magnitude. Ward’s cluster and L-moments approaches were applied to several sites in the Namak-Lake basin in central Iran to delineate homogeneous regions based on site characteristics. Homogeneity test was done using L-moments-based measures. Several distributions were fitted to the regional flood data and index-flood and multiple-regression methods as two regional flood frequency methods were compared. The results of factor analysis showed that length of main waterway, compactness coefficient, mean annual precipitation, and mean annual temperature were the main variables affecting flood magnitude. The study area was divided into three regions based on the Ward’s method of clustering approach. The homogeneity test based on L-moments showed that all three regions were acceptably homogeneous. Five distributions were fitted to the annual peak flood data of three homogeneous regions. Using the L-moment ratios and the Z-statistic criteria, GEV distribution was identified as the most robust distribution among five candidate distributions for all the proposed sub-regions of the study area, and in general, it was concluded that the generalised extreme value distribution was the best-fit distribution for every three regions. The relative root mean square error (RRMSE) measure was applied for evaluating the performance of the index-flood and multiple-regression methods in comparison with the curve fitting (plotting position) method. In general, index-flood method gives more reliable estimations for various flood magnitudes of different recurrence intervals. Therefore, this method should be adopted as regional flood frequency method for the study area and the Namak-Lake basin
International Nuclear Information System (INIS)
Wu, Jie; Wang, Jianzhou; Lu, Haiyan; Dong, Yao; Lu, Xiaoxiao
2013-01-01
Highlights: ► The seasonal and trend items of the data series are forecasted separately. ► Seasonal item in the data series is verified by the Kendall τ correlation testing. ► Different regression models are applied to the trend item forecasting. ► We examine the superiority of the combined models by the quartile value comparison. ► Paired-sample T test is utilized to confirm the superiority of the combined models. - Abstract: For an energy-limited economy system, it is crucial to forecast load demand accurately. This paper devotes to 1-week-ahead daily load forecasting approach in which load demand series are predicted by employing the information of days before being similar to that of the forecast day. As well as in many nonlinear systems, seasonal item and trend item are coexisting in load demand datasets. In this paper, the existing of the seasonal item in the load demand data series is firstly verified according to the Kendall τ correlation testing method. Then in the belief of the separate forecasting to the seasonal item and the trend item would improve the forecasting accuracy, hybrid models by combining seasonal exponential adjustment method (SEAM) with the regression methods are proposed in this paper, where SEAM and the regression models are employed to seasonal and trend items forecasting respectively. Comparisons of the quartile values as well as the mean absolute percentage error values demonstrate this forecasting technique can significantly improve the accuracy though models applied to the trend item forecasting are eleven different ones. This superior performance of this separate forecasting technique is further confirmed by the paired-sample T tests
UniFIeD Univariate Frequency-based Imputation for Time Series Data
Friese, Martina; Stork, Jörg; Ramos Guerra, Ricardo; Bartz-Beielstein, Thomas; Thaker, Soham; Flasch, Oliver; Zaefferer, Martin
2013-01-01
This paper introduces UniFIeD, a new data preprocessing method for time series. UniFIeD can cope with large intervals of missing data. A scalable test function generator, which allows the simulation of time series with different gap sizes, is presented additionally. An experimental study demonstrates that (i) UniFIeD shows a significant better performance than simple imputation methods and (ii) UniFIeD is able to handle situations, where advanced imputation methods fail. The results are indep...
Ahmad, Meraj; Sinha, Anubhav; Ghosh, Sreya; Kumar, Vikrant; Davila, Sonia; Yajnik, Chittaranjan S; Chandak, Giriraj R
2017-07-27
Imputation is a computational method based on the principle of haplotype sharing allowing enrichment of genome-wide association study datasets. It depends on the haplotype structure of the population and density of the genotype data. The 1000 Genomes Project led to the generation of imputation reference panels which have been used globally. However, recent studies have shown that population-specific panels provide better enrichment of genome-wide variants. We compared the imputation accuracy using 1000 Genomes phase 3 reference panel and a panel generated from genome-wide data on 407 individuals from Western India (WIP). The concordance of imputed variants was cross-checked with next-generation re-sequencing data on a subset of genomic regions. Further, using the genome-wide data from 1880 individuals, we demonstrate that WIP works better than the 1000 Genomes phase 3 panel and when merged with it, significantly improves the imputation accuracy throughout the minor allele frequency range. We also show that imputation using only South Asian component of the 1000 Genomes phase 3 panel works as good as the merged panel, making it computationally less intensive job. Thus, our study stresses that imputation accuracy using 1000 Genomes phase 3 panel can be further improved by including population-specific reference panels from South Asia.
Directory of Open Access Journals (Sweden)
Bangyong Sun
2014-01-01
Full Text Available The polynomial regression method is employed to calculate the relationship of device color space and CIE color space for color characterization, and the performance of different expressions with specific parameters is evaluated. Firstly, the polynomial equation for color conversion is established and the computation of polynomial coefficients is analysed. And then different forms of polynomial equations are used to calculate the RGB and CMYK’s CIE color values, while the corresponding color errors are compared. At last, an optimal polynomial expression is obtained by analysing several related parameters during color conversion, including polynomial numbers, the degree of polynomial terms, the selection of CIE visual spaces, and the linearization.
Face Hallucination with Linear Regression Model in Semi-Orthogonal Multilinear PCA Method
Asavaskulkiet, Krissada
2018-04-01
In this paper, we propose a new face hallucination technique, face images reconstruction in HSV color space with a semi-orthogonal multilinear principal component analysis method. This novel hallucination technique can perform directly from tensors via tensor-to-vector projection by imposing the orthogonality constraint in only one mode. In our experiments, we use facial images from FERET database to test our hallucination approach which is demonstrated by extensive experiments with high-quality hallucinated color faces. The experimental results assure clearly demonstrated that we can generate photorealistic color face images by using the SO-MPCA subspace with a linear regression model.
Real-time prediction of respiratory motion based on local regression methods
International Nuclear Information System (INIS)
Ruan, D; Fessler, J A; Balter, J M
2007-01-01
Recent developments in modulation techniques enable conformal delivery of radiation doses to small, localized target volumes. One of the challenges in using these techniques is real-time tracking and predicting target motion, which is necessary to accommodate system latencies. For image-guided-radiotherapy systems, it is also desirable to minimize sampling rates to reduce imaging dose. This study focuses on predicting respiratory motion, which can significantly affect lung tumours. Predicting respiratory motion in real-time is challenging, due to the complexity of breathing patterns and the many sources of variability. We propose a prediction method based on local regression. There are three major ingredients of this approach: (1) forming an augmented state space to capture system dynamics, (2) local regression in the augmented space to train the predictor from previous observation data using semi-periodicity of respiratory motion, (3) local weighting adjustment to incorporate fading temporal correlations. To evaluate prediction accuracy, we computed the root mean square error between predicted tumor motion and its observed location for ten patients. For comparison, we also investigated commonly used predictive methods, namely linear prediction, neural networks and Kalman filtering to the same data. The proposed method reduced the prediction error for all imaging rates and latency lengths, particularly for long prediction lengths
Local regression type methods applied to the study of geophysics and high frequency financial data
Mariani, M. C.; Basu, K.
2014-09-01
In this work we applied locally weighted scatterplot smoothing techniques (Lowess/Loess) to Geophysical and high frequency financial data. We first analyze and apply this technique to the California earthquake geological data. A spatial analysis was performed to show that the estimation of the earthquake magnitude at a fixed location is very accurate up to the relative error of 0.01%. We also applied the same method to a high frequency data set arising in the financial sector and obtained similar satisfactory results. The application of this approach to the two different data sets demonstrates that the overall method is accurate and efficient, and the Lowess approach is much more desirable than the Loess method. The previous works studied the time series analysis; in this paper our local regression models perform a spatial analysis for the geophysics data providing different information. For the high frequency data, our models estimate the curve of best fit where data are dependent on time.
Directory of Open Access Journals (Sweden)
Ward Judson A
2013-01-01
Full Text Available Abstract Background Rapid development of highly saturated genetic maps aids molecular breeding, which can accelerate gain per breeding cycle in woody perennial plants such as Rubus idaeus (red raspberry. Recently, robust genotyping methods based on high-throughput sequencing were developed, which provide high marker density, but result in some genotype errors and a large number of missing genotype values. Imputation can reduce the number of missing values and can correct genotyping errors, but current methods of imputation require a reference genome and thus are not an option for most species. Results Genotyping by Sequencing (GBS was used to produce highly saturated maps for a R. idaeus pseudo-testcross progeny. While low coverage and high variance in sequencing resulted in a large number of missing values for some individuals, a novel method of imputation based on maximum likelihood marker ordering from initial marker segregation overcame the challenge of missing values, and made map construction computationally tractable. The two resulting parental maps contained 4521 and 2391 molecular markers spanning 462.7 and 376.6 cM respectively over seven linkage groups. Detection of precise genomic regions with segregation distortion was possible because of map saturation. Microsatellites (SSRs linked these results to published maps for cross-validation and map comparison. Conclusions GBS together with genome-independent imputation provides a rapid method for genetic map construction in any pseudo-testcross progeny. Our method of imputation estimates the correct genotype call of missing values and corrects genotyping errors that lead to inflated map size and reduced precision in marker placement. Comparison of SSRs to published R. idaeus maps showed that the linkage maps constructed with GBS and our method of imputation were robust, and marker positioning reliable. The high marker density allowed identification of genomic regions with segregation
Using the Superpopulation Model for Imputations and Variance Computation in Survey Sampling
Directory of Open Access Journals (Sweden)
Petr Novák
2012-03-01
Full Text Available This study is aimed at variance computation techniques for estimates of population characteristics based on survey sampling and imputation. We use the superpopulation regression model, which means that the target variable values for each statistical unit are treated as random realizations of a linear regression model with weighted variance. We focus on regression models with one auxiliary variable and no intercept, which have many applications and straightforward interpretation in business statistics. Furthermore, we deal with caseswhere the estimates are not independent and thus the covariance must be computed. We also consider chained regression models with auxiliary variables as random variables instead of constants.
Geographically weighted regression based methods for merging satellite and gauge precipitation
Chao, Lijun; Zhang, Ke; Li, Zhijia; Zhu, Yuelong; Wang, Jingfeng; Yu, Zhongbo
2018-03-01
Real-time precipitation data with high spatiotemporal resolutions are crucial for accurate hydrological forecasting. To improve the spatial resolution and quality of satellite precipitation, a three-step satellite and gauge precipitation merging method was formulated in this study: (1) bilinear interpolation is first applied to downscale coarser satellite precipitation to a finer resolution (PS); (2) the (mixed) geographically weighted regression methods coupled with a weighting function are then used to estimate biases of PS as functions of gauge observations (PO) and PS; and (3) biases of PS are finally corrected to produce a merged precipitation product. Based on the above framework, eight algorithms, a combination of two geographically weighted regression methods and four weighting functions, are developed to merge CMORPH (CPC MORPHing technique) precipitation with station observations on a daily scale in the Ziwuhe Basin of China. The geographical variables (elevation, slope, aspect, surface roughness, and distance to the coastline) and a meteorological variable (wind speed) were used for merging precipitation to avoid the artificial spatial autocorrelation resulting from traditional interpolation methods. The results show that the combination of the MGWR and BI-square function (MGWR-BI) has the best performance (R = 0.863 and RMSE = 7.273 mm/day) among the eight algorithms. The MGWR-BI algorithm was then applied to produce hourly merged precipitation product. Compared to the original CMORPH product (R = 0.208 and RMSE = 1.208 mm/hr), the quality of the merged data is significantly higher (R = 0.724 and RMSE = 0.706 mm/hr). The developed merging method not only improves the spatial resolution and quality of the satellite product but also is easy to implement, which is valuable for hydrological modeling and other applications.
VIGAN: Missing View Imputation with Generative Adversarial Networks.
Shang, Chao; Palmer, Aaron; Sun, Jiangwen; Chen, Ko-Shin; Lu, Jin; Bi, Jinbo
2017-01-01
In an era when big data are becoming the norm, there is less concern with the quantity but more with the quality and completeness of the data. In many disciplines, data are collected from heterogeneous sources, resulting in multi-view or multi-modal datasets. The missing data problem has been challenging to address in multi-view data analysis. Especially, when certain samples miss an entire view of data, it creates the missing view problem. Classic multiple imputations or matrix completion methods are hardly effective here when no information can be based on in the specific view to impute data for such samples. The commonly-used simple method of removing samples with a missing view can dramatically reduce sample size, thus diminishing the statistical power of a subsequent analysis. In this paper, we propose a novel approach for view imputation via generative adversarial networks (GANs), which we name by VIGAN. This approach first treats each view as a separate domain and identifies domain-to-domain mappings via a GAN using randomly-sampled data from each view, and then employs a multi-modal denoising autoencoder (DAE) to reconstruct the missing view from the GAN outputs based on paired data across the views. Then, by optimizing the GAN and DAE jointly, our model enables the knowledge integration for domain mappings and view correspondences to effectively recover the missing view. Empirical results on benchmark datasets validate the VIGAN approach by comparing against the state of the art. The evaluation of VIGAN in a genetic study of substance use disorders further proves the effectiveness and usability of this approach in life science.
Directory of Open Access Journals (Sweden)
Adi Syahputra
2014-03-01
Full Text Available Quantitative structure activity relationship (QSAR for 21 insecticides of phthalamides containing hydrazone (PCH was studied using multiple linear regression (MLR, principle component regression (PCR and artificial neural network (ANN. Five descriptors were included in the model for MLR and ANN analysis, and five latent variables obtained from principle component analysis (PCA were used in PCR analysis. Calculation of descriptors was performed using semi-empirical PM6 method. ANN analysis was found to be superior statistical technique compared to the other methods and gave a good correlation between descriptors and activity (r2 = 0.84. Based on the obtained model, we have successfully designed some new insecticides with higher predicted activity than those of previously synthesized compounds, e.g.2-(decalinecarbamoyl-5-chloro-N’-((5-methylthiophen-2-ylmethylene benzohydrazide, 2-(decalinecarbamoyl-5-chloro-N’-((thiophen-2-yl-methylene benzohydrazide and 2-(decaline carbamoyl-N’-(4-fluorobenzylidene-5-chlorobenzohydrazide with predicted log LC50 of 1.640, 1.672, and 1.769 respectively.
Nonparametric Methods in Astronomy: Think, Regress, Observe—Pick Any Three
Steinhardt, Charles L.; Jermyn, Adam S.
2018-02-01
Telescopes are much more expensive than astronomers, so it is essential to minimize required sample sizes by using the most data-efficient statistical methods possible. However, the most commonly used model-independent techniques for finding the relationship between two variables in astronomy are flawed. In the worst case they can lead without warning to subtly yet catastrophically wrong results, and even in the best case they require more data than necessary. Unfortunately, there is no single best technique for nonparametric regression. Instead, we provide a guide for how astronomers can choose the best method for their specific problem and provide a python library with both wrappers for the most useful existing algorithms and implementations of two new algorithms developed here.
An Application of Robust Method in Multiple Linear Regression Model toward Credit Card Debt
Amira Azmi, Nur; Saifullah Rusiman, Mohd; Khalid, Kamil; Roslan, Rozaini; Sufahani, Suliadi; Mohamad, Mahathir; Salleh, Rohayu Mohd; Hamzah, Nur Shamsidah Amir
2018-04-01
Credit card is a convenient alternative replaced cash or cheque, and it is essential component for electronic and internet commerce. In this study, the researchers attempt to determine the relationship and significance variables between credit card debt and demographic variables such as age, household income, education level, years with current employer, years at current address, debt to income ratio and other debt. The provided data covers 850 customers information. There are three methods that applied to the credit card debt data which are multiple linear regression (MLR) models, MLR models with least quartile difference (LQD) method and MLR models with mean absolute deviation method. After comparing among three methods, it is found that MLR model with LQD method became the best model with the lowest value of mean square error (MSE). According to the final model, it shows that the years with current employer, years at current address, household income in thousands and debt to income ratio are positively associated with the amount of credit debt. Meanwhile variables for age, level of education and other debt are negatively associated with amount of credit debt. This study may serve as a reference for the bank company by using robust methods, so that they could better understand their options and choice that is best aligned with their goals for inference regarding to the credit card debt.
LENUS (Irish Health Repository)
Hardouin, Jean-Benoit
2011-07-14
Abstract Background Nowadays, more and more clinical scales consisting in responses given by the patients to some items (Patient Reported Outcomes - PRO), are validated with models based on Item Response Theory, and more specifically, with a Rasch model. In the validation sample, presence of missing data is frequent. The aim of this paper is to compare sixteen methods for handling the missing data (mainly based on simple imputation) in the context of psychometric validation of PRO by a Rasch model. The main indexes used for validation by a Rasch model are compared. Methods A simulation study was performed allowing to consider several cases, notably the possibility for the missing values to be informative or not and the rate of missing data. Results Several imputations methods produce bias on psychometrical indexes (generally, the imputation methods artificially improve the psychometric qualities of the scale). In particular, this is the case with the method based on the Personal Mean Score (PMS) which is the most commonly used imputation method in practice. Conclusions Several imputation methods should be avoided, in particular PMS imputation. From a general point of view, it is important to use an imputation method that considers both the ability of the patient (measured for example by his\\/her score), and the difficulty of the item (measured for example by its rate of favourable responses). Another recommendation is to always consider the addition of a random process in the imputation method, because such a process allows reducing the bias. Last, the analysis realized without imputation of the missing data (available case analyses) is an interesting alternative to the simple imputation in this context.
Estimating HIES Data through Ratio and Regression Methods for Different Sampling Designs
Directory of Open Access Journals (Sweden)
Faqir Muhammad
2007-01-01
Full Text Available In this study, comparison has been made for different sampling designs, using the HIES data of North West Frontier Province (NWFP for 2001-02 and 1998-99 collected from the Federal Bureau of Statistics, Statistical Division, Government of Pakistan, Islamabad. The performance of the estimators has also been considered using bootstrap and Jacknife. A two-stage stratified random sample design is adopted by HIES. In the first stage, enumeration blocks and villages are treated as the first stage Primary Sampling Units (PSU. The sample PSU’s are selected with probability proportional to size. Secondary Sampling Units (SSU i.e., households are selected by systematic sampling with a random start. They have used a single study variable. We have compared the HIES technique with some other designs, which are: Stratified Simple Random Sampling. Stratified Systematic Sampling. Stratified Ranked Set Sampling. Stratified Two Phase Sampling. Ratio and Regression methods were applied with two study variables, which are: Income (y and Household sizes (x. Jacknife and Bootstrap are used for variance replication. Simple Random Sampling with sample size (462 to 561 gave moderate variances both by Jacknife and Bootstrap. By applying Systematic Sampling, we received moderate variance with sample size (467. In Jacknife with Systematic Sampling, we obtained variance of regression estimator greater than that of ratio estimator for a sample size (467 to 631. At a sample size (952 variance of ratio estimator gets greater than that of regression estimator. The most efficient design comes out to be Ranked set sampling compared with other designs. The Ranked set sampling with jackknife and bootstrap, gives minimum variance even with the smallest sample size (467. Two Phase sampling gave poor performance. Multi-stage sampling applied by HIES gave large variances especially if used with a single study variable.
A suggested approach for imputation of missing dietary data for young children in daycare.
Stevens, June; Ou, Fang-Shu; Truesdale, Kimberly P; Zeng, Donglin; Vaughn, Amber E; Pratt, Charlotte; Ward, Dianne S
2015-01-01
Parent-reported 24-h diet recalls are an accepted method of estimating intake in young children. However, many children eat while at childcare making accurate proxy reports by parents difficult. The goal of this study was to demonstrate a method to impute missing weekday lunch and daytime snack nutrient data for daycare children and to explore the concurrent predictive and criterion validity of the method. Data were from children aged 2-5 years in the My Parenting SOS project (n=308; 870 24-h diet recalls). Mixed models were used to simultaneously predict breakfast, dinner, and evening snacks (B+D+ES); lunch; and daytime snacks for all children after adjusting for age, sex, and body mass index (BMI). From these models, we imputed the missing weekday daycare lunches by interpolation using the mean lunch to B+D+ES [L/(B+D+ES)] ratio among non-daycare children on weekdays and the L/(B+D+ES) ratio for all children on weekends. Daytime snack data were used to impute snacks. The reported mean (± standard deviation) weekday intake was lower for daycare children [725 (±324) kcal] compared to non-daycare children [1,048 (±463) kcal]. Weekend intake for all children was 1,173 (±427) kcal. After imputation, weekday caloric intake for daycare children was 1,230 (±409) kcal. Daily intakes that included imputed data were associated with age and sex but not with BMI. This work indicates that imputation is a promising method for improving the precision of daily nutrient data from young children.
A suggested approach for imputation of missing dietary data for young children in daycare
Directory of Open Access Journals (Sweden)
June Stevens
2015-12-01
Full Text Available Background: Parent-reported 24-h diet recalls are an accepted method of estimating intake in young children. However, many children eat while at childcare making accurate proxy reports by parents difficult. Objective: The goal of this study was to demonstrate a method to impute missing weekday lunch and daytime snack nutrient data for daycare children and to explore the concurrent predictive and criterion validity of the method. Design: Data were from children aged 2-5 years in the My Parenting SOS project (n=308; 870 24-h diet recalls. Mixed models were used to simultaneously predict breakfast, dinner, and evening snacks (B+D+ES; lunch; and daytime snacks for all children after adjusting for age, sex, and body mass index (BMI. From these models, we imputed the missing weekday daycare lunches by interpolation using the mean lunch to B+D+ES [L/(B+D+ES] ratio among non-daycare children on weekdays and the L/(B+D+ES ratio for all children on weekends. Daytime snack data were used to impute snacks. Results: The reported mean (± standard deviation weekday intake was lower for daycare children [725 (±324 kcal] compared to non-daycare children [1,048 (±463 kcal]. Weekend intake for all children was 1,173 (±427 kcal. After imputation, weekday caloric intake for daycare children was 1,230 (±409 kcal. Daily intakes that included imputed data were associated with age and sex but not with BMI. Conclusion: This work indicates that imputation is a promising method for improving the precision of daily nutrient data from young children.
Robust Methods for Moderation Analysis with a Two-Level Regression Model.
Yang, Miao; Yuan, Ke-Hai
2016-01-01
Moderation analysis has many applications in social sciences. Most widely used estimation methods for moderation analysis assume that errors are normally distributed and homoscedastic. When these assumptions are not met, the results from a classical moderation analysis can be misleading. For more reliable moderation analysis, this article proposes two robust methods with a two-level regression model when the predictors do not contain measurement error. One method is based on maximum likelihood with Student's t distribution and the other is based on M-estimators with Huber-type weights. An algorithm for obtaining the robust estimators is developed. Consistent estimates of standard errors of the robust estimators are provided. The robust approaches are compared against normal-distribution-based maximum likelihood (NML) with respect to power and accuracy of parameter estimates through a simulation study. Results show that the robust approaches outperform NML under various distributional conditions. Application of the robust methods is illustrated through a real data example. An R program is developed and documented to facilitate the application of the robust methods.
Applications of Monte Carlo method to nonlinear regression of rheological data
Kim, Sangmo; Lee, Junghaeng; Kim, Sihyun; Cho, Kwang Soo
2018-02-01
In rheological study, it is often to determine the parameters of rheological models from experimental data. Since both rheological data and values of the parameters vary in logarithmic scale and the number of the parameters is quite large, conventional method of nonlinear regression such as Levenberg-Marquardt (LM) method is usually ineffective. The gradient-based method such as LM is apt to be caught in local minima which give unphysical values of the parameters whenever the initial guess of the parameters is far from the global optimum. Although this problem could be solved by simulated annealing (SA), the Monte Carlo (MC) method needs adjustable parameter which could be determined in ad hoc manner. We suggest a simplified version of SA, a kind of MC methods which results in effective values of the parameters of most complicated rheological models such as the Carreau-Yasuda model of steady shear viscosity, discrete relaxation spectrum and zero-shear viscosity as a function of concentration and molecular weight.
Logistic Regression and Path Analysis Method to Analyze Factors influencing Students’ Achievement
Noeryanti, N.; Suryowati, K.; Setyawan, Y.; Aulia, R. R.
2018-04-01
Students' academic achievement cannot be separated from the influence of two factors namely internal and external factors. The first factors of the student (internal factors) consist of intelligence (X1), health (X2), interest (X3), and motivation of students (X4). The external factors consist of family environment (X5), school environment (X6), and society environment (X7). The objects of this research are eighth grade students of the school year 2016/2017 at SMPN 1 Jiwan Madiun sampled by using simple random sampling. Primary data are obtained by distributing questionnaires. The method used in this study is binary logistic regression analysis that aims to identify internal and external factors that affect student’s achievement and how the trends of them. Path Analysis was used to determine the factors that influence directly, indirectly or totally on student’s achievement. Based on the results of binary logistic regression, variables that affect student’s achievement are interest and motivation. And based on the results obtained by path analysis, factors that have a direct impact on student’s achievement are students’ interest (59%) and students’ motivation (27%). While the factors that have indirect influences on students’ achievement, are family environment (97%) and school environment (37).
Wulandari, S. P.; Salamah, M.; Rositawati, A. F. D.
2018-04-01
Food security is the condition where the food fulfilment is managed well for the country till the individual. Indonesia is one of the country which has the commitment to create the food security becomes main priority. However, the food necessity becomes common thing means that it doesn’t care about nutrient standard and the health condition of family member, so in the fulfilment of food necessity also has to consider the disease suffered by the family member, one of them is pulmonary tuberculosa. From that reasons, this research is conducted to know the factors which influence on household food security status which suffered from pulmonary tuberculosis in the coastal area of Surabaya by using binary logistic regression method. The analysis result by using binary logistic regression shows that the variables wife latest education, house density and spacious house ventilation significantly affect on household food security status which suffered from pulmonary tuberculosis in the coastal area of Surabaya, where the wife education level is University/equivalent, the house density is eligible or 8 m2/person and spacious house ventilation 10% of the floor area has the opportunity to become food secure households amounted to 0.911089. While the chance of becoming food insecure households amounted to 0.088911. The model household food security status which suffered from pulmonary tuberculosis in the coastal area of Surabaya has been conformable, and the overall percentages of those classifications are at 71.8%.
International Nuclear Information System (INIS)
Gupta, N
2008-01-01
3013 containers are designed in accordance with the DOE-STD-3013-2004. These containers are qualified to store plutonium (Pu) bearing materials such as PuO2 for 50 years. DOT shipping packages such as the 9975 are used to store the 3013 containers in the K-Area Material Storage (KAMS) facility at Savannah River Site (SRS). DOE-STD-3013-2004 requires that a comprehensive surveillance program be set up to ensure that the 3013 container design parameters are not violated during the long term storage. To ensure structural integrity of the 3013 containers, thermal analyses using finite element models were performed to predict the contents and component temperatures for different but well defined parameters such as storage ambient temperature, PuO 2 density, fill heights, weights, and thermal loading. Interpolation is normally used to calculate temperatures if the actual parameter values are different from the analyzed values. A statistical analysis technique using regression methods is proposed to develop simple polynomial relations to predict temperatures for the actual parameter values found in the containers. The analysis shows that regression analysis is a powerful tool to develop simple relations to assess component temperatures
Multi-step polynomial regression method to model and forecast malaria incidence.
Directory of Open Access Journals (Sweden)
Chandrajit Chatterjee
Full Text Available Malaria is one of the most severe problems faced by the world even today. Understanding the causative factors such as age, sex, social factors, environmental variability etc. as well as underlying transmission dynamics of the disease is important for epidemiological research on malaria and its eradication. Thus, development of suitable modeling approach and methodology, based on the available data on the incidence of the disease and other related factors is of utmost importance. In this study, we developed a simple non-linear regression methodology in modeling and forecasting malaria incidence in Chennai city, India, and predicted future disease incidence with high confidence level. We considered three types of data to develop the regression methodology: a longer time series data of Slide Positivity Rates (SPR of malaria; a smaller time series data (deaths due to Plasmodium vivax of one year; and spatial data (zonal distribution of P. vivax deaths for the city along with the climatic factors, population and previous incidence of the disease. We performed variable selection by simple correlation study, identification of the initial relationship between variables through non-linear curve fitting and used multi-step methods for induction of variables in the non-linear regression analysis along with applied Gauss-Markov models, and ANOVA for testing the prediction, validity and constructing the confidence intervals. The results execute the applicability of our method for different types of data, the autoregressive nature of forecasting, and show high prediction power for both SPR and P. vivax deaths, where the one-lag SPR values plays an influential role and proves useful for better prediction. Different climatic factors are identified as playing crucial role on shaping the disease curve. Further, disease incidence at zonal level and the effect of causative factors on different zonal clusters indicate the pattern of malaria prevalence in the city
A New Global Regression Analysis Method for the Prediction of Wind Tunnel Model Weight Corrections
Ulbrich, Norbert Manfred; Bridge, Thomas M.; Amaya, Max A.
2014-01-01
A new global regression analysis method is discussed that predicts wind tunnel model weight corrections for strain-gage balance loads during a wind tunnel test. The method determines corrections by combining "wind-on" model attitude measurements with least squares estimates of the model weight and center of gravity coordinates that are obtained from "wind-off" data points. The method treats the least squares fit of the model weight separate from the fit of the center of gravity coordinates. Therefore, it performs two fits of "wind- off" data points and uses the least squares estimator of the model weight as an input for the fit of the center of gravity coordinates. Explicit equations for the least squares estimators of the weight and center of gravity coordinates are derived that simplify the implementation of the method in the data system software of a wind tunnel. In addition, recommendations for sets of "wind-off" data points are made that take typical model support system constraints into account. Explicit equations of the confidence intervals on the model weight and center of gravity coordinates and two different error analyses of the model weight prediction are also discussed in the appendices of the paper.
Hwang, Kyu-Baek; Lee, In-Hee; Park, Jin-Ho; Hambuch, Tina; Choe, Yongjoon; Kim, MinHyeok; Lee, Kyungjoon; Song, Taemin; Neu, Matthew B; Gupta, Neha; Kohane, Isaac S; Green, Robert C; Kong, Sek Won
2014-08-01
As whole genome sequencing (WGS) uncovers variants associated with rare and common diseases, an immediate challenge is to minimize false-positive findings due to sequencing and variant calling errors. False positives can be reduced by combining results from orthogonal sequencing methods, but costly. Here, we present variant filtering approaches using logistic regression (LR) and ensemble genotyping to minimize false positives without sacrificing sensitivity. We evaluated the methods using paired WGS datasets of an extended family prepared using two sequencing platforms and a validated set of variants in NA12878. Using LR or ensemble genotyping based filtering, false-negative rates were significantly reduced by 1.1- to 17.8-fold at the same levels of false discovery rates (5.4% for heterozygous and 4.5% for homozygous single nucleotide variants (SNVs); 30.0% for heterozygous and 18.7% for homozygous insertions; 25.2% for heterozygous and 16.6% for homozygous deletions) compared to the filtering based on genotype quality scores. Moreover, ensemble genotyping excluded > 98% (105,080 of 107,167) of false positives while retaining > 95% (897 of 937) of true positives in de novo mutation (DNM) discovery in NA12878, and performed better than a consensus method using two sequencing platforms. Our proposed methods were effective in prioritizing phenotype-associated variants, and an ensemble genotyping would be essential to minimize false-positive DNM candidates. © 2014 WILEY PERIODICALS, INC.
A dynamic particle filter-support vector regression method for reliability prediction
International Nuclear Information System (INIS)
Wei, Zhao; Tao, Tao; ZhuoShu, Ding; Zio, Enrico
2013-01-01
Support vector regression (SVR) has been applied to time series prediction and some works have demonstrated the feasibility of its use to forecast system reliability. For accuracy of reliability forecasting, the selection of SVR's parameters is important. The existing research works on SVR's parameters selection divide the example dataset into training and test subsets, and tune the parameters on the training data. However, these fixed parameters can lead to poor prediction capabilities if the data of the test subset differ significantly from those of training. Differently, the novel method proposed in this paper uses particle filtering to estimate the SVR model parameters according to the whole measurement sequence up to the last observation instance. By treating the SVR training model as the observation equation of a particle filter, our method allows updating the SVR model parameters dynamically when a new observation comes. Because of the adaptability of the parameters to dynamic data pattern, the new PF–SVR method has superior prediction performance over that of standard SVR. Four application results show that PF–SVR is more robust than SVR to the decrease of the number of training data and the change of initial SVR parameter values. Also, even if there are trends in the test data different from those in the training data, the method can capture the changes, correct the SVR parameters and obtain good predictions. -- Highlights: •A dynamic PF–SVR method is proposed to predict the system reliability. •The method can adjust the SVR parameters according to the change of data. •The method is robust to the size of training data and initial parameter values. •Some cases based on both artificial and real data are studied. •PF–SVR shows superior prediction performance over standard SVR
Statistical learning method in regression analysis of simulated positron spectral data
International Nuclear Information System (INIS)
Avdic, S. Dz.
2005-01-01
Positron lifetime spectroscopy is a non-destructive tool for detection of radiation induced defects in nuclear reactor materials. This work concerns the applicability of the support vector machines method for the input data compression in the neural network analysis of positron lifetime spectra. It has been demonstrated that the SVM technique can be successfully applied to regression analysis of positron spectra. A substantial data compression of about 50 % and 8 % of the whole training set with two and three spectral components respectively has been achieved including a high accuracy of the spectra approximation. However, some parameters in the SVM approach such as the insensitivity zone e and the penalty parameter C have to be chosen carefully to obtain a good performance. (author)
Analysis of Case-Control Association Studies: SNPs, Imputation and Haplotypes
Chatterjee, Nilanjan
2009-11-01
Although prospective logistic regression is the standard method of analysis for case-control data, it has been recently noted that in genetic epidemiologic studies one can use the "retrospective" likelihood to gain major power by incorporating various population genetics model assumptions such as Hardy-Weinberg-Equilibrium (HWE), gene-gene and gene-environment independence. In this article we review these modern methods and contrast them with the more classical approaches through two types of applications (i) association tests for typed and untyped single nucleotide polymorphisms (SNPs) and (ii) estimation of haplotype effects and haplotype-environment interactions in the presence of haplotype-phase ambiguity. We provide novel insights to existing methods by construction of various score-tests and pseudo-likelihoods. In addition, we describe a novel two-stage method for analysis of untyped SNPs that can use any flexible external algorithm for genotype imputation followed by a powerful association test based on the retrospective likelihood. We illustrate applications of the methods using simulated and real data. © Institute of Mathematical Statistics, 2009.
Analysis of Case-Control Association Studies: SNPs, Imputation and Haplotypes
Chatterjee, Nilanjan; Chen, Yi-Hau; Luo, Sheng; Carroll, Raymond J.
2009-01-01
Although prospective logistic regression is the standard method of analysis for case-control data, it has been recently noted that in genetic epidemiologic studies one can use the "retrospective" likelihood to gain major power by incorporating various population genetics model assumptions such as Hardy-Weinberg-Equilibrium (HWE), gene-gene and gene-environment independence. In this article we review these modern methods and contrast them with the more classical approaches through two types of applications (i) association tests for typed and untyped single nucleotide polymorphisms (SNPs) and (ii) estimation of haplotype effects and haplotype-environment interactions in the presence of haplotype-phase ambiguity. We provide novel insights to existing methods by construction of various score-tests and pseudo-likelihoods. In addition, we describe a novel two-stage method for analysis of untyped SNPs that can use any flexible external algorithm for genotype imputation followed by a powerful association test based on the retrospective likelihood. We illustrate applications of the methods using simulated and real data. © Institute of Mathematical Statistics, 2009.
The crux of the method: assumptions in ordinary least squares and logistic regression.
Long, Rebecca G
2008-10-01
Logistic regression has increasingly become the tool of choice when analyzing data with a binary dependent variable. While resources relating to the technique are widely available, clear discussions of why logistic regression should be used in place of ordinary least squares regression are difficult to find. The current paper compares and contrasts the assumptions of ordinary least squares with those of logistic regression and explains why logistic regression's looser assumptions make it adept at handling violations of the more important assumptions in ordinary least squares.
Austin, Peter C; Lee, Douglas S; Steyerberg, Ewout W; Tu, Jack V
2012-01-01
In biomedical research, the logistic regression model is the most commonly used method for predicting the probability of a binary outcome. While many clinical researchers have expressed an enthusiasm for regression trees, this method may have limited accuracy for predicting health outcomes. We aimed to evaluate the improvement that is achieved by using ensemble-based methods, including bootstrap aggregation (bagging) of regression trees, random forests, and boosted regression trees. We analyzed 30-day mortality in two large cohorts of patients hospitalized with either acute myocardial infarction (N = 16,230) or congestive heart failure (N = 15,848) in two distinct eras (1999–2001 and 2004–2005). We found that both the in-sample and out-of-sample prediction of ensemble methods offered substantial improvement in predicting cardiovascular mortality compared to conventional regression trees. However, conventional logistic regression models that incorporated restricted cubic smoothing splines had even better performance. We conclude that ensemble methods from the data mining and machine learning literature increase the predictive performance of regression trees, but may not lead to clear advantages over conventional logistic regression models for predicting short-term mortality in population-based samples of subjects with cardiovascular disease. PMID:22777999
Dinç, Erdal; Ustündağ, Ozgür; Baleanu, Dumitru
2010-08-01
The sole use of pyridoxine hydrochloride during treatment of tuberculosis gives rise to pyridoxine deficiency. Therefore, a combination of pyridoxine hydrochloride and isoniazid is used in pharmaceutical dosage form in tuberculosis treatment to reduce this side effect. In this study, two chemometric methods, partial least squares (PLS) and principal component regression (PCR), were applied to the simultaneous determination of pyridoxine (PYR) and isoniazid (ISO) in their tablets. A concentration training set comprising binary mixtures of PYR and ISO consisting of 20 different combinations were randomly prepared in 0.1 M HCl. Both multivariate calibration models were constructed using the relationships between the concentration data set (concentration data matrix) and absorbance data matrix in the spectral region 200-330 nm. The accuracy and the precision of the proposed chemometric methods were validated by analyzing synthetic mixtures containing the investigated drugs. The recovery results obtained by applying PCR and PLS calibrations to the artificial mixtures were found between 100.0 and 100.7%. Satisfactory results obtained by applying the PLS and PCR methods to both artificial and commercial samples were obtained. The results obtained in this manuscript strongly encourage us to use them for the quality control and the routine analysis of the marketing tablets containing PYR and ISO drugs. Copyright © 2010 John Wiley & Sons, Ltd.
International Nuclear Information System (INIS)
Tsushima, Motoo; Fujii, Shigeki; Yutani, Chikao; Yamamoto, Akira; Naitoh, Hiroaki.
1990-01-01
We evaluated the wall thickening and stenosis rate (ASI), the calcification rate (ACI), and the wall thickening and calcification stenosis rate (SCI) of the lower abdominal aorta calculated by the 12 sector method from simple or enhanced computed tomography. The intra-observer variation of the calculation of ASI was 5.7% and that of ACI was 2.4%. In 9 patients who underwent an autopsy examination, ACI was significantly correlated with the rate of the calcification dimension to the whole objective area of the abdominal aorta (r=0.856, p<0.01). However, there were no correlations between ASI and the surface involvement or the atherosclerotic index obtained by the point-counting method of the autopsy materials. In the analysis of 40 patients with atherosclerotic vascular diseases, ASI and ACI were also highly correlated with the percentage volume of the arterial wall in relation to the whole volume of the observed artery (r=0.852, p<0.0001) and also the percentage calcification volume (r=0.913, p<0.0001) calculated by the computed method, respectively. The percentage of atherosclerotic vascular diseases increased in the group of both high ASI (over 10%) and high ACI (over 20%). We used SCI as a reliable index when the progression and regression of atherosclerosis was considered. Among patients of hypercholesterolemia consisting of 15 with familial hypercholesterolemia (FH) and 6 non-FH patients, the change of SCI (d-SCI) was significantly correlated with the change of total cholesterol concentration (d-TC) after the treatment (r=0.466, p<0.05) and the change of the right Achilles' tendon thickening (d-ATT) was also correlated with d-TC (r=0.634, p<0.005). However, no correlation between d-SCI and d-ATT was observed. In conclusion, CT indices of atherosclerosis were useful as a noninvasive quantitative diagnostic method and we were able to use them to assess the progression and regression of atherosclerosis. (author)
Learning-Based Adaptive Imputation Methodwith kNN Algorithm for Missing Power Data
Directory of Open Access Journals (Sweden)
Minkyung Kim
2017-10-01
Full Text Available This paper proposes a learning-based adaptive imputation method (LAI for imputing missing power data in an energy system. This method estimates the missing power data by using the pattern that appears in the collected data. Here, in order to capture the patterns from past power data, we newly model a feature vector by using past data and its variations. The proposed LAI then learns the optimal length of the feature vector and the optimal historical length, which are significant hyper parameters of the proposed method, by utilizing intentional missing data. Based on a weighted distance between feature vectors representing a missing situation and past situation, missing power data are estimated by referring to the k most similar past situations in the optimal historical length. We further extend the proposed LAI to alleviate the effect of unexpected variation in power data and refer to this new approach as the extended LAI method (eLAI. The eLAI selects a method between linear interpolation (LI and the proposed LAI to improve accuracy under unexpected variations. Finally, from a simulation under various energy consumption profiles, we verify that the proposed eLAI achieves about a 74% reduction of the average imputation error in an energy system, compared to the existing imputation methods.
Reflexion on linear regression trip production modelling method for ensuring good model quality
Suprayitno, Hitapriya; Ratnasari, Vita
2017-11-01
Transport Modelling is important. For certain cases, the conventional model still has to be used, in which having a good trip production model is capital. A good model can only be obtained from a good sample. Two of the basic principles of a good sampling is having a sample capable to represent the population characteristics and capable to produce an acceptable error at a certain confidence level. It seems that this principle is not yet quite understood and used in trip production modeling. Therefore, investigating the Trip Production Modelling practice in Indonesia and try to formulate a better modeling method for ensuring the Model Quality is necessary. This research result is presented as follows. Statistics knows a method to calculate span of prediction value at a certain confidence level for linear regression, which is called Confidence Interval of Predicted Value. The common modeling practice uses R2 as the principal quality measure, the sampling practice varies and not always conform to the sampling principles. An experiment indicates that small sample is already capable to give excellent R2 value and sample composition can significantly change the model. Hence, good R2 value, in fact, does not always mean good model quality. These lead to three basic ideas for ensuring good model quality, i.e. reformulating quality measure, calculation procedure, and sampling method. A quality measure is defined as having a good R2 value and a good Confidence Interval of Predicted Value. Calculation procedure must incorporate statistical calculation method and appropriate statistical tests needed. A good sampling method must incorporate random well distributed stratified sampling with a certain minimum number of samples. These three ideas need to be more developed and tested.
Directory of Open Access Journals (Sweden)
Nina L. Timofeeva
2014-01-01
Full Text Available The article presents the methodological and technical bases for the creation of regression models that adequately reflect reality. The focus is on methods of removing residual autocorrelation in models. Algorithms eliminating heteroscedasticity and autocorrelation of the regression model residuals: reweighted least squares method, the method of Cochran-Orkutta are given. A model of "pure" regression is build, as well as to compare the effect on the dependent variable of the different explanatory variables when the latter are expressed in different units, a standardized form of the regression equation. The scheme of abatement techniques of heteroskedasticity and autocorrelation for the creation of regression models specific to the social and cultural sphere is developed.
2017-12-01
Fig. 2 Simulation method; the process for one iteration of the simulation . It was repeated 250 times per combination of HR and FAR. Analysis was...distribution is unlimited. 8 Fig. 2 Simulation method; the process for one iteration of the simulation . It was repeated 250 times per combination of HR...stimuli. Simulations show that this regression method results in an unbiased and accurate estimate of target detection performance. The regression
Isa, Zakiah Mohd; Tawfiq, Omar Farouq; Noor, Norliza Mohd; Shamsudheen, Mohd Iqbal; Rijal, Omar Mohd
2010-03-01
In rehabilitating edentulous patients, selecting appropriately sized teeth in the absence of preextraction records is problematic. The purpose of this study was to investigate the relationships between some facial dimensions and widths of the maxillary anterior teeth to potentially provide a guide for tooth selection. Sixty full dentate Malaysian adults (18-36 years) representing 2 ethnic groups (Malay and Chinese), with well aligned maxillary anterior teeth and minimal attrition, participated in this study. Standardized digital images of the face, viewed frontally, were recorded. Using image analyzing software, the images were used to determine the interpupillary distance (IPD), inner canthal distance (ICD), and interalar width (IA). Widths of the 6 maxillary anterior teeth were measured directly from casts of the subjects using digital calipers. Regression analyses were conducted to measure the strength of the associations between the variables (alpha=.10). The means (standard deviations) of IPD, IA, and ICD of the subjects were 62.28 (2.47), 39.36 (3.12), and 34.36 (2.15) mm, respectively. The mesiodistal diameters of the maxillary central incisors, lateral incisors, and canines were 8.54 (0.50), 7.09 (0.48), and 7.94 (0.40) mm, respectively. The width of the central incisors was highly correlated to the IPD (r=0.99), while the widths of the lateral incisors and canines were highly correlated to a combination of IPD and IA (r=0.99 and 0.94, respectively). Using regression methods, the widths of the anterior teeth within the population tested may be predicted by a combination of the facial dimensions studied. (c) 2010 The Editorial Council of the Journal of Prosthetic Dentistry. Published by Mosby, Inc. All rights reserved.
Delwiche, Stephen R; Reeves, James B
2010-01-01
In multivariate regression analysis of spectroscopy data, spectral preprocessing is often performed to reduce unwanted background information (offsets, sloped baselines) or accentuate absorption features in intrinsically overlapping bands. These procedures, also known as pretreatments, are commonly smoothing operations or derivatives. While such operations are often useful in reducing the number of latent variables of the actual decomposition and lowering residual error, they also run the risk of misleading the practitioner into accepting calibration equations that are poorly adapted to samples outside of the calibration. The current study developed a graphical method to examine this effect on partial least squares (PLS) regression calibrations of near-infrared (NIR) reflection spectra of ground wheat meal with two analytes, protein content and sodium dodecyl sulfate sedimentation (SDS) volume (an indicator of the quantity of the gluten proteins that contribute to strong doughs). These two properties were chosen because of their differing abilities to be modeled by NIR spectroscopy: excellent for protein content, fair for SDS sedimentation volume. To further demonstrate the potential pitfalls of preprocessing, an artificial component, a randomly generated value, was included in PLS regression trials. Savitzky-Golay (digital filter) smoothing, first-derivative, and second-derivative preprocess functions (5 to 25 centrally symmetric convolution points, derived from quadratic polynomials) were applied to PLS calibrations of 1 to 15 factors. The results demonstrated the danger of an over reliance on preprocessing when (1) the number of samples used in a multivariate calibration is low (<50), (2) the spectral response of the analyte is weak, and (3) the goodness of the calibration is based on the coefficient of determination (R(2)) rather than a term based on residual error. The graphical method has application to the evaluation of other preprocess functions and various
Improved Correction of Misclassification Bias With Bootstrap Imputation.
van Walraven, Carl
2018-07-01
Diagnostic codes used in administrative database research can create bias due to misclassification. Quantitative bias analysis (QBA) can correct for this bias, requires only code sensitivity and specificity, but may return invalid results. Bootstrap imputation (BI) can also address misclassification bias but traditionally requires multivariate models to accurately estimate disease probability. This study compared misclassification bias correction using QBA and BI. Serum creatinine measures were used to determine severe renal failure status in 100,000 hospitalized patients. Prevalence of severe renal failure in 86 patient strata and its association with 43 covariates was determined and compared with results in which renal failure status was determined using diagnostic codes (sensitivity 71.3%, specificity 96.2%). Differences in results (misclassification bias) were then corrected with QBA or BI (using progressively more complex methods to estimate disease probability). In total, 7.4% of patients had severe renal failure. Imputing disease status with diagnostic codes exaggerated prevalence estimates [median relative change (range), 16.6% (0.8%-74.5%)] and its association with covariates [median (range) exponentiated absolute parameter estimate difference, 1.16 (1.01-2.04)]. QBA produced invalid results 9.3% of the time and increased bias in estimates of both disease prevalence and covariate associations. BI decreased misclassification bias with increasingly accurate disease probability estimates. QBA can produce invalid results and increase misclassification bias. BI avoids invalid results and can importantly decrease misclassification bias when accurate disease probability estimates are used.
Kabisch, Maria; Hamann, Ute; Lorenzo Bermejo, Justo
2017-10-17
Genotypes not directly measured in genetic studies are often imputed to improve statistical power and to increase mapping resolution. The accuracy of standard imputation techniques strongly depends on the similarity of linkage disequilibrium (LD) patterns in the study and reference populations. Here we develop a novel approach for genotype imputation in low-recombination regions that relies on the coalescent and permits to explicitly account for population demographic factors. To test the new method, study and reference haplotypes were simulated and gene trees were inferred under the basic coalescent and also considering population growth and structure. The reference haplotypes that first coalesced with study haplotypes were used as templates for genotype imputation. Computer simulations were complemented with the analysis of real data. Genotype concordance rates were used to compare the accuracies of coalescent-based and standard (IMPUTE2) imputation. Simulations revealed that, in LD-blocks, imputation accuracy relying on the basic coalescent was higher and less variable than with IMPUTE2. Explicit consideration of population growth and structure, even if present, did not practically improve accuracy. The advantage of coalescent-based over standard imputation increased with the minor allele frequency and it decreased with population stratification. Results based on real data indicated that, even in low-recombination regions, further research is needed to incorporate recombination in coalescence inference, in particular for studies with genetically diverse and admixed individuals. To exploit the full potential of coalescent-based methods for the imputation of missing genotypes in genetic studies, further methodological research is needed to reduce computer time, to take into account recombination, and to implement these methods in user-friendly computer programs. Here we provide reproducible code which takes advantage of publicly available software to facilitate
Combining Fourier and lagged k-nearest neighbor imputation for biomedical time series data.
Rahman, Shah Atiqur; Huang, Yuxiao; Claassen, Jan; Heintzman, Nathaniel; Kleinberg, Samantha
2015-12-01
Most clinical and biomedical data contain missing values. A patient's record may be split across multiple institutions, devices may fail, and sensors may not be worn at all times. While these missing values are often ignored, this can lead to bias and error when the data are mined. Further, the data are not simply missing at random. Instead the measurement of a variable such as blood glucose may depend on its prior values as well as that of other variables. These dependencies exist across time as well, but current methods have yet to incorporate these temporal relationships as well as multiple types of missingness. To address this, we propose an imputation method (FLk-NN) that incorporates time lagged correlations both within and across variables by combining two imputation methods, based on an extension to k-NN and the Fourier transform. This enables imputation of missing values even when all data at a time point is missing and when there are different types of missingness both within and across variables. In comparison to other approaches on three biological datasets (simulated and actual Type 1 diabetes datasets, and multi-modality neurological ICU monitoring) the proposed method has the highest imputation accuracy. This was true for up to half the data being missing and when consecutive missing values are a significant fraction of the overall time series length. Copyright © 2015 Elsevier Inc. All rights reserved.
Whole-genome regression and prediction methods applied to plant and animal breeding
Los Campos, De G.; Hickey, J.M.; Pong-Wong, R.; Daetwyler, H.D.; Calus, M.P.L.
2013-01-01
Genomic-enabled prediction is becoming increasingly important in animal and plant breeding, and is also receiving attention in human genetics. Deriving accurate predictions of complex traits requires implementing whole-genome regression (WGR) models where phenotypes are regressed on thousands of
Modelling infant mortality rate in Central Java, Indonesia use generalized poisson regression method
Prahutama, Alan; Sudarno
2018-05-01
The infant mortality rate is the number of deaths under one year of age occurring among the live births in a given geographical area during a given year, per 1,000 live births occurring among the population of the given geographical area during the same year. This problem needs to be addressed because it is an important element of a country’s economic development. High infant mortality rate will disrupt the stability of a country as it relates to the sustainability of the population in the country. One of regression model that can be used to analyze the relationship between dependent variable Y in the form of discrete data and independent variable X is Poisson regression model. Recently The regression modeling used for data with dependent variable is discrete, among others, poisson regression, negative binomial regression and generalized poisson regression. In this research, generalized poisson regression modeling gives better AIC value than poisson regression. The most significant variable is the Number of health facilities (X1), while the variable that gives the most influence to infant mortality rate is the average breastfeeding (X9).
Selecting minimum dataset soil variables using PLSR as a regressive multivariate method
Stellacci, Anna Maria; Armenise, Elena; Castellini, Mirko; Rossi, Roberta; Vitti, Carolina; Leogrande, Rita; De Benedetto, Daniela; Ferrara, Rossana M.; Vivaldi, Gaetano A.
2017-04-01
Long-term field experiments and science-based tools that characterize soil status (namely the soil quality indices, SQIs) assume a strategic role in assessing the effect of agronomic techniques and thus in improving soil management especially in marginal environments. Selecting key soil variables able to best represent soil status is a critical step for the calculation of SQIs. Current studies show the effectiveness of statistical methods for variable selection to extract relevant information deriving from multivariate datasets. Principal component analysis (PCA) has been mainly used, however supervised multivariate methods and regressive techniques are progressively being evaluated (Armenise et al., 2013; de Paul Obade et al., 2016; Pulido Moncada et al., 2014). The present study explores the effectiveness of partial least square regression (PLSR) in selecting critical soil variables, using a dataset comparing conventional tillage and sod-seeding on durum wheat. The results were compared to those obtained using PCA and stepwise discriminant analysis (SDA). The soil data derived from a long-term field experiment in Southern Italy. On samples collected in April 2015, the following set of variables was quantified: (i) chemical: total organic carbon and nitrogen (TOC and TN), alkali-extractable C (TEC and humic substances - HA-FA), water extractable N and organic C (WEN and WEOC), Olsen extractable P, exchangeable cations, pH and EC; (ii) physical: texture, dry bulk density (BD), macroporosity (Pmac), air capacity (AC), and relative field capacity (RFC); (iii) biological: carbon of the microbial biomass quantified with the fumigation-extraction method. PCA and SDA were previously applied to the multivariate dataset (Stellacci et al., 2016). PLSR was carried out on mean centered and variance scaled data of predictors (soil variables) and response (wheat yield) variables using the PLS procedure of SAS/STAT. In addition, variable importance for projection (VIP
EPMLR: sequence-based linear B-cell epitope prediction method using multiple linear regression.
Lian, Yao; Ge, Meng; Pan, Xian-Ming
2014-12-19
B-cell epitopes have been studied extensively due to their immunological applications, such as peptide-based vaccine development, antibody production, and disease diagnosis and therapy. Despite several decades of research, the accurate prediction of linear B-cell epitopes has remained a challenging task. In this work, based on the antigen's primary sequence information, a novel linear B-cell epitope prediction model was developed using the multiple linear regression (MLR). A 10-fold cross-validation test on a large non-redundant dataset was performed to evaluate the performance of our model. To alleviate the problem caused by the noise of negative dataset, 300 experiments utilizing 300 sub-datasets were performed. We achieved overall sensitivity of 81.8%, precision of 64.1% and area under the receiver operating characteristic curve (AUC) of 0.728. We have presented a reliable method for the identification of linear B cell epitope using antigen's primary sequence information. Moreover, a web server EPMLR has been developed for linear B-cell epitope prediction: http://www.bioinfo.tsinghua.edu.cn/epitope/EPMLR/ .
Lazar, Cosmin; Gatto, Laurent; Ferro, Myriam; Bruley, Christophe; Burger, Thomas
2016-04-01
Missing values are a genuine issue in label-free quantitative proteomics. Recent works have surveyed the different statistical methods to conduct imputation and have compared them on real or simulated data sets and recommended a list of missing value imputation methods for proteomics application. Although insightful, these comparisons do not account for two important facts: (i) depending on the proteomics data set, the missingness mechanism may be of different natures and (ii) each imputation method is devoted to a specific type of missingness mechanism. As a result, we believe that the question at stake is not to find the most accurate imputation method in general but instead the most appropriate one. We describe a series of comparisons that support our views: For instance, we show that a supposedly "under-performing" method (i.e., giving baseline average results), if applied at the "appropriate" time in the data-processing pipeline (before or after peptide aggregation) on a data set with the "appropriate" nature of missing values, can outperform a blindly applied, supposedly "better-performing" method (i.e., the reference method from the state-of-the-art). This leads us to formulate few practical guidelines regarding the choice and the application of an imputation method in a proteomics context.
Jiao, S; Tiezzi, F; Huang, Y; Gray, K A; Maltecca, C
2016-02-01
Obtaining accurate individual feed intake records is the key first step in achieving genetic progress toward more efficient nutrient utilization in pigs. Feed intake records collected by electronic feeding systems contain errors (erroneous and abnormal values exceeding certain cutoff criteria), which are due to feeder malfunction or animal-feeder interaction. In this study, we examined the use of a novel data-editing strategy involving multiple imputation to minimize the impact of errors and missing values on the quality of feed intake data collected by an electronic feeding system. Accuracy of feed intake data adjustment obtained from the conventional linear mixed model (LMM) approach was compared with 2 alternative implementations of multiple imputation by chained equation, denoted as MI (multiple imputation) and MICE (multiple imputation by chained equation). The 3 methods were compared under 3 scenarios, where 5, 10, and 20% feed intake error rates were simulated. Each of the scenarios was replicated 5 times. Accuracy of the alternative error adjustment was measured as the correlation between the true daily feed intake (DFI; daily feed intake in the testing period) or true ADFI (the mean DFI across testing period) and the adjusted DFI or adjusted ADFI. In the editing process, error cutoff criteria are used to define if a feed intake visit contains errors. To investigate the possibility that the error cutoff criteria may affect any of the 3 methods, the simulation was repeated with 2 alternative error cutoff values. Multiple imputation methods outperformed the LMM approach in all scenarios with mean accuracies of 96.7, 93.5, and 90.2% obtained with MI and 96.8, 94.4, and 90.1% obtained with MICE compared with 91.0, 82.6, and 68.7% using LMM for DFI. Similar results were obtained for ADFI. Furthermore, multiple imputation methods consistently performed better than LMM regardless of the cutoff criteria applied to define errors. In conclusion, multiple imputation
Nonparametric autocovariance estimation from censored time series by Gaussian imputation.
Park, Jung Wook; Genton, Marc G; Ghosh, Sujit K
2009-02-01
One of the most frequently used methods to model the autocovariance function of a second-order stationary time series is to use the parametric framework of autoregressive and moving average models developed by Box and Jenkins. However, such parametric models, though very flexible, may not always be adequate to model autocovariance functions with sharp changes. Furthermore, if the data do not follow the parametric model and are censored at a certain value, the estimation results may not be reliable. We develop a Gaussian imputation method to estimate an autocovariance structure via nonparametric estimation of the autocovariance function in order to address both censoring and incorrect model specification. We demonstrate the effectiveness of the technique in terms of bias and efficiency with simulations under various rates of censoring and underlying models. We describe its application to a time series of silicon concentrations in the Arctic.
Ondeck, Nathaniel T; Fu, Michael C; Skrip, Laura A; McLynn, Ryan P; Cui, Jonathan J; Basques, Bryce A; Albert, Todd J; Grauer, Jonathan N
2018-04-09
The presence of missing data is a limitation of large datasets, including the National Surgical Quality Improvement Program (NSQIP). In addressing this issue, most studies use complete case analysis, which excludes cases with missing data, thus potentially introducing selection bias. Multiple imputation, a statistically rigorous approach that approximates missing data and preserves sample size, may be an improvement over complete case analysis. The present study aims to evaluate the impact of using multiple imputation in comparison with complete case analysis for assessing the associations between preoperative laboratory values and adverse outcomes following anterior cervical discectomy and fusion (ACDF) procedures. This is a retrospective review of prospectively collected data. Patients undergoing one-level ACDF were identified in NSQIP 2012-2015. Perioperative adverse outcome variables assessed included the occurrence of any adverse event, severe adverse events, and hospital readmission. Missing preoperative albumin and hematocrit values were handled using complete case analysis and multiple imputation. These preoperative laboratory levels were then tested for associations with 30-day postoperative outcomes using logistic regression. A total of 11,999 patients were included. Of this cohort, 63.5% of patients had missing preoperative albumin and 9.9% had missing preoperative hematocrit. When using complete case analysis, only 4,311 patients were studied. The removed patients were significantly younger, healthier, of a common body mass index, and male. Logistic regression analysis failed to identify either preoperative hypoalbuminemia or preoperative anemia as significantly associated with adverse outcomes. When employing multiple imputation, all 11,999 patients were included. Preoperative hypoalbuminemia was significantly associated with the occurrence of any adverse event and severe adverse events. Preoperative anemia was significantly associated with the
Functional regression method for whole genome eQTL epistasis analysis with sequencing data.
Xu, Kelin; Jin, Li; Xiong, Momiao
2017-05-18
Epistasis plays an essential rule in understanding the regulation mechanisms and is an essential component of the genetic architecture of the gene expressions. However, interaction analysis of gene expressions remains fundamentally unexplored due to great computational challenges and data availability. Due to variation in splicing, transcription start sites, polyadenylation sites, post-transcriptional RNA editing across the entire gene, and transcription rates of the cells, RNA-seq measurements generate large expression variability and collectively create the observed position level read count curves. A single number for measuring gene expression which is widely used for microarray measured gene expression analysis is highly unlikely to sufficiently account for large expression variation across the gene. Simultaneously analyzing epistatic architecture using the RNA-seq and whole genome sequencing (WGS) data poses enormous challenges. We develop a nonlinear functional regression model (FRGM) with functional responses where the position-level read counts within a gene are taken as a function of genomic position, and functional predictors where genotype profiles are viewed as a function of genomic position, for epistasis analysis with RNA-seq data. Instead of testing the interaction of all possible pair-wises SNPs, the FRGM takes a gene as a basic unit for epistasis analysis, which tests for the interaction of all possible pairs of genes and use all the information that can be accessed to collectively test interaction between all possible pairs of SNPs within two genome regions. By large-scale simulations, we demonstrate that the proposed FRGM for epistasis analysis can achieve the correct type 1 error and has higher power to detect the interactions between genes than the existing methods. The proposed methods are applied to the RNA-seq and WGS data from the 1000 Genome Project. The numbers of pairs of significantly interacting genes after Bonferroni correction
Directory of Open Access Journals (Sweden)
Sergei Vladimirovich Varaksin
2017-06-01
Full Text Available Purpose. Construction of a mathematical model of the dynamics of childbearing change in the Altai region in 2000–2016, analysis of the dynamics of changes in birth rates for multiple age categories of women of childbearing age. Methodology. A auxiliary analysis element is the construction of linear mathematical models of the dynamics of childbearing by using fuzzy linear regression method based on fuzzy numbers. Fuzzy linear regression is considered as an alternative to standard statistical linear regression for short time series and unknown distribution law. The parameters of fuzzy linear and standard statistical regressions for childbearing time series were defined with using the built in language MatLab algorithm. Method of fuzzy linear regression is not used in sociological researches yet. Results. There are made the conclusions about the socio-demographic changes in society, the high efficiency of the demographic policy of the leadership of the region and the country, and the applicability of the method of fuzzy linear regression for sociological analysis.
Kmetic, Andrew; Joseph, Lawrence; Berger, Claudie; Tenenhouse, Alan
2002-07-01
Nonresponse bias is a concern in any epidemiologic survey in which a subset of selected individuals declines to participate. We reviewed multiple imputation, a widely applicable and easy to implement Bayesian methodology to adjust for nonresponse bias. To illustrate the method, we used data from the Canadian Multicentre Osteoporosis Study, a large cohort study of 9423 randomly selected Canadians, designed in part to estimate the prevalence of osteoporosis. Although subjects were randomly selected, only 42% of individuals who were contacted agreed to participate fully in the study. The study design included a brief questionnaire for those invitees who declined further participation in order to collect information on the major risk factors for osteoporosis. These risk factors (which included age, sex, previous fractures, family history of osteoporosis, and current smoking status) were then used to estimate the missing osteoporosis status for nonparticipants using multiple imputation. Both ignorable and nonignorable imputation models are considered. Our results suggest that selection bias in the study is of concern, but only slightly, in very elderly (age 80+ years), both women and men. Epidemiologists should consider using multiple imputation more often than is current practice.
International Nuclear Information System (INIS)
Wang Weida; Xia Junding; Zhou Zhixin; Leung, P.L.
2001-01-01
Thermoluminescence (TL) dating using a regression method of saturating exponential in pre-dose technique was described. 23 porcelain samples from past dynasties of China were dated by this method. The results show that the TL ages are in reasonable agreement with archaeological dates within a standard deviation of 27%. Such error can be accepted in porcelain dating
RIDDLE: Race and ethnicity Imputation from Disease history with Deep LEarning.
Directory of Open Access Journals (Sweden)
Ji-Sung Kim
2018-04-01
Full Text Available Anonymized electronic medical records are an increasingly popular source of research data. However, these datasets often lack race and ethnicity information. This creates problems for researchers modeling human disease, as race and ethnicity are powerful confounders for many health exposures and treatment outcomes; race and ethnicity are closely linked to population-specific genetic variation. We showed that deep neural networks generate more accurate estimates for missing racial and ethnic information than competing methods (e.g., logistic regression, random forest, support vector machines, and gradient-boosted decision trees. RIDDLE yielded significantly better classification performance across all metrics that were considered: accuracy, cross-entropy loss (error, precision, recall, and area under the curve for receiver operating characteristic plots (all p < 10-9. We made specific efforts to interpret the trained neural network models to identify, quantify, and visualize medical features which are predictive of race and ethnicity. We used these characterizations of informative features to perform a systematic comparison of differential disease patterns by race and ethnicity. The fact that clinical histories are informative for imputing race and ethnicity could reflect (1 a skewed distribution of blue- and white-collar professions across racial and ethnic groups, (2 uneven accessibility and subjective importance of prophylactic health, (3 possible variation in lifestyle, such as dietary habits, and (4 differences in background genetic variation which predispose to diseases.
RIDDLE: Race and ethnicity Imputation from Disease history with Deep LEarning
Kim, Ji-Sung
2018-04-26
Anonymized electronic medical records are an increasingly popular source of research data. However, these datasets often lack race and ethnicity information. This creates problems for researchers modeling human disease, as race and ethnicity are powerful confounders for many health exposures and treatment outcomes; race and ethnicity are closely linked to population-specific genetic variation. We showed that deep neural networks generate more accurate estimates for missing racial and ethnic information than competing methods (e.g., logistic regression, random forest, support vector machines, and gradient-boosted decision trees). RIDDLE yielded significantly better classification performance across all metrics that were considered: accuracy, cross-entropy loss (error), precision, recall, and area under the curve for receiver operating characteristic plots (all p < 10-9). We made specific efforts to interpret the trained neural network models to identify, quantify, and visualize medical features which are predictive of race and ethnicity. We used these characterizations of informative features to perform a systematic comparison of differential disease patterns by race and ethnicity. The fact that clinical histories are informative for imputing race and ethnicity could reflect (1) a skewed distribution of blue- and white-collar professions across racial and ethnic groups, (2) uneven accessibility and subjective importance of prophylactic health, (3) possible variation in lifestyle, such as dietary habits, and (4) differences in background genetic variation which predispose to diseases.
Two-pass imputation algorithm for missing value estimation in gene expression time series.
Tsiporkova, Elena; Boeva, Veselka
2007-10-01
Gene expression microarray experiments frequently generate datasets with multiple values missing. However, most of the analysis, mining, and classification methods for gene expression data require a complete matrix of gene array values. Therefore, the accurate estimation of missing values in such datasets has been recognized as an important issue, and several imputation algorithms have already been proposed to the biological community. Most of these approaches, however, are not particularly suitable for time series expression profiles. In view of this, we propose a novel imputation algorithm, which is specially suited for the estimation of missing values in gene expression time series data. The algorithm utilizes Dynamic Time Warping (DTW) distance in order to measure the similarity between time expression profiles, and subsequently selects for each gene expression profile with missing values a dedicated set of candidate profiles for estimation. Three different DTW-based imputation (DTWimpute) algorithms have been considered: position-wise, neighborhood-wise, and two-pass imputation. These have initially been prototyped in Perl, and their accuracy has been evaluated on yeast expression time series data using several different parameter settings. The experiments have shown that the two-pass algorithm consistently outperforms, in particular for datasets with a higher level of missing entries, the neighborhood-wise and the position-wise algorithms. The performance of the two-pass DTWimpute algorithm has further been benchmarked against the weighted K-Nearest Neighbors algorithm, which is widely used in the biological community; the former algorithm has appeared superior to the latter one. Motivated by these findings, indicating clearly the added value of the DTW techniques for missing value estimation in time series data, we have built an optimized C++ implementation of the two-pass DTWimpute algorithm. The software also provides for a choice between three different
The analysis of survival data in nephrology: basic concepts and methods of Cox regression
van Dijk, Paul C.; Jager, Kitty J.; Zwinderman, Aeilko H.; Zoccali, Carmine; Dekker, Friedo W.
2008-01-01
How much does the survival of one group differ from the survival of another group? How do differences in age in these two groups affect such a comparison? To obtain a quantity to compare the survival of different patient groups and to account for confounding effects, a multiple regression technique
Estimating traffic volume on Wyoming low volume roads using linear and logistic regression methods
Directory of Open Access Journals (Sweden)
Dick Apronti
2016-12-01
Full Text Available Traffic volume is an important parameter in most transportation planning applications. Low volume roads make up about 69% of road miles in the United States. Estimating traffic on the low volume roads is a cost-effective alternative to taking traffic counts. This is because traditional traffic counts are expensive and impractical for low priority roads. The purpose of this paper is to present the development of two alternative means of cost-effectively estimating traffic volumes for low volume roads in Wyoming and to make recommendations for their implementation. The study methodology involves reviewing existing studies, identifying data sources, and carrying out the model development. The utility of the models developed were then verified by comparing actual traffic volumes to those predicted by the model. The study resulted in two regression models that are inexpensive and easy to implement. The first regression model was a linear regression model that utilized pavement type, access to highways, predominant land use types, and population to estimate traffic volume. In verifying the model, an R2 value of 0.64 and a root mean square error of 73.4% were obtained. The second model was a logistic regression model that identified the level of traffic on roads using five thresholds or levels. The logistic regression model was verified by estimating traffic volume thresholds and determining the percentage of roads that were accurately classified as belonging to the given thresholds. For the five thresholds, the percentage of roads classified correctly ranged from 79% to 88%. In conclusion, the verification of the models indicated both model types to be useful for accurate and cost-effective estimation of traffic volumes for low volume Wyoming roads. The models developed were recommended for use in traffic volume estimations for low volume roads in pavement management and environmental impact assessment studies.
Energy Technology Data Exchange (ETDEWEB)
Keilacker, H; Becker, G; Ziegler, M; Gottschling, H D [Zentralinstitut fuer Diabetes, Karlsburg (German Democratic Republic)
1980-10-01
In order to handle all types of radioimmunoassay (RIA) calibration curves obtained in the authors' laboratory in the same way, they tried to find a non-linear expression for their regression which allows calibration curves with different degrees of curvature to be fitted. Considering the two boundary cases of the incubation protocol they derived a hyperbolic inverse regression function: x = a/sub 1/y + a/sub 0/ + asub(-1)y/sup -1/, where x is the total concentration of antigen, asub(i) are constants, and y is the specifically bound radioactivity. An RIA evaluation procedure based on this function is described providing a fitted inverse RIA calibration curve and some statistical quality parameters. The latter are of an order which is normal for RIA systems. There is an excellent agreement between fitted and experimentally obtained calibration curves having a different degree of curvature.
Directory of Open Access Journals (Sweden)
Kamatani Naoyuki
2011-05-01
Full Text Available Abstract Background Use of missing genotype imputations and haplotype reconstructions are valuable in genome-wide association studies (GWASs. By modeling the patterns of linkage disequilibrium in a reference panel, genotypes not directly measured in the study samples can be imputed and used for GWASs. Since millions of single nucleotide polymorphisms need to be imputed in a GWAS, faster methods for genotype imputation and haplotype reconstruction are required. Results We developed a program package for parallel computation of genotype imputation and haplotype reconstruction. Our program package, ParaHaplo 3.0, is intended for use in workstation clusters using the Intel Message Passing Interface. We compared the performance of ParaHaplo 3.0 on the Japanese in Tokyo, Japan and Han Chinese in Beijing, and Chinese in the HapMap dataset. A parallel version of ParaHaplo 3.0 can conduct genotype imputation 20 times faster than a non-parallel version of ParaHaplo. Conclusions ParaHaplo 3.0 is an invaluable tool for conducting haplotype-based GWASs. The need for faster genotype imputation and haplotype reconstruction using parallel computing will become increasingly important as the data sizes of such projects continue to increase. ParaHaplo executable binaries and program sources are available at http://en.sourceforge.jp/projects/parallelgwas/releases/.
Directory of Open Access Journals (Sweden)
Jun Bi
2018-04-01
Full Text Available Battery electric vehicles (BEVs reduce energy consumption and air pollution as compared with conventional vehicles. However, the limited driving range and potential long charging time of BEVs create new problems. Accurate charging time prediction of BEVs helps drivers determine travel plans and alleviate their range anxiety during trips. This study proposed a combined model for charging time prediction based on regression and time-series methods according to the actual data from BEVs operating in Beijing, China. After data analysis, a regression model was established by considering the charged amount for charging time prediction. Furthermore, a time-series method was adopted to calibrate the regression model, which significantly improved the fitting accuracy of the model. The parameters of the model were determined by using the actual data. Verification results confirmed the accuracy of the model and showed that the model errors were small. The proposed model can accurately depict the charging time characteristics of BEVs in Beijing.
Multiple imputation in the presence of non-normal data.
Lee, Katherine J; Carlin, John B
2017-02-20
Multiple imputation (MI) is becoming increasingly popular for handling missing data. Standard approaches for MI assume normality for continuous variables (conditionally on the other variables in the imputation model). However, it is unclear how to impute non-normally distributed continuous variables. Using simulation and a case study, we compared various transformations applied prior to imputation, including a novel non-parametric transformation, to imputation on the raw scale and using predictive mean matching (PMM) when imputing non-normal data. We generated data from a range of non-normal distributions, and set 50% to missing completely at random or missing at random. We then imputed missing values on the raw scale, following a zero-skewness log, Box-Cox or non-parametric transformation and using PMM with both type 1 and 2 matching. We compared inferences regarding the marginal mean of the incomplete variable and the association with a fully observed outcome. We also compared results from these approaches in the analysis of depression and anxiety symptoms in parents of very preterm compared with term-born infants. The results provide novel empirical evidence that the decision regarding how to impute a non-normal variable should be based on the nature of the relationship between the variables of interest. If the relationship is linear in the untransformed scale, transformation can introduce bias irrespective of the transformation used. However, if the relationship is non-linear, it may be important to transform the variable to accurately capture this relationship. A useful alternative is to impute the variable using PMM with type 1 matching. Copyright © 2016 John Wiley & Sons, Ltd. Copyright © 2016 John Wiley & Sons, Ltd.
Peyre, Hugo; Leplège, Alain; Coste, Joël
2011-03-01
Missing items are common in quality of life (QoL) questionnaires and present a challenge for research in this field. It remains unclear which of the various methods proposed to deal with missing data performs best in this context. We compared personal mean score, full information maximum likelihood, multiple imputation, and hot deck techniques using various realistic simulation scenarios of item missingness in QoL questionnaires constructed within the framework of classical test theory. Samples of 300 and 1,000 subjects were randomly drawn from the 2003 INSEE Decennial Health Survey (of 23,018 subjects representative of the French population and having completed the SF-36) and various patterns of missing data were generated according to three different item non-response rates (3, 6, and 9%) and three types of missing data (Little and Rubin's "missing completely at random," "missing at random," and "missing not at random"). The missing data methods were evaluated in terms of accuracy and precision for the analysis of one descriptive and one association parameter for three different scales of the SF-36. For all item non-response rates and types of missing data, multiple imputation and full information maximum likelihood appeared superior to the personal mean score and especially to hot deck in terms of accuracy and precision; however, the use of personal mean score was associated with insignificant bias (relative bias personal mean score appears nonetheless appropriate for dealing with items missing from completed SF-36 questionnaires in most situations of routine use. These results can reasonably be extended to other questionnaires constructed according to classical test theory.
Lusiana, Evellin Dewi
2017-12-01
The parameters of binary probit regression model are commonly estimated by using Maximum Likelihood Estimation (MLE) method. However, MLE method has limitation if the binary data contains separation. Separation is the condition where there are one or several independent variables that exactly grouped the categories in binary response. It will result the estimators of MLE method become non-convergent, so that they cannot be used in modeling. One of the effort to resolve the separation is using Firths approach instead. This research has two aims. First, to identify the chance of separation occurrence in binary probit regression model between MLE method and Firths approach. Second, to compare the performance of binary probit regression model estimator that obtained by MLE method and Firths approach using RMSE criteria. Those are performed using simulation method and under different sample size. The results showed that the chance of separation occurrence in MLE method for small sample size is higher than Firths approach. On the other hand, for larger sample size, the probability decreased and relatively identic between MLE method and Firths approach. Meanwhile, Firths estimators have smaller RMSE than MLEs especially for smaller sample sizes. But for larger sample sizes, the RMSEs are not much different. It means that Firths estimators outperformed MLE estimator.
Regression to fuzziness method for estimation of remaining useful life in power plant components
Alamaniotis, Miltiadis; Grelle, Austin; Tsoukalas, Lefteri H.
2014-10-01
Mitigation of severe accidents in power plants requires the reliable operation of all systems and the on-time replacement of mechanical components. Therefore, the continuous surveillance of power systems is a crucial concern for the overall safety, cost control, and on-time maintenance of a power plant. In this paper a methodology called regression to fuzziness is presented that estimates the remaining useful life (RUL) of power plant components. The RUL is defined as the difference between the time that a measurement was taken and the estimated failure time of that component. The methodology aims to compensate for a potential lack of historical data by modeling an expert's operational experience and expertise applied to the system. It initially identifies critical degradation parameters and their associated value range. Once completed, the operator's experience is modeled through fuzzy sets which span the entire parameter range. This model is then synergistically used with linear regression and a component's failure point to estimate the RUL. The proposed methodology is tested on estimating the RUL of a turbine (the basic electrical generating component of a power plant) in three different cases. Results demonstrate the benefits of the methodology for components for which operational data is not readily available and emphasize the significance of the selection of fuzzy sets and the effect of knowledge representation on the predicted output. To verify the effectiveness of the methodology, it was benchmarked against the data-based simple linear regression model used for predictions which was shown to perform equal or worse than the presented methodology. Furthermore, methodology comparison highlighted the improvement in estimation offered by the adoption of appropriate of fuzzy sets for parameter representation.
Amini, Payam; Maroufizadeh, Saman; Samani, Reza Omani; Hamidi, Omid; Sepidarkish, Mahdi
2017-06-01
Preterm birth (PTB) is a leading cause of neonatal death and the second biggest cause of death in children under five years of age. The objective of this study was to determine the prevalence of PTB and its associated factors using logistic regression and decision tree classification methods. This cross-sectional study was conducted on 4,415 pregnant women in Tehran, Iran, from July 6-21, 2015. Data were collected by a researcher-developed questionnaire through interviews with mothers and review of their medical records. To evaluate the accuracy of the logistic regression and decision tree methods, several indices such as sensitivity, specificity, and the area under the curve were used. The PTB rate was 5.5% in this study. The logistic regression outperformed the decision tree for the classification of PTB based on risk factors. Logistic regression showed that multiple pregnancies, mothers with preeclampsia, and those who conceived with assisted reproductive technology had an increased risk for PTB ( p logistic regression model for the classification of risk groups for PTB.
Significance testing in ridge regression for genetic data
Directory of Open Access Journals (Sweden)
De Iorio Maria
2011-09-01
Full Text Available Abstract Background Technological developments have increased the feasibility of large scale genetic association studies. Densely typed genetic markers are obtained using SNP arrays, next-generation sequencing technologies and imputation. However, SNPs typed using these methods can be highly correlated due to linkage disequilibrium among them, and standard multiple regression techniques fail with these data sets due to their high dimensionality and correlation structure. There has been increasing interest in using penalised regression in the analysis of high dimensional data. Ridge regression is one such penalised regression technique which does not perform variable selection, instead estimating a regression coefficient for each predictor variable. It is therefore desirable to obtain an estimate of the significance of each ridge regression coefficient. Results We develop and evaluate a test of significance for ridge regression coefficients. Using simulation studies, we demonstrate that the performance of the test is comparable to that of a permutation test, with the advantage of a much-reduced computational cost. We introduce the p-value trace, a plot of the negative logarithm of the p-values of ridge regression coefficients with increasing shrinkage parameter, which enables the visualisation of the change in p-value of the regression coefficients with increasing penalisation. We apply the proposed method to a lung cancer case-control data set from EPIC, the European Prospective Investigation into Cancer and Nutrition. Conclusions The proposed test is a useful alternative to a permutation test for the estimation of the significance of ridge regression coefficients, at a much-reduced computational cost. The p-value trace is an informative graphical tool for evaluating the results of a test of significance of ridge regression coefficients as the shrinkage parameter increases, and the proposed test makes its production computationally feasible.
Determination of benzo(apyrene content in PM10 using regression methods
Directory of Open Access Journals (Sweden)
Jacek Gębicki
2015-12-01
Full Text Available The paper presents an attempt of application of multidimensional linear regression to estimation of an empirical model describing the factors influencing on B(aP content in suspended dust PM10 in Olsztyn and Elbląg city regions between 2010 and 2013. During this period annual average concentration of B(aP in PM10 exceeded the admissible level 1.5-3 times. Conducted investigations confirm that the reasons of B(aP concentration increase are low-efficiency individual home heat stations or low-temperature heat sources, which are responsible for so-called low emission during heating period. Dependences between the following quantities were analysed: concentration of PM10 dust in air, air temperature, wind velocity, air humidity. A measure of model fitting to actual B(aP concentration in PM10 was the coefficient of determination of the model. Application of multidimensional linear regression yielded the equations characterized by high values of the coefficient of determination of the model, especially during heating season. This parameter ranged from 0.54 to 0.80 during the analyzed period.
Boucher, Thomas F.; Ozanne, Marie V.; Carmosino, Marco L.; Dyar, M. Darby; Mahadevan, Sridhar; Breves, Elly A.; Lepore, Kate H.; Clegg, Samuel M.
2015-05-01
The ChemCam instrument on the Mars Curiosity rover is generating thousands of LIBS spectra and bringing interest in this technique to public attention. The key to interpreting Mars or any other types of LIBS data are calibrations that relate laboratory standards to unknowns examined in other settings and enable predictions of chemical composition. Here, LIBS spectral data are analyzed using linear regression methods including partial least squares (PLS-1 and PLS-2), principal component regression (PCR), least absolute shrinkage and selection operator (lasso), elastic net, and linear support vector regression (SVR-Lin). These were compared against results from nonlinear regression methods including kernel principal component regression (K-PCR), polynomial kernel support vector regression (SVR-Py) and k-nearest neighbor (kNN) regression to discern the most effective models for interpreting chemical abundances from LIBS spectra of geological samples. The results were evaluated for 100 samples analyzed with 50 laser pulses at each of five locations averaged together. Wilcoxon signed-rank tests were employed to evaluate the statistical significance of differences among the nine models using their predicted residual sum of squares (PRESS) to make comparisons. For MgO, SiO2, Fe2O3, CaO, and MnO, the sparse models outperform all the others except for linear SVR, while for Na2O, K2O, TiO2, and P2O5, the sparse methods produce inferior results, likely because their emission lines in this energy range have lower transition probabilities. The strong performance of the sparse methods in this study suggests that use of dimensionality-reduction techniques as a preprocessing step may improve the performance of the linear models. Nonlinear methods tend to overfit the data and predict less accurately, while the linear methods proved to be more generalizable with better predictive performance. These results are attributed to the high dimensionality of the data (6144 channels
Directory of Open Access Journals (Sweden)
Xiangbing Zhou
2018-04-01
Full Text Available Rapidly growing GPS (Global Positioning System trajectories hide much valuable information, such as city road planning, urban travel demand, and population migration. In order to mine the hidden information and to capture better clustering results, a trajectory regression clustering method (an unsupervised trajectory clustering method is proposed to reduce local information loss of the trajectory and to avoid getting stuck in the local optimum. Using this method, we first define our new concept of trajectory clustering and construct a novel partitioning (angle-based partitioning method of line segments; second, the Lagrange-based method and Hausdorff-based K-means++ are integrated in fuzzy C-means (FCM clustering, which are used to maintain the stability and the robustness of the clustering process; finally, least squares regression model is employed to achieve regression clustering of the trajectory. In our experiment, the performance and effectiveness of our method is validated against real-world taxi GPS data. When comparing our clustering algorithm with the partition-based clustering algorithms (K-means, K-median, and FCM, our experimental results demonstrate that the presented method is more effective and generates a more reasonable trajectory.
A method to determine the necessity for global signal regression in resting-state fMRI studies.
Chen, Gang; Chen, Guangyu; Xie, Chunming; Ward, B Douglas; Li, Wenjun; Antuono, Piero; Li, Shi-Jiang
2012-12-01
In resting-state functional MRI studies, the global signal (operationally defined as the global average of resting-state functional MRI time courses) is often considered a nuisance effect and commonly removed in preprocessing. This global signal regression method can introduce artifacts, such as false anticorrelated resting-state networks in functional connectivity analyses. Therefore, the efficacy of this technique as a correction tool remains questionable. In this article, we establish that the accuracy of the estimated global signal is determined by the level of global noise (i.e., non-neural noise that has a global effect on the resting-state functional MRI signal). When the global noise level is low, the global signal resembles the resting-state functional MRI time courses of the largest cluster, but not those of the global noise. Using real data, we demonstrate that the global signal is strongly correlated with the default mode network components and has biological significance. These results call into question whether or not global signal regression should be applied. We introduce a method to quantify global noise levels. We show that a criteria for global signal regression can be found based on the method. By using the criteria, one can determine whether to include or exclude the global signal regression in minimizing errors in functional connectivity measures. Copyright © 2012 Wiley Periodicals, Inc.
Kolasa-Wiecek, Alicja
2015-04-01
The energy sector in Poland is the source of 81% of greenhouse gas (GHG) emissions. Poland, among other European Union countries, occupies a leading position with regard to coal consumption. Polish energy sector actively participates in efforts to reduce GHG emissions to the atmosphere, through a gradual decrease of the share of coal in the fuel mix and development of renewable energy sources. All evidence which completes the knowledge about issues related to GHG emissions is a valuable source of information. The article presents the results of modeling of GHG emissions which are generated by the energy sector in Poland. For a better understanding of the quantitative relationship between total consumption of primary energy and greenhouse gas emission, multiple stepwise regression model was applied. The modeling results of CO2 emissions demonstrate a high relationship (0.97) with the hard coal consumption variable. Adjustment coefficient of the model to actual data is high and equal to 95%. The backward step regression model, in the case of CH4 emission, indicated the presence of hard coal (0.66), peat and fuel wood (0.34), solid waste fuels, as well as other sources (-0.64) as the most important variables. The adjusted coefficient is suitable and equals R2=0.90. For N2O emission modeling the obtained coefficient of determination is low and equal to 43%. A significant variable influencing the amount of N2O emission is the peat and wood fuel consumption. Copyright © 2015. Published by Elsevier B.V.
Efectivity of Additive Spline for Partial Least Square Method in Regression Model Estimation
Directory of Open Access Journals (Sweden)
Ahmad Bilfarsah
2005-04-01
Full Text Available Additive Spline of Partial Least Square method (ASPL as one generalization of Partial Least Square (PLS method. ASPLS method can be acommodation to non linear and multicollinearity case of predictor variables. As a principle, The ASPLS method approach is cahracterized by two idea. The first is to used parametric transformations of predictors by spline function; the second is to make ASPLS components mutually uncorrelated, to preserve properties of the linear PLS components. The performance of ASPLS compared with other PLS method is illustrated with the fisher economic application especially the tuna fish production.
Spady, Richard; Stouli, Sami
2012-01-01
We propose dual regression as an alternative to the quantile regression process for the global estimation of conditional distribution functions under minimal assumptions. Dual regression provides all the interpretational power of the quantile regression process while avoiding the need for repairing the intersecting conditional quantile surfaces that quantile regression often produces in practice. Our approach introduces a mathematical programming characterization of conditional distribution f...
Olive, David J
2017-01-01
This text covers both multiple linear regression and some experimental design models. The text uses the response plot to visualize the model and to detect outliers, does not assume that the error distribution has a known parametric distribution, develops prediction intervals that work when the error distribution is unknown, suggests bootstrap hypothesis tests that may be useful for inference after variable selection, and develops prediction regions and large sample theory for the multivariate linear regression model that has m response variables. A relationship between multivariate prediction regions and confidence regions provides a simple way to bootstrap confidence regions. These confidence regions often provide a practical method for testing hypotheses. There is also a chapter on generalized linear models and generalized additive models. There are many R functions to produce response and residual plots, to simulate prediction intervals and hypothesis tests, to detect outliers, and to choose response trans...
International Nuclear Information System (INIS)
Sun Zhong-Hua; Jiang Fan
2010-01-01
In this paper a new continuous variable called core-ratio is defined to describe the probability for a residue to be in a binding site, thereby replacing the previous binary description of the interface residue using 0 and 1. So we can use the support vector machine regression method to fit the core-ratio value and predict the protein binding sites. We also design a new group of physical and chemical descriptors to characterize the binding sites. The new descriptors are more effective, with an averaging procedure used. Our test shows that much better prediction results can be obtained by the support vector regression (SVR) method than by the support vector classification method. (rapid communication)
Directory of Open Access Journals (Sweden)
Liyun Su
2012-01-01
Full Text Available We introduce the extension of local polynomial fitting to the linear heteroscedastic regression model. Firstly, the local polynomial fitting is applied to estimate heteroscedastic function, then the coefficients of regression model are obtained by using generalized least squares method. One noteworthy feature of our approach is that we avoid the testing for heteroscedasticity by improving the traditional two-stage method. Due to nonparametric technique of local polynomial estimation, we do not need to know the heteroscedastic function. Therefore, we can improve the estimation precision, when the heteroscedastic function is unknown. Furthermore, we focus on comparison of parameters and reach an optimal fitting. Besides, we verify the asymptotic normality of parameters based on numerical simulations. Finally, this approach is applied to a case of economics, and it indicates that our method is surely effective in finite-sample situations.
Gilstrap, Donald L.
2013-01-01
In addition to qualitative methods presented in chaos and complexity theories in educational research, this article addresses quantitative methods that may show potential for future research studies. Although much in the social and behavioral sciences literature has focused on computer simulations, this article explores current chaos and…
Forecast daily indices of solar activity, F10.7, using support vector regression method
International Nuclear Information System (INIS)
Huang Cong; Liu Dandan; Wang Jingsong
2009-01-01
The 10.7 cm solar radio flux (F10.7), the value of the solar radio emission flux density at a wavelength of 10.7 cm, is a useful index of solar activity as a proxy for solar extreme ultraviolet radiation. It is meaningful and important to predict F10.7 values accurately for both long-term (months-years) and short-term (days) forecasting, which are often used as inputs in space weather models. This study applies a novel neural network technique, support vector regression (SVR), to forecasting daily values of F10.7. The aim of this study is to examine the feasibility of SVR in short-term F10.7 forecasting. The approach, based on SVR, reduces the dimension of feature space in the training process by using a kernel-based learning algorithm. Thus, the complexity of the calculation becomes lower and a small amount of training data will be sufficient. The time series of F10.7 from 2002 to 2006 are employed as the data sets. The performance of the approach is estimated by calculating the norm mean square error and mean absolute percentage error. It is shown that our approach can perform well by using fewer training data points than the traditional neural network. (research paper)
International Nuclear Information System (INIS)
Yang, Jianhong; Yi, Cancan; Xu, Jinwu; Ma, Xianghong
2015-01-01
A new LIBS quantitative analysis method based on analytical line adaptive selection and Relevance Vector Machine (RVM) regression model is proposed. First, a scheme of adaptively selecting analytical line is put forward in order to overcome the drawback of high dependency on a priori knowledge. The candidate analytical lines are automatically selected based on the built-in characteristics of spectral lines, such as spectral intensity, wavelength and width at half height. The analytical lines which will be used as input variables of regression model are determined adaptively according to the samples for both training and testing. Second, an LIBS quantitative analysis method based on RVM is presented. The intensities of analytical lines and the elemental concentrations of certified standard samples are used to train the RVM regression model. The predicted elemental concentration analysis results will be given with a form of confidence interval of probabilistic distribution, which is helpful for evaluating the uncertainness contained in the measured spectra. Chromium concentration analysis experiments of 23 certified standard high-alloy steel samples have been carried out. The multiple correlation coefficient of the prediction was up to 98.85%, and the average relative error of the prediction was 4.01%. The experiment results showed that the proposed LIBS quantitative analysis method achieved better prediction accuracy and better modeling robustness compared with the methods based on partial least squares regression, artificial neural network and standard support vector machine. - Highlights: • Both training and testing samples are considered for analytical lines selection. • The analytical lines are auto-selected based on the built-in characteristics of spectral lines. • The new method can achieve better prediction accuracy and modeling robustness. • Model predictions are given with confidence interval of probabilistic distribution
Directory of Open Access Journals (Sweden)
Sara Mortaz Hejri
2013-01-01
Full Text Available Background: One of the methods used for standard setting is the borderline regression method (BRM. This study aims to assess the reliability of BRM when the pass-fail standard in an objective structured clinical examination (OSCE was calculated by averaging the BRM standards obtained for each station separately. Materials and Methods: In nine stations of the OSCE with direct observation the examiners gave each student a checklist score and a global score. Using a linear regression model for each station, we calculated the checklist score cut-off on the regression equation for the global scale cut-off set at 2. The OSCE pass-fail standard was defined as the average of all station′s standard. To determine the reliability, the root mean square error (RMSE was calculated. The R2 coefficient and the inter-grade discrimination were calculated to assess the quality of OSCE. Results: The mean total test score was 60.78. The OSCE pass-fail standard and its RMSE were 47.37 and 0.55, respectively. The R2 coefficients ranged from 0.44 to 0.79. The inter-grade discrimination score varied greatly among stations. Conclusion: The RMSE of the standard was very small indicating that BRM is a reliable method of setting standard for OSCE, which has the advantage of providing data for quality assurance.
International Nuclear Information System (INIS)
Ballini, J.-P.; Cazes, P.; Turpin, P.-Y.
1976-01-01
Analysing the histogram of anode pulse amplitudes allows a discussion of the hypothesis that has been proposed to account for the statistical processes of secondary multiplication in a photomultiplier. In an earlier work, good agreement was obtained between experimental and reconstructed spectra, assuming a first dynode distribution including two Poisson distributions of distinct mean values. This first approximation led to a search for a method which could give the weights of several Poisson distributions of distinct mean values. Three methods have been briefly exposed: classical linear regression, constraint regression (d'Esopo's method), and regression on variables subject to error. The use of these methods gives an approach of the frequency function which represents the dispersion of the punctual mean gain around the whole first dynode mean gain value. Comparison between this function and the one employed in Polya distribution allows the statement that the latter is inadequate to describe the statistical process of secondary multiplication. Numerous spectra obtained with two kinds of photomultiplier working under different physical conditions have been analysed. Then two points are discussed: - Does the frequency function represent the dynode structure and the interdynode collection process. - Is the model (the multiplication process of all dynodes but the first one, is Poissonian) valid whatever the photomultiplier and the utilization conditions. (Auth.)
Borodachev, S. M.
2016-06-01
The simple derivation of recursive least squares (RLS) method equations is given as special case of Kalman filter estimation of a constant system state under changing observation conditions. A numerical example illustrates application of RLS to multicollinearity problem.
Huang, Lei
2015-01-01
To solve the problem in which the conventional ARMA modeling methods for gyro random noise require a large number of samples and converge slowly, an ARMA modeling method using a robust Kalman filtering is developed. The ARMA model parameters are employed as state arguments. Unknown time-varying estimators of observation noise are used to achieve the estimated mean and variance of the observation noise. Using the robust Kalman filtering, the ARMA model parameters are estimated accurately. The developed ARMA modeling method has the advantages of a rapid convergence and high accuracy. Thus, the required sample size is reduced. It can be applied to modeling applications for gyro random noise in which a fast and accurate ARMA modeling method is required. PMID:26437409
Data Editing and Imputation in Business Surveys Using “R”
Directory of Open Access Journals (Sweden)
Elena Romascanu
2014-06-01
Full Text Available Purpose – Missing data are a recurring problem that can cause bias or lead to inefficient analyses. The objective of this paper is a direct comparison between the two statistical software features R and SPSS, in order to take full advantage of the existing automated methods for data editing process and imputation in business surveys (with a proper design of consistency rules as a partial alternative to the manual editing of data. Approach – The comparison of different methods on editing surveys data, in R with the ‘editrules’ and ‘survey’ packages because inside those, exist commonly used transformations in ofﬁcial statistics, as visualization of missing values pattern using ‘Amelia’ and ‘VIM’ packages, imputation approaches for longitudinal data using ‘VIMGUI’ and a comparison of another statistical software performance on the same features, such as SPSS. Findings – Data on business statistics received by NIS’s (National Institute of Statistics are not ready to be used for direct analysis due to in-record inconsistencies, errors and missing values from the collected data sets. The appropriate automatic methods from R packages, offers the ability to set the erroneous fields in edit-violating records, to verify the results after the imputation of missing values providing for users a flexible, less time consuming approach and easy to perform automation in R than in SPSS Macros syntax situations, when macros are very handy.
Bolarinwa, O A; Adeola, O
2012-12-01
Digestible and metabolizable energy contents of feed ingredients for pigs can be determined by direct or indirect methods. There are situations when only the indirect approach is suitable and the regression method is a robust indirect approach. This study was conducted to compare the direct and regression methods for determining the energy value of wheat for pigs. Twenty-four barrows with an average initial BW of 31 kg were assigned to 4 diets in a randomized complete block design. The 4 diets consisted of 969 g wheat/kg plus minerals and vitamins (sole wheat) for the direct method, corn (Zea mays)-soybean (Glycine max) meal reference diet (RD), RD + 300 g wheat/kg, and RD + 600 g wheat/kg. The 3 corn-soybean meal diets were used for the regression method and wheat replaced the energy-yielding ingredients, corn and soybean meal, so that the same ratio of corn and soybean meal across the experimental diets was maintained. The wheat used was analyzed to contain 883 g DM, 15.2 g N, and 3.94 Mcal GE/kg. Each diet was fed to 6 barrows in individual metabolism crates for a 5-d acclimation followed by a 5-d total but separate collection of feces and urine. The DE and ME for the sole wheat diet were 3.83 and 3.77 Mcal/kg DM, respectively. Because the sole wheat diet contained 969 g wheat/kg, these translate to 3.95 Mcal DE/kg DM and 3.89 Mcal ME/kg DM. The RD used for the regression approach yielded 4.00 Mcal DE and 3.91 Mcal ME/kg DM diet. Increasing levels of wheat in the RD linearly reduced (P direct method (3.95 and 3.89 Mcal/kg DM) did not differ (0.78 < P < 0.89) from those obtained using the regression method (3.96 and 3.88 Mcal/kg DM).
Liou, Jyun-you; Smith, Elliot H.; Bateman, Lisa M.; McKhann, Guy M., II; Goodman, Robert R.; Greger, Bradley; Davis, Tyler S.; Kellis, Spencer S.; House, Paul A.; Schevon, Catherine A.
2017-08-01
Objective. Epileptiform discharges, an electrophysiological hallmark of seizures, can propagate across cortical tissue in a manner similar to traveling waves. Recent work has focused attention on the origination and propagation patterns of these discharges, yielding important clues to their source location and mechanism of travel. However, systematic studies of methods for measuring propagation are lacking. Approach. We analyzed epileptiform discharges in microelectrode array recordings of human seizures. The array records multiunit activity and local field potentials at 400 micron spatial resolution, from a small cortical site free of obstructions. We evaluated several computationally efficient statistical methods for calculating traveling wave velocity, benchmarking them to analyses of associated neuronal burst firing. Main results. Over 90% of discharges met statistical criteria for propagation across the sampled cortical territory. Detection rate, direction and speed estimates derived from a multiunit estimator were compared to four field potential-based estimators: negative peak, maximum descent, high gamma power, and cross-correlation. Interestingly, the methods that were computationally simplest and most efficient (negative peak and maximal descent) offer non-inferior results in predicting neuronal traveling wave velocities compared to the other two, more complex methods. Moreover, the negative peak and maximal descent methods proved to be more robust against reduced spatial sampling challenges. Using least absolute deviation in place of least squares error minimized the impact of outliers, and reduced the discrepancies between local field potential-based and multiunit estimators. Significance. Our findings suggest that ictal epileptiform discharges typically take the form of exceptionally strong, rapidly traveling waves, with propagation detectable across millimeter distances. The sequential activation of neurons in space can be inferred from clinically
2012-01-01
Background Multiple Imputation as usually implemented assumes that data are Missing At Random (MAR), meaning that the underlying missing data mechanism, given the observed data, is independent of the unobserved data. To explore the sensitivity of the inferences to departures from the MAR assumption, we applied the method proposed by Carpenter et al. (2007). This approach aims to approximate inferences under a Missing Not At random (MNAR) mechanism by reweighting estimates obtained after multiple imputation where the weights depend on the assumed degree of departure from the MAR assumption. Methods The method is illustrated with epidemiological data from a surveillance system of hepatitis C virus (HCV) infection in France during the 2001–2007 period. The subpopulation studied included 4343 HCV infected patients who reported drug use. Risk factors for severe liver disease were assessed. After performing complete-case and multiple imputation analyses, we applied the sensitivity analysis to 3 risk factors of severe liver disease: past excessive alcohol consumption, HIV co-infection and infection with HCV genotype 3. Results In these data, the association between severe liver disease and HIV was underestimated, if given the observed data the chance of observing HIV status is high when this is positive. Inference for two other risk factors were robust to plausible local departures from the MAR assumption. Conclusions We have demonstrated the practical utility of, and advocate, a pragmatic widely applicable approach to exploring plausible departures from the MAR assumption post multiple imputation. We have developed guidelines for applying this approach to epidemiological studies. PMID:22681630
Energy Technology Data Exchange (ETDEWEB)
Jabr, R.A. [Electrical, Computer and Communication Engineering Department, Notre Dame University, P.O. Box 72, Zouk Mikhael, Zouk Mosbeh (Lebanon)
2006-02-15
This paper presents an implementation of the least absolute value (LAV) power system state estimator based on obtaining a sequence of solutions to the L{sub 1}-regression problem using an iteratively reweighted least squares (IRLS{sub L1}) method. The proposed implementation avoids reformulating the regression problem into standard linear programming (LP) form and consequently does not require the use of common methods of LP, such as those based on the simplex method or interior-point methods. It is shown that the IRLS{sub L1} method is equivalent to solving a sequence of linear weighted least squares (LS) problems. Thus, its implementation presents little additional effort since the sparse LS solver is common to existing LS state estimators. Studies on the termination criteria of the IRLS{sub L1} method have been carried out to determine a procedure for which the proposed estimator is more computationally efficient than a previously proposed non-linear iteratively reweighted least squares (IRLS) estimator. Indeed, it is revealed that the proposed method is a generalization of the previously reported IRLS estimator, but is based on more rigorous theory. (author)
Energy Technology Data Exchange (ETDEWEB)
Boucher, Thomas F., E-mail: boucher@cs.umass.edu [School of Computer Science, University of Massachusetts Amherst, 140 Governor' s Drive, Amherst, MA 01003, United States. (United States); Ozanne, Marie V. [Department of Astronomy, Mount Holyoke College, South Hadley, MA 01075 (United States); Carmosino, Marco L. [School of Computer Science, University of Massachusetts Amherst, 140 Governor' s Drive, Amherst, MA 01003, United States. (United States); Dyar, M. Darby [Department of Astronomy, Mount Holyoke College, South Hadley, MA 01075 (United States); Mahadevan, Sridhar [School of Computer Science, University of Massachusetts Amherst, 140 Governor' s Drive, Amherst, MA 01003, United States. (United States); Breves, Elly A.; Lepore, Kate H. [Department of Astronomy, Mount Holyoke College, South Hadley, MA 01075 (United States); Clegg, Samuel M. [Los Alamos National Laboratory, P.O. Box 1663, MS J565, Los Alamos, NM 87545 (United States)
2015-05-01
The ChemCam instrument on the Mars Curiosity rover is generating thousands of LIBS spectra and bringing interest in this technique to public attention. The key to interpreting Mars or any other types of LIBS data are calibrations that relate laboratory standards to unknowns examined in other settings and enable predictions of chemical composition. Here, LIBS spectral data are analyzed using linear regression methods including partial least squares (PLS-1 and PLS-2), principal component regression (PCR), least absolute shrinkage and selection operator (lasso), elastic net, and linear support vector regression (SVR-Lin). These were compared against results from nonlinear regression methods including kernel principal component regression (K-PCR), polynomial kernel support vector regression (SVR-Py) and k-nearest neighbor (kNN) regression to discern the most effective models for interpreting chemical abundances from LIBS spectra of geological samples. The results were evaluated for 100 samples analyzed with 50 laser pulses at each of five locations averaged together. Wilcoxon signed-rank tests were employed to evaluate the statistical significance of differences among the nine models using their predicted residual sum of squares (PRESS) to make comparisons. For MgO, SiO{sub 2}, Fe{sub 2}O{sub 3}, CaO, and MnO, the sparse models outperform all the others except for linear SVR, while for Na{sub 2}O, K{sub 2}O, TiO{sub 2}, and P{sub 2}O{sub 5}, the sparse methods produce inferior results, likely because their emission lines in this energy range have lower transition probabilities. The strong performance of the sparse methods in this study suggests that use of dimensionality-reduction techniques as a preprocessing step may improve the performance of the linear models. Nonlinear methods tend to overfit the data and predict less accurately, while the linear methods proved to be more generalizable with better predictive performance. These results are attributed to the high
International Nuclear Information System (INIS)
Boucher, Thomas F.; Ozanne, Marie V.; Carmosino, Marco L.; Dyar, M. Darby; Mahadevan, Sridhar; Breves, Elly A.; Lepore, Kate H.; Clegg, Samuel M.
2015-01-01
The ChemCam instrument on the Mars Curiosity rover is generating thousands of LIBS spectra and bringing interest in this technique to public attention. The key to interpreting Mars or any other types of LIBS data are calibrations that relate laboratory standards to unknowns examined in other settings and enable predictions of chemical composition. Here, LIBS spectral data are analyzed using linear regression methods including partial least squares (PLS-1 and PLS-2), principal component regression (PCR), least absolute shrinkage and selection operator (lasso), elastic net, and linear support vector regression (SVR-Lin). These were compared against results from nonlinear regression methods including kernel principal component regression (K-PCR), polynomial kernel support vector regression (SVR-Py) and k-nearest neighbor (kNN) regression to discern the most effective models for interpreting chemical abundances from LIBS spectra of geological samples. The results were evaluated for 100 samples analyzed with 50 laser pulses at each of five locations averaged together. Wilcoxon signed-rank tests were employed to evaluate the statistical significance of differences among the nine models using their predicted residual sum of squares (PRESS) to make comparisons. For MgO, SiO 2 , Fe 2 O 3 , CaO, and MnO, the sparse models outperform all the others except for linear SVR, while for Na 2 O, K 2 O, TiO 2 , and P 2 O 5 , the sparse methods produce inferior results, likely because their emission lines in this energy range have lower transition probabilities. The strong performance of the sparse methods in this study suggests that use of dimensionality-reduction techniques as a preprocessing step may improve the performance of the linear models. Nonlinear methods tend to overfit the data and predict less accurately, while the linear methods proved to be more generalizable with better predictive performance. These results are attributed to the high dimensionality of the data (6144
Directory of Open Access Journals (Sweden)
Tamer Khatib
2014-01-01
Full Text Available In this research an improved approach for sizing standalone PV system (SAPV is presented. This work is an improved work developed previously by the authors. The previous work is based on the analytical method which faced some concerns regarding the difficulty of finding the model’s coefficients. Therefore, the proposed approach in this research is based on a combination of an analytical method and a machine learning approach for a generalized artificial neural network (GRNN. The GRNN assists to predict the optimal size of a PV system using the geographical coordinates of the targeted site instead of using mathematical formulas. Employing the GRNN facilitates the use of a previously developed method by the authors and avoids some of its drawbacks. The approach has been tested using data from five Malaysian sites. According to the results, the proposed method can be efficiently used for SAPV sizing whereas the proposed GRNN based model predicts the sizing curves of the PV system accurately with a prediction error of 0.6%. Moreover, hourly meteorological and load demand data are used in this research in order to consider the uncertainty of the solar energy and the load demand.
Comparison of Sparse and Jack-knife partial least squares regression methods for variable selection
DEFF Research Database (Denmark)
Karaman, Ibrahim; Qannari, El Mostafa; Martens, Harald
2013-01-01
The objective of this study was to compare two different techniques of variable selection, Sparse PLSR and Jack-knife PLSR, with respect to their predictive ability and their ability to identify relevant variables. Sparse PLSR is a method that is frequently used in genomics, whereas Jack-knife PL...
Using a Linear Regression Method to Detect Outliers in IRT Common Item Equating
He, Yong; Cui, Zhongmin; Fang, Yu; Chen, Hanwei
2013-01-01
Common test items play an important role in equating alternate test forms under the common item nonequivalent groups design. When the item response theory (IRT) method is applied in equating, inconsistent item parameter estimates among common items can lead to large bias in equated scores. It is prudent to evaluate inconsistency in parameter…
Sun, L.G.; De Visser, C.C.; Chu, Q.P.; Mulder, J.A.
2012-01-01
The optimality of the kernel number and kernel centers plays a significant role in determining the approximation power of nearly all kernel methods. However, the process of choosing optimal kernels is always formulated as a global optimization task, which is hard to accomplish. Recently, an
Asghari, Mehdi Poursheikhali; Hayatshahi, Sayyed Hamed Sadat; Abdolmaleki, Parviz
2012-01-01
From both the structural and functional points of view, β-turns play important biological roles in proteins. In the present study, a novel two-stage hybrid procedure has been developed to identify β-turns in proteins. Binary logistic regression was initially used for the first time to select significant sequence parameters in identification of β-turns due to a re-substitution test procedure. Sequence parameters were consisted of 80 amino acid positional occurrences and 20 amino acid percentages in sequence. Among these parameters, the most significant ones which were selected by binary logistic regression model, were percentages of Gly, Ser and the occurrence of Asn in position i+2, respectively, in sequence. These significant parameters have the highest effect on the constitution of a β-turn sequence. A neural network model was then constructed and fed by the parameters selected by binary logistic regression to build a hybrid predictor. The networks have been trained and tested on a non-homologous dataset of 565 protein chains. With applying a nine fold cross-validation test on the dataset, the network reached an overall accuracy (Qtotal) of 74, which is comparable with results of the other β-turn prediction methods. In conclusion, this study proves that the parameter selection ability of binary logistic regression together with the prediction capability of neural networks lead to the development of more precise models for identifying β-turns in proteins.
Directory of Open Access Journals (Sweden)
Xiaoyan Yang
2018-04-01
Full Text Available The Advanced Spaceborne Thermal-Emission and Reflection Radiometer Global Digital Elevation Model (ASTER GDEM is important to a wide range of geographical and environmental studies. Its accuracy, to some extent associated with land-use types reflecting topography, vegetation coverage, and human activities, impacts the results and conclusions of these studies. In order to improve the accuracy of ASTER GDEM prior to its application, we investigated ASTER GDEM errors based on individual land-use types and proposed two linear regression calibration methods, one considering only land use-specific errors and the other considering the impact of both land-use and topography. Our calibration methods were tested on the coastal prefectural city of Lianyungang in eastern China. Results indicate that (1 ASTER GDEM is highly accurate for rice, wheat, grass and mining lands but less accurate for scenic, garden, wood and bare lands; (2 despite improvements in ASTER GDEM2 accuracy, multiple linear regression calibration requires more data (topography and a relatively complex calibration process; (3 simple linear regression calibration proves a practicable and simplified means to systematically investigate and improve the impact of land-use on ASTER GDEM accuracy. Our method is applicable to areas with detailed land-use data based on highly accurate field-based point-elevation measurements.
Imputation of microsatellite alleles from dense SNP genotypes for parental verification
Directory of Open Access Journals (Sweden)
Matthew eMcclure
2012-08-01
Full Text Available Microsatellite (MS markers have recently been used for parental verification and are still the international standard despite higher cost, error rate, and turnaround time compared with Single Nucleotide Polymorphisms (SNP-based assays. Despite domestic and international interest from producers and research communities, no viable means currently exist to verify parentage for an individual unless all familial connections were analyzed using the same DNA marker type (MS or SNP. A simple and cost-effective method was devised to impute MS alleles from SNP haplotypes within breeds. For some MS, imputation results may allow inference across breeds. A total of 347 dairy cattle representing 4 dairy breeds (Brown Swiss, Guernsey, Holstein, and Jersey were used to generate reference haplotypes. This approach has been verified (>98% accurate for imputing the International Society of Animal Genetics (ISAG recommended panel of 12 MS for cattle parentage verification across a validation set of 1,307 dairy animals.. Implementation of this method will allow producers and breed associations to transition to SNP-based parentage verification utilizing MS genotypes from historical data on parents where SNP genotypes are missing. This approach may be applicable to additional cattle breeds and other species that wish to migrate from MS- to SNP- based parental verification.
TRANSPOSABLE REGULARIZED COVARIANCE MODELS WITH AN APPLICATION TO MISSING DATA IMPUTATION.
Allen, Genevera I; Tibshirani, Robert
2010-06-01
Missing data estimation is an important challenge with high-dimensional data arranged in the form of a matrix. Typically this data matrix is transposable , meaning that either the rows, columns or both can be treated as features. To model transposable data, we present a modification of the matrix-variate normal, the mean-restricted matrix-variate normal , in which the rows and columns each have a separate mean vector and covariance matrix. By placing additive penalties on the inverse covariance matrices of the rows and columns, these so called transposable regularized covariance models allow for maximum likelihood estimation of the mean and non-singular covariance matrices. Using these models, we formulate EM-type algorithms for missing data imputation in both the multivariate and transposable frameworks. We present theoretical results exploiting the structure of our transposable models that allow these models and imputation methods to be applied to high-dimensional data. Simulations and results on microarray data and the Netflix data show that these imputation techniques often outperform existing methods and offer a greater degree of flexibility.
Directory of Open Access Journals (Sweden)
Hukharnsusatrue, A.
2005-11-01
Full Text Available The objective of this research is to compare multiple regression coefficients estimating methods with existence of multicollinearity among independent variables. The estimation methods are Ordinary Least Squares method (OLS, Restricted Least Squares method (RLS, Restricted Ridge Regression method (RRR and Restricted Liu method (RL when restrictions are true and restrictions are not true. The study used the Monte Carlo Simulation method. The experiment was repeated 1,000 times under each situation. The analyzed results of the data are demonstrated as follows. CASE 1: The restrictions are true. In all cases, RRR and RL methods have a smaller Average Mean Square Error (AMSE than OLS and RLS method, respectively. RRR method provides the smallest AMSE when the level of correlations is high and also provides the smallest AMSE for all level of correlations and all sample sizes when standard deviation is equal to 5. However, RL method provides the smallest AMSE when the level of correlations is low and middle, except in the case of standard deviation equal to 3, small sample sizes, RRR method provides the smallest AMSE.The AMSE varies with, most to least, respectively, level of correlations, standard deviation and number of independent variables but inversely with to sample size.CASE 2: The restrictions are not true.In all cases, RRR method provides the smallest AMSE, except in the case of standard deviation equal to 1 and error of restrictions equal to 5%, OLS method provides the smallest AMSE when the level of correlations is low or median and there is a large sample size, but the small sample sizes, RL method provides the smallest AMSE. In addition, when error of restrictions is increased, OLS method provides the smallest AMSE for all level, of correlations and all sample sizes, except when the level of correlations is high and sample sizes small. Moreover, the case OLS method provides the smallest AMSE, the most RLS method has a smaller AMSE than
A computer program for uncertainty analysis integrating regression and Bayesian methods
Lu, Dan; Ye, Ming; Hill, Mary C.; Poeter, Eileen P.; Curtis, Gary
2014-01-01
This work develops a new functionality in UCODE_2014 to evaluate Bayesian credible intervals using the Markov Chain Monte Carlo (MCMC) method. The MCMC capability in UCODE_2014 is based on the FORTRAN version of the differential evolution adaptive Metropolis (DREAM) algorithm of Vrugt et al. (2009), which estimates the posterior probability density function of model parameters in high-dimensional and multimodal sampling problems. The UCODE MCMC capability provides eleven prior probability distributions and three ways to initialize the sampling process. It evaluates parametric and predictive uncertainties and it has parallel computing capability based on multiple chains to accelerate the sampling process. This paper tests and demonstrates the MCMC capability using a 10-dimensional multimodal mathematical function, a 100-dimensional Gaussian function, and a groundwater reactive transport model. The use of the MCMC capability is made straightforward and flexible by adopting the JUPITER API protocol. With the new MCMC capability, UCODE_2014 can be used to calculate three types of uncertainty intervals, which all can account for prior information: (1) linear confidence intervals which require linearity and Gaussian error assumptions and typically 10s–100s of highly parallelizable model runs after optimization, (2) nonlinear confidence intervals which require a smooth objective function surface and Gaussian observation error assumptions and typically 100s–1,000s of partially parallelizable model runs after optimization, and (3) MCMC Bayesian credible intervals which require few assumptions and commonly 10,000s–100,000s or more partially parallelizable model runs. Ready access allows users to select methods best suited to their work, and to compare methods in many circumstances.
CSIR Research Space (South Africa)
Gregor, Luke
2017-12-01
Full Text Available understanding with spatially integrated air–sea flux estimates (Fay and McKinley, 2014). Conversely, ocean biogeochemical process models are good tools for mechanis- tic understanding, but fail to represent the seasonality of CO2 fluxes in the Southern Ocean... of including coordinate variables as proxies of 1pCO2 in the empirical methods. In the inter- comparison study by Rödenbeck et al. (2015) proxies typi- cally include, but are not limited to, sea surface temperature (SST), chlorophyll a (Chl a), mixed layer...
Consistency analysis of subspace identification methods based on a linear regression approach
DEFF Research Database (Denmark)
Knudsen, Torben
2001-01-01
In the literature results can be found which claim consistency for the subspace method under certain quite weak assumptions. Unfortunately, a new result gives a counter example showing inconsistency under these assumptions and then gives new more strict sufficient assumptions which however does n...... not include important model structures as e.g. Box-Jenkins. Based on a simple least squares approach this paper shows the possible inconsistency under the weak assumptions and develops only slightly stricter assumptions sufficient for consistency and which includes any model structure...
International Nuclear Information System (INIS)
Sambou, Soussou
2004-01-01
In flood forecasting modelling, large basins are often considered as hydrological systems with multiple inputs and one output. Inputs are hydrological variables such rainfall, runoff and physical characteristics of basin; output is runoff. Relating inputs to output can be achieved using deterministic, conceptual, or stochastic models. Rainfall runoff models generally lack of accuracy. Physical hydrological processes based models, either deterministic or conceptual are highly data requirement demanding and by the way very complex. Stochastic multiple input-output models, using only historical chronicles of hydrological variables particularly runoff are by the way very popular among the hydrologists for large river basin flood forecasting. Application is made on the Senegal River upstream of Bakel, where the River is formed by the main branch, Bafing, and two tributaries, Bakoye and Faleme; Bafing being regulated by Manantaly Dam. A three inputs and one output model has been used for flood forecasting on Bakel. Influence of the lead forecasting, and of the three inputs taken separately, then associated two by two, and altogether has been verified using a dimensionless variance as criterion of quality. Inadequacies occur generally between model output and observations; to put model in better compliance with current observations, we have compared four parameter updating procedure, recursive least squares, Kalman filtering, stochastic gradient method, iterative method, and an AR errors forecasting model. A combination of these model updating have been used in real time flood forecasting.(Author)
High cycle fatigue test and regression methods of S-N curve
International Nuclear Information System (INIS)
Kim, D. W.; Park, J. Y.; Kim, W. G.; Yoon, J. H.
2011-11-01
The fatigue design curve in the ASME Boiler and Pressure Vessel Code Section III are based on the assumption that fatigue life is infinite after 106 cycles. This is because standard fatigue testing equipment prior to the past decades was limited in speed to less than 200 cycles per second. Traditional servo-hydraulic machines work at frequency of 50 Hz. Servo-hydraulic machines working at 1000 Hz have been developed after 1997. This machines allow high frequency and displacement of up to ±0.1 mm and dynamic load of ±20 kN are guaranteed. The frequency of resonant fatigue test machine is 50-250 Hz. Various forced vibration-based system works at 500 Hz or 1.8 kHz. Rotating bending machines allow testing frequency at 0.1-200 Hz. The main advantage of ultrasonic fatigue testing at 20 kHz is performing Although S-N curve is determined by experiment, the fatigue strength corresponding to a given fatigue life should be determined by statistical method considering the scatter of fatigue properties. In this report, the statistical methods for evaluation of fatigue test data is investigated
Variable selection methods in PLS regression - a comparison study on metabolomics data
DEFF Research Database (Denmark)
Karaman, İbrahim; Hedemann, Mette Skou; Knudsen, Knud Erik Bach
. The aim of the metabolomics study was to investigate the metabolic profile in pigs fed various cereal fractions with special attention to the metabolism of lignans using LC-MS based metabolomic approach. References 1. Lê Cao KA, Rossouw D, Robert-Granié C, Besse P: A Sparse PLS for Variable Selection when...... integrated approach. Due to the high number of variables in data sets (both raw data and after peak picking) the selection of important variables in an explorative analysis is difficult, especially when different data sets of metabolomics data need to be related. Variable selection (or removal of irrelevant...... different strategies for variable selection on PLSR method were considered and compared with respect to selected subset of variables and the possibility for biological validation. Sparse PLSR [1] as well as PLSR with Jack-knifing [2] was applied to data in order to achieve variable selection prior...
Freitas, Alex A; Limbu, Kriti; Ghafourian, Taravat
2015-01-01
Volume of distribution is an important pharmacokinetic property that indicates the extent of a drug's distribution in the body tissues. This paper addresses the problem of how to estimate the apparent volume of distribution at steady state (Vss) of chemical compounds in the human body using decision tree-based regression methods from the area of data mining (or machine learning). Hence, the pros and cons of several different types of decision tree-based regression methods have been discussed. The regression methods predict Vss using, as predictive features, both the compounds' molecular descriptors and the compounds' tissue:plasma partition coefficients (Kt:p) - often used in physiologically-based pharmacokinetics. Therefore, this work has assessed whether the data mining-based prediction of Vss can be made more accurate by using as input not only the compounds' molecular descriptors but also (a subset of) their predicted Kt:p values. Comparison of the models that used only molecular descriptors, in particular, the Bagging decision tree (mean fold error of 2.33), with those employing predicted Kt:p values in addition to the molecular descriptors, such as the Bagging decision tree using adipose Kt:p (mean fold error of 2.29), indicated that the use of predicted Kt:p values as descriptors may be beneficial for accurate prediction of Vss using decision trees if prior feature selection is applied. Decision tree based models presented in this work have an accuracy that is reasonable and similar to the accuracy of reported Vss inter-species extrapolations in the literature. The estimation of Vss for new compounds in drug discovery will benefit from methods that are able to integrate large and varied sources of data and flexible non-linear data mining methods such as decision trees, which can produce interpretable models. Graphical AbstractDecision trees for the prediction of tissue partition coefficient and volume of distribution of drugs.
Dai, Huanping; Micheyl, Christophe
2012-11-01
Psychophysical "reverse-correlation" methods allow researchers to gain insight into the perceptual representations and decision weighting strategies of individual subjects in perceptual tasks. Although these methods have gained momentum, until recently their development was limited to experiments involving only two response categories. Recently, two approaches for estimating decision weights in m-alternative experiments have been put forward. One approach extends the two-category correlation method to m > 2 alternatives; the second uses multinomial logistic regression (MLR). In this article, the relative merits of the two methods are discussed, and the issues of convergence and statistical efficiency of the methods are evaluated quantitatively using Monte Carlo simulations. The results indicate that, for a range of values of the number of trials, the estimated weighting patterns are closer to their asymptotic values for the correlation method than for the MLR method. Moreover, for the MLR method, weight estimates for different stimulus components can exhibit strong correlations, making the analysis and interpretation of measured weighting patterns less straightforward than for the correlation method. These and other advantages of the correlation method, which include computational simplicity and a close relationship to other well-established psychophysical reverse-correlation methods, make it an attractive tool to uncover decision strategies in m-alternative experiments.
Quick, “Imputation-free” meta-analysis with proxy-SNPs
Directory of Open Access Journals (Sweden)
Meesters Christian
2012-09-01
Full Text Available Abstract Background Meta-analysis (MA is widely used to pool genome-wide association studies (GWASes in order to a increase the power to detect strong or weak genotype effects or b as a result verification method. As a consequence of differing SNP panels among genotyping chips, imputation is the method of choice within GWAS consortia to avoid losing too many SNPs in a MA. YAMAS (Yet Another Meta Analysis Software, however, enables cross-GWAS conclusions prior to finished and polished imputation runs, which eventually are time-consuming. Results Here we present a fast method to avoid forfeiting SNPs present in only a subset of studies, without relying on imputation. This is accomplished by using reference linkage disequilibrium data from 1,000 Genomes/HapMap projects to find proxy-SNPs together with in-phase alleles for SNPs missing in at least one study. MA is conducted by combining association effect estimates of a SNP and those of its proxy-SNPs. Our algorithm is implemented in the MA software YAMAS. Association results from GWAS analysis applications can be used as input files for MA, tremendously speeding up MA compared to the conventional imputation approach. We show that our proxy algorithm is well-powered and yields valuable ad hoc results, possibly providing an incentive for follow-up studies. We propose our method as a quick screening step prior to imputation-based MA, as well as an additional main approach for studies without available reference data matching the ethnicities of study participants. As a proof of principle, we analyzed six dbGaP Type II Diabetes GWAS and found that the proxy algorithm clearly outperforms naïve MA on the p-value level: for 17 out of 23 we observe an improvement on the p-value level by a factor of more than two, and a maximum improvement by a factor of 2127. Conclusions YAMAS is an efficient and fast meta-analysis program which offers various methods, including conventional MA as well as inserting proxy
Zhao, Yu Xi; Xie, Ping; Sang, Yan Fang; Wu, Zi Yi
2018-04-01
Hydrological process evaluation is temporal dependent. Hydrological time series including dependence components do not meet the data consistency assumption for hydrological computation. Both of those factors cause great difficulty for water researches. Given the existence of hydrological dependence variability, we proposed a correlationcoefficient-based method for significance evaluation of hydrological dependence based on auto-regression model. By calculating the correlation coefficient between the original series and its dependence component and selecting reasonable thresholds of correlation coefficient, this method divided significance degree of dependence into no variability, weak variability, mid variability, strong variability, and drastic variability. By deducing the relationship between correlation coefficient and auto-correlation coefficient in each order of series, we found that the correlation coefficient was mainly determined by the magnitude of auto-correlation coefficient from the 1 order to p order, which clarified the theoretical basis of this method. With the first-order and second-order auto-regression models as examples, the reasonability of the deduced formula was verified through Monte-Carlo experiments to classify the relationship between correlation coefficient and auto-correlation coefficient. This method was used to analyze three observed hydrological time series. The results indicated the coexistence of stochastic and dependence characteristics in hydrological process.
Coskuntuncel, Orkun
2013-01-01
The purpose of this study is two-fold; the first aim being to show the effect of outliers on the widely used least squares regression estimator in social sciences. The second aim is to compare the classical method of least squares with the robust M-estimator using the "determination of coefficient" (R[superscript 2]). For this purpose,…
Assessing accuracy of genotype imputation in American Indians.
Directory of Open Access Journals (Sweden)
Alka Malhotra
Full Text Available Genotype imputation is commonly used in genetic association studies to test untyped variants using information on linkage disequilibrium (LD with typed markers. Imputing genotypes requires a suitable reference population in which the LD pattern is known, most often one selected from HapMap. However, some populations, such as American Indians, are not represented in HapMap. In the present study, we assessed accuracy of imputation using HapMap reference populations in a genome-wide association study in Pima Indians.Data from six randomly selected chromosomes were used. Genotypes in the study population were masked (either 1% or 20% of SNPs available for a given chromosome. The masked genotypes were then imputed using the software Markov Chain Haplotyping Algorithm. Using four HapMap reference populations, average genotype error rates ranged from 7.86% for Mexican Americans to 22.30% for Yoruba. In contrast, use of the original Pima Indian data as a reference resulted in an average error rate of 1.73%.Our results suggest that the use of HapMap reference populations results in substantial inaccuracy in the imputation of genotypes in American Indians. A possible solution would be to densely genotype or sequence a reference American Indian population.
TRIP: An interactive retrieving-inferring data imputation approach
Li, Zhixu
2016-06-25
Data imputation aims at filling in missing attribute values in databases. Existing imputation approaches to nonquantitive string data can be roughly put into two categories: (1) inferring-based approaches [2], and (2) retrieving-based approaches [1]. Specifically, the inferring-based approaches find substitutes or estimations for the missing ones from the complete part of the data set. However, they typically fall short in filling in unique missing attribute values which do not exist in the complete part of the data set [1]. The retrieving-based approaches resort to external resources for help by formulating proper web search queries to retrieve web pages containing the missing values from the Web, and then extracting the missing values from the retrieved web pages [1]. This webbased retrieving approach reaches a high imputation precision and recall, but on the other hand, issues a large number of web search queries, which brings a large overhead [1]. © 2016 IEEE.
TRIP: An interactive retrieving-inferring data imputation approach
Li, Zhixu; Qin, Lu; Cheng, Hong; Zhang, Xiangliang; Zhou, Xiaofang
2016-01-01
Data imputation aims at filling in missing attribute values in databases. Existing imputation approaches to nonquantitive string data can be roughly put into two categories: (1) inferring-based approaches [2], and (2) retrieving-based approaches [1]. Specifically, the inferring-based approaches find substitutes or estimations for the missing ones from the complete part of the data set. However, they typically fall short in filling in unique missing attribute values which do not exist in the complete part of the data set [1]. The retrieving-based approaches resort to external resources for help by formulating proper web search queries to retrieve web pages containing the missing values from the Web, and then extracting the missing values from the retrieved web pages [1]. This webbased retrieving approach reaches a high imputation precision and recall, but on the other hand, issues a large number of web search queries, which brings a large overhead [1]. © 2016 IEEE.
Imputed prices of greenhouse gases and land forests
International Nuclear Information System (INIS)
Uzawa, Hirofumi
1993-01-01
The theory of dynamic optimum formulated by Maeler gives us the basic theoretical framework within which it is possible to analyse the economic and, possibly, political circumstances under which the phenomenon of global warming occurs, and to search for the policy and institutional arrangements whereby it would be effectively arrested. The analysis developed here is an application of Maeler's theory to atmospheric quality. In the analysis a central role is played by the concept of imputed price in the dynamic context. Our determination of imputed prices of atmospheric carbon dioxide and land forests takes into account the difference in the stages of economic development. Indeed, the ratios of the imputed prices of atmospheric carbon dioxide and land forests over the per capita level of real national income are identical for all countries involved. (3 figures, 2 tables) (Author)
Fatekurohman, Mohamat; Nurmala, Nita; Anggraeni, Dian
2018-04-01
Lungs are the most important organ, in the case of respiratory system. Problems related to disorder of the lungs are various, i.e. pneumonia, emphysema, tuberculosis and lung cancer. Comparing all those problems, lung cancer is the most harmful. Considering about that, the aim of this research applies survival analysis and factors affecting the endurance of the lung cancer patient using comparison of exact, Efron and Breslow parameter approach method on hazard ratio and stratified cox regression model. The data applied are based on the medical records of lung cancer patients in Jember Paru-paru hospital on 2016, east java, Indonesia. The factors affecting the endurance of the lung cancer patients can be classified into several criteria, i.e. sex, age, hemoglobin, leukocytes, erythrocytes, sedimentation rate of blood, therapy status, general condition, body weight. The result shows that exact method of stratified cox regression model is better than other. On the other hand, the endurance of the patients is affected by their age and the general conditions.
A suggested approach for imputation of missing dietary data for young children in daycare
Stevens, June; Ou, Fang-Shu; Truesdale, Kimberly P.; Zeng, Donglin; Vaughn, Amber E.; Pratt, Charlotte; Ward, Dianne S.
2015-01-01
Background: Parent-reported 24-h diet recalls are an accepted method of estimating intake in young children. However, many children eat while at childcare making accurate proxy reports by parents difficult.Objective: The goal of this study was to demonstrate a method to impute missing weekday lunch and daytime snack nutrient data for daycare children and to explore the concurrent predictive and criterion validity of the method.Design: Data were from children aged 2-5 years in the My Parenting...
Missing data in clinical trials: control-based mean imputation and sensitivity analysis.
Mehrotra, Devan V; Liu, Fang; Permutt, Thomas
2017-09-01
In some randomized (drug versus placebo) clinical trials, the estimand of interest is the between-treatment difference in population means of a clinical endpoint that is free from the confounding effects of "rescue" medication (e.g., HbA1c change from baseline at 24 weeks that would be observed without rescue medication regardless of whether or when the assigned treatment was discontinued). In such settings, a missing data problem arises if some patients prematurely discontinue from the trial or initiate rescue medication while in the trial, the latter necessitating the discarding of post-rescue data. We caution that the commonly used mixed-effects model repeated measures analysis with the embedded missing at random assumption can deliver an exaggerated estimate of the aforementioned estimand of interest. This happens, in part, due to implicit imputation of an overly optimistic mean for "dropouts" (i.e., patients with missing endpoint data of interest) in the drug arm. We propose an alternative approach in which the missing mean for the drug arm dropouts is explicitly replaced with either the estimated mean of the entire endpoint distribution under placebo (primary analysis) or a sequence of increasingly more conservative means within a tipping point framework (sensitivity analysis); patient-level imputation is not required. A supplemental "dropout = failure" analysis is considered in which a common poor outcome is imputed for all dropouts followed by a between-treatment comparison using quantile regression. All analyses address the same estimand and can adjust for baseline covariates. Three examples and simulation results are used to support our recommendations. Copyright © 2017 John Wiley & Sons, Ltd.
Multiple imputation of missing passenger boarding data in the national census of ferry operators
2008-08-01
This report presents findings from the 2006 National Census of Ferry Operators (NCFO) augmented with imputed values for passengers and passenger miles. Due to the imputation procedures used to calculate missing data, totals in Table 1 may not corresp...
Directory of Open Access Journals (Sweden)
Mohd Faris Dziauddin
2017-07-01
Full Text Available This study estimates the effect of locational attributes on residential property values in Kuala Lumpur, Malaysia. Geographically weighted regression (GWR enables the use of the local parameter rather than the global parameter to be estimated, with the results presented in map form. The results of this study reveal that residential property values are mainly determined by the property’s physical (structural attributes, but proximity to locational attributes also contributes marginally. The use of GWR in this study is considered a better approach than other methods to examine the effect of locational attributes on residential property values. GWR has the capability to produce meaningful results in which different locational attributes have differential spatial effects across a geographical area on residential property values. This method has the ability to determine the factors on which premiums depend, and in turn it can assist the government in taxation matters.
Zhang, Hongyang; Welch, William J.; Zamar, Ruben H.
2017-01-01
Tomal et al. (2015) introduced the notion of "phalanxes" in the context of rare-class detection in two-class classification problems. A phalanx is a subset of features that work well for classification tasks. In this paper, we propose a different class of phalanxes for application in regression settings. We define a "Regression Phalanx" - a subset of features that work well together for prediction. We propose a novel algorithm which automatically chooses Regression Phalanxes from high-dimensi...
Directory of Open Access Journals (Sweden)
Puett Robin C
2009-10-01
Full Text Available Abstract Background There is increasing interest in the study of place effects on health, facilitated in part by geographic information systems. Incomplete or missing address information reduces geocoding success. Several geographic imputation methods have been suggested to overcome this limitation. Accuracy evaluation of these methods can be focused at the level of individuals and at higher group-levels (e.g., spatial distribution. Methods We evaluated the accuracy of eight geo-imputation methods for address allocation from ZIP codes to census tracts at the individual and group level. The spatial apportioning approaches underlying the imputation methods included four fixed (deterministic and four random (stochastic allocation methods using land area, total population, population under age 20, and race/ethnicity as weighting factors. Data included more than 2,000 geocoded cases of diabetes mellitus among youth aged 0-19 in four U.S. regions. The imputed distribution of cases across tracts was compared to the true distribution using a chi-squared statistic. Results At the individual level, population-weighted (total or under age 20 fixed allocation showed the greatest level of accuracy, with correct census tract assignments averaging 30.01% across all regions, followed by the race/ethnicity-weighted random method (23.83%. The true distribution of cases across census tracts was that 58.2% of tracts exhibited no cases, 26.2% had one case, 9.5% had two cases, and less than 3% had three or more. This distribution was best captured by random allocation methods, with no significant differences (p-value > 0.90. However, significant differences in distributions based on fixed allocation methods were found (p-value Conclusion Fixed imputation methods seemed to yield greatest accuracy at the individual level, suggesting use for studies on area-level environmental exposures. Fixed methods result in artificial clusters in single census tracts. For studies
Rosner, Bernard; Colditz, Graham A.
2011-01-01
Purpose Age at menopause, a major marker in the reproductive life, may bias results for evaluation of breast cancer risk after menopause. Methods We follow 38,948 premenopausal women in 1980 and identify 2,586 who reported hysterectomy without bilateral oophorectomy, and 31,626 who reported natural menopause during 22 years of follow-up. We evaluate risk factors for natural menopause, impute age at natural menopause for women reporting hysterectomy without bilateral oophorectomy and estimate the hazard of reaching natural menopause in the next 2 years. We apply this imputed age at menopause to both increase sample size and to evaluate the relation between postmenopausal exposures and risk of breast cancer. Results Age, cigarette smoking, age at menarche, pregnancy history, body mass index, history of benign breast disease, and history of breast cancer were each significantly related to age at natural menopause; duration of oral contraceptive use and family history of breast cancer were not. The imputation increased sample size substantially and although some risk factors after menopause were weaker in the expanded model (height, and alcohol use), use of hormone therapy is less biased. Conclusions Imputing age at menopause increases sample size, broadens generalizability making it applicable to women with hysterectomy, and reduces bias. PMID:21441037
Zhang, L; Liu, X J
2016-06-03
With the rapid development of next-generation high-throughput sequencing technology, RNA-seq has become a standard and important technique for transcriptome analysis. For multi-sample RNA-seq data, the existing expression estimation methods usually deal with each single-RNA-seq sample, and ignore that the read distributions are consistent across multiple samples. In the current study, we propose a structured sparse regression method, SSRSeq, to estimate isoform expression using multi-sample RNA-seq data. SSRSeq uses a non-parameter model to capture the general tendency of non-uniformity read distribution for all genes across multiple samples. Additionally, our method adds a structured sparse regularization, which not only incorporates the sparse specificity between a gene and its corresponding isoform expression levels, but also reduces the effects of noisy reads, especially for lowly expressed genes and isoforms. Four real datasets were used to evaluate our method on isoform expression estimation. Compared with other popular methods, SSRSeq reduced the variance between multiple samples, and produced more accurate isoform expression estimations, and thus more meaningful biological interpretations.
Directory of Open Access Journals (Sweden)
Ying-Hsin Chang
2013-01-01
Full Text Available Human estrogen receptor (ER isoforms, ERα and ERβ, have long been an important focus in the field of biology. To better understand the structural features associated with the binding of ERα ligands to ERα and modulate their function, several QSAR models, including CoMFA, CoMSIA, SVR, and LR methods, have been employed to predict the inhibitory activity of 68 raloxifene derivatives. In the SVR and LR modeling, 11 descriptors were selected through feature ranking and sequential feature addition/deletion to generate equations to predict the inhibitory activity toward ERα. Among four descriptors that constantly appear in various generated equations, two agree with CoMFA and CoMSIA steric fields and another two can be correlated to a calculated electrostatic potential of ERα.
Tian, Ting; McLachlan, Geoffrey J; Dieters, Mark J; Basford, Kaye E
2015-01-01
It is a common occurrence in plant breeding programs to observe missing values in three-way three-mode multi-environment trial (MET) data. We proposed modifications of models for estimating missing observations for these data arrays, and developed a novel approach in terms of hierarchical clustering. Multiple imputation (MI) was used in four ways, multiple agglomerative hierarchical clustering, normal distribution model, normal regression model, and predictive mean match. The later three models used both Bayesian analysis and non-Bayesian analysis, while the first approach used a clustering procedure with randomly selected attributes and assigned real values from the nearest neighbour to the one with missing observations. Different proportions of data entries in six complete datasets were randomly selected to be missing and the MI methods were compared based on the efficiency and accuracy of estimating those values. The results indicated that the models using Bayesian analysis had slightly higher accuracy of estimation performance than those using non-Bayesian analysis but they were more time-consuming. However, the novel approach of multiple agglomerative hierarchical clustering demonstrated the overall best performances.
Sensitivity analysis in multiple imputation in effectiveness studies of psychotherapy.
Crameri, Aureliano; von Wyl, Agnes; Koemeda, Margit; Schulthess, Peter; Tschuschke, Volker
2015-01-01
The importance of preventing and treating incomplete data in effectiveness studies is nowadays emphasized. However, most of the publications focus on randomized clinical trials (RCT). One flexible technique for statistical inference with missing data is multiple imputation (MI). Since methods such as MI rely on the assumption of missing data being at random (MAR), a sensitivity analysis for testing the robustness against departures from this assumption is required. In this paper we present a sensitivity analysis technique based on posterior predictive checking, which takes into consideration the concept of clinical significance used in the evaluation of intra-individual changes. We demonstrate the possibilities this technique can offer with the example of irregular longitudinal data collected with the Outcome Questionnaire-45 (OQ-45) and the Helping Alliance Questionnaire (HAQ) in a sample of 260 outpatients. The sensitivity analysis can be used to (1) quantify the degree of bias introduced by missing not at random data (MNAR) in a worst reasonable case scenario, (2) compare the performance of different analysis methods for dealing with missing data, or (3) detect the influence of possible violations to the model assumptions (e.g., lack of normality). Moreover, our analysis showed that ratings from the patient's and therapist's version of the HAQ could significantly improve the predictive value of the routine outcome monitoring based on the OQ-45. Since analysis dropouts always occur, repeated measurements with the OQ-45 and the HAQ analyzed with MI are useful to improve the accuracy of outcome estimates in quality assurance assessments and non-randomized effectiveness studies in the field of outpatient psychotherapy.
Directory of Open Access Journals (Sweden)
Corrado Dimauro
2010-01-01
Full Text Available Two methods of SNPs pre-selection based on single marker regression for the estimation of genomic breeding values (G-EBVs were compared using simulated data provided by the XII QTL-MAS workshop: i Bonferroni correction of the significance threshold and ii Permutation test to obtain the reference distribution of the null hypothesis and identify significant markers at P<0.01 and P<0.001 significance thresholds. From the set of markers significant at P<0.001, random subsets of 50% and 25% markers were extracted, to evaluate the effect of further reducing the number of significant SNPs on G-EBV predictions. The Bonferroni correction method allowed the identification of 595 significant SNPs that gave the best G-EBV accuracies in prediction generations (82.80%. The permutation methods gave slightly lower G-EBV accuracies even if a larger number of SNPs resulted significant (2,053 and 1,352 for 0.01 and 0.001 significance thresholds, respectively. Interestingly, halving or dividing by four the number of SNPs significant at P<0.001 resulted in an only slightly decrease of G-EBV accuracies. The genetic structure of the simulated population with few QTL carrying large effects, might have favoured the Bonferroni method.
Akbari, Somaye; Zebardast, Tannaz; Zarghi, Afshin; Hajimahdi, Zahra
2017-01-01
COX-2 inhibitory activities of some 1,4-dihydropyridine and 5-oxo-1,4,5,6,7,8-hexahydroquinoline derivatives were modeled by quantitative structure-activity relationship (QSAR) using stepwise-multiple linear regression (SW-MLR) method. The built model was robust and predictive with correlation coefficient (R 2 ) of 0.972 and 0.531 for training and test groups, respectively. The quality of the model was evaluated by leave-one-out (LOO) cross validation (LOO correlation coefficient (Q 2 ) of 0.943) and Y-randomization. We also employed a leverage approach for the defining of applicability domain of model. Based on QSAR models results, COX-2 inhibitory activity of selected data set had correlation with BEHm6 (highest eigenvalue n. 6 of Burden matrix/weighted by atomic masses), Mor03u (signal 03/unweighted) and IVDE (Mean information content on the vertex degree equality) descriptors which derived from their structures.
Mitt, Mario; Kals, Mart; Pärn, Kalle; Gabriel, Stacey B; Lander, Eric S; Palotie, Aarno; Ripatti, Samuli; Morris, Andrew P; Metspalu, Andres; Esko, Tõnu; Mägi, Reedik; Palta, Priit
2017-06-01
Genetic imputation is a cost-efficient way to improve the power and resolution of genome-wide association (GWA) studies. Current publicly accessible imputation reference panels accurately predict genotypes for common variants with minor allele frequency (MAF)≥5% and low-frequency variants (0.5≤MAF<5%) across diverse populations, but the imputation of rare variation (MAF<0.5%) is still rather limited. In the current study, we evaluate imputation accuracy achieved with reference panels from diverse populations with a population-specific high-coverage (30 ×) whole-genome sequencing (WGS) based reference panel, comprising of 2244 Estonian individuals (0.25% of adult Estonians). Although the Estonian-specific panel contains fewer haplotypes and variants, the imputation confidence and accuracy of imputed low-frequency and rare variants was significantly higher. The results indicate the utility of population-specific reference panels for human genetic studies.
Sequence imputation of HPV16 genomes for genetic association studies.
Directory of Open Access Journals (Sweden)
Benjamin Smith
Full Text Available Human Papillomavirus type 16 (HPV16 causes over half of all cervical cancer and some HPV16 variants are more oncogenic than others. The genetic basis for the extraordinary oncogenic properties of HPV16 compared to other HPVs is unknown. In addition, we neither know which nucleotides vary across and within HPV types and lineages, nor which of the single nucleotide polymorphisms (SNPs determine oncogenicity.A reference set of 62 HPV16 complete genome sequences was established and used to examine patterns of evolutionary relatedness amongst variants using a pairwise identity heatmap and HPV16 phylogeny. A BLAST-based algorithm was developed to impute complete genome data from partial sequence information using the reference database. To interrogate the oncogenic risk of determined and imputed HPV16 SNPs, odds-ratios for each SNP were calculated in a case-control viral genome-wide association study (VWAS using biopsy confirmed high-grade cervix neoplasia and self-limited HPV16 infections from Guanacaste, Costa Rica.HPV16 variants display evolutionarily stable lineages that contain conserved diagnostic SNPs. The imputation algorithm indicated that an average of 97.5±1.03% of SNPs could be accurately imputed. The VWAS revealed specific HPV16 viral SNPs associated with variant lineages and elevated odds ratios; however, individual causal SNPs could not be distinguished with certainty due to the nature of HPV evolution.Conserved and lineage-specific SNPs can be imputed with a high degree of accuracy from limited viral polymorphic data due to the lack of recombination and the stochastic mechanism of variation accumulation in the HPV genome. However, to determine the role of novel variants or non-lineage-specific SNPs by VWAS will require direct sequence analysis. The investigation of patterns of genetic variation and the identification of diagnostic SNPs for lineages of HPV16 variants provides a valuable resource for future studies of HPV16
Imputing amino acid polymorphisms in human leukocyte antigens.
Directory of Open Access Journals (Sweden)
Xiaoming Jia
Full Text Available DNA sequence variation within human leukocyte antigen (HLA genes mediate susceptibility to a wide range of human diseases. The complex genetic structure of the major histocompatibility complex (MHC makes it difficult, however, to collect genotyping data in large cohorts. Long-range linkage disequilibrium between HLA loci and SNP markers across the major histocompatibility complex (MHC region offers an alternative approach through imputation to interrogate HLA variation in existing GWAS data sets. Here we describe a computational strategy, SNP2HLA, to impute classical alleles and amino acid polymorphisms at class I (HLA-A, -B, -C and class II (-DPA1, -DPB1, -DQA1, -DQB1, and -DRB1 loci. To characterize performance of SNP2HLA, we constructed two European ancestry reference panels, one based on data collected in HapMap-CEPH pedigrees (90 individuals and another based on data collected by the Type 1 Diabetes Genetics Consortium (T1DGC, 5,225 individuals. We imputed HLA alleles in an independent data set from the British 1958 Birth Cohort (N = 918 with gold standard four-digit HLA types and SNPs genotyped using the Affymetrix GeneChip 500 K and Illumina Immunochip microarrays. We demonstrate that the sample size of the reference panel, rather than SNP density of the genotyping platform, is critical to achieve high imputation accuracy. Using the larger T1DGC reference panel, the average accuracy at four-digit resolution is 94.7% using the low-density Affymetrix GeneChip 500 K, and 96.7% using the high-density Illumina Immunochip. For amino acid polymorphisms within HLA genes, we achieve 98.6% and 99.3% accuracy using the Affymetrix GeneChip 500 K and Illumina Immunochip, respectively. Finally, we demonstrate how imputation and association testing at amino acid resolution can facilitate fine-mapping of primary MHC association signals, giving a specific example from type 1 diabetes.
Directory of Open Access Journals (Sweden)
Lüdtke Rainer
2008-08-01
Full Text Available Abstract Background Regression to the mean (RTM occurs in situations of repeated measurements when extreme values are followed by measurements in the same subjects that are closer to the mean of the basic population. In uncontrolled studies such changes are likely to be interpreted as a real treatment effect. Methods Several statistical approaches have been developed to analyse such situations, including the algorithm of Mee and Chua which assumes a known population mean μ. We extend this approach to a situation where μ is unknown and suggest to vary it systematically over a range of reasonable values. Using differential calculus we provide formulas to estimate the range of μ where treatment effects are likely to occur when RTM is present. Results We successfully applied our method to three real world examples denoting situations when (a no treatment effect can be confirmed regardless which μ is true, (b when a treatment effect must be assumed independent from the true μ and (c in the appraisal of results of uncontrolled studies. Conclusion Our method can be used to separate the wheat from the chaff in situations, when one has to interpret the results of uncontrolled studies. In meta-analysis, health-technology reports or systematic reviews this approach may be helpful to clarify the evidence given from uncontrolled observational studies.
Regression analysis by example
Chatterjee, Samprit
2012-01-01
Praise for the Fourth Edition: ""This book is . . . an excellent source of examples for regression analysis. It has been and still is readily readable and understandable."" -Journal of the American Statistical Association Regression analysis is a conceptually simple method for investigating relationships among variables. Carrying out a successful application of regression analysis, however, requires a balance of theoretical results, empirical rules, and subjective judgment. Regression Analysis by Example, Fifth Edition has been expanded
International Nuclear Information System (INIS)
Arsenault, Louis-François; Millis, Andrew J; Neuberg, Richard; Hannah, Lauren A
2017-01-01
We present a supervised machine learning approach to the inversion of Fredholm integrals of the first kind as they arise, for example, in the analytic continuation problem of quantum many-body physics. The approach provides a natural regularization for the ill-conditioned inverse of the Fredholm kernel, as well as an efficient and stable treatment of constraints. The key observation is that the stability of the forward problem permits the construction of a large database of outputs for physically meaningful inputs. Applying machine learning to this database generates a regression function of controlled complexity, which returns approximate solutions for previously unseen inputs; the approximate solutions are then projected onto the subspace of functions satisfying relevant constraints. Under standard error metrics the method performs as well or better than the Maximum Entropy method for low input noise and is substantially more robust to increased input noise. We suggest that the methodology will be similarly effective for other problems involving a formally ill-conditioned inversion of an integral operator, provided that the forward problem can be efficiently solved. (paper)
Energy Technology Data Exchange (ETDEWEB)
Golusin, Mirjana [Educons University, Vojvode Putnika st. bb, 21013 Sremska Kamnica (RS); Ivanovic, Olja Munitlak [Faculty of Business in Services, Vojvode Putnik st. bb, 21013 Sremska Kamenica (RS); Teodorovic, Natasa [Faculty of Entrepreneurial Management, Modene st. 5, 21000 Novi Sad (RS)
2011-01-15
The need for preservation and adequate management of the quality of environment requires the development of new methods and techniques by which the achieved degree of sustainable development can be defined as well as the laws regarding the relationship among its subsystems. Main objective of research is to point to a strong contradiction between the development of ecological and economic subsystems. In order to improve previous research, this study suggests the use of linear evaluation, by which it is possible to determine the exact degree of contradiction between these two subsystems and to define the regularities as well as the deviations. Authors present the essential steps that were used. Conducted by the method of linear regression this research shows a significant negative correlation between ecological and economic subsystem indicators, whereas its value R{sup 2} 0.58 proves the expected contradiction that exists between the two previously mentioned subsystems. By observing the sustainable development as a two-dimensional system that includes ecological and economic indicators, the authors suggest the methodology to modelling the relationship between economic and ecological development as an orthogonal distance between the degree of the current state measured by the relation between economic and ecological indicators of sustainable development and the degree which was obtained in a traditional way. The method used in this research proved to be extremely suitable for modelling the relationship between ecological and economic subsystems of sustainable development. This research was conducted on a repeated sample of countries of South East Europe by including the data for France and Germany, being two countries on the highest level of development in the European Union. (author)
Multiple imputation to account for measurement error in marginal structural models
Edwards, Jessie K.; Cole, Stephen R.; Westreich, Daniel; Crane, Heidi; Eron, Joseph J.; Mathews, W. Christopher; Moore, Richard; Boswell, Stephen L.; Lesko, Catherine R.; Mugavero, Michael J.
2015-01-01
Background Marginal structural models are an important tool for observational studies. These models typically assume that variables are measured without error. We describe a method to account for differential and non-differential measurement error in a marginal structural model. Methods We illustrate the method estimating the joint effects of antiretroviral therapy initiation and current smoking on all-cause mortality in a United States cohort of 12,290 patients with HIV followed for up to 5 years between 1998 and 2011. Smoking status was likely measured with error, but a subset of 3686 patients who reported smoking status on separate questionnaires composed an internal validation subgroup. We compared a standard joint marginal structural model fit using inverse probability weights to a model that also accounted for misclassification of smoking status using multiple imputation. Results In the standard analysis, current smoking was not associated with increased risk of mortality. After accounting for misclassification, current smoking without therapy was associated with increased mortality [hazard ratio (HR): 1.2 (95% CI: 0.6, 2.3)]. The HR for current smoking and therapy (0.4 (95% CI: 0.2, 0.7)) was similar to the HR for no smoking and therapy (0.4; 95% CI: 0.2, 0.6). Conclusions Multiple imputation can be used to account for measurement error in concert with methods for causal inference to strengthen results from observational studies. PMID:26214338
Energy Technology Data Exchange (ETDEWEB)
Lee, Sang Dae; Lohumi, Santosh; Cho, Byoung Kwan [Dept. of Biosystems Machinery Engineering, Chungnam National University, Daejeon (Korea, Republic of); Kim, Moon Sung [United States Department of Agriculture Agricultural Research Service, Washington (United States); Lee, Soo Hee [Life and Technology Co.,Ltd., Hwasung (Korea, Republic of)
2014-08-15
This study was conducted to develop a non-destructive detection method for adulterated powder products using Raman spectroscopy and partial least squares regression(PLSR). Garlic and ginger powder, which are used as natural seasoning and in health supplement foods, were selected for this experiment. Samples were adulterated with corn starch in concentrations of 5-35%. PLSR models for adulterated garlic and ginger powders were developed and their performances evaluated using cross validation. The R{sup 2}{sub c} and SEC of an optimal PLSR model were 0.99 and 2.16 for the garlic powder samples, and 0.99 and 0.84 for the ginger samples, respectively. The variable importance in projection (VIP) score is a useful and simple tool for the evaluation of the importance of each variable in a PLSR model. After the VIP scores were taken pre-selection, the Raman spectrum data was reduced by one third. New PLSR models, based on a reduced number of wavelengths selected by the VIP scores technique, gave good predictions for the adulterated garlic and ginger powder samples.
Akita, Yasuyuki; Baldasano, Jose M; Beelen, Rob; Cirach, Marta; de Hoogh, Kees; Hoek, Gerard; Nieuwenhuijsen, Mark; Serre, Marc L; de Nazelle, Audrey
2014-04-15
In recognition that intraurban exposure gradients may be as large as between-city variations, recent air pollution epidemiologic studies have become increasingly interested in capturing within-city exposure gradients. In addition, because of the rapidly accumulating health data, recent studies also need to handle large study populations distributed over large geographic domains. Even though several modeling approaches have been introduced, a consistent modeling framework capturing within-city exposure variability and applicable to large geographic domains is still missing. To address these needs, we proposed a modeling framework based on the Bayesian Maximum Entropy method that integrates monitoring data and outputs from existing air quality models based on Land Use Regression (LUR) and Chemical Transport Models (CTM). The framework was applied to estimate the yearly average NO2 concentrations over the region of Catalunya in Spain. By jointly accounting for the global scale variability in the concentration from the output of CTM and the intraurban scale variability through LUR model output, the proposed framework outperformed more conventional approaches.
International Nuclear Information System (INIS)
Lee, Sang Dae; Lohumi, Santosh; Cho, Byoung Kwan; Kim, Moon Sung; Lee, Soo Hee
2014-01-01
This study was conducted to develop a non-destructive detection method for adulterated powder products using Raman spectroscopy and partial least squares regression(PLSR). Garlic and ginger powder, which are used as natural seasoning and in health supplement foods, were selected for this experiment. Samples were adulterated with corn starch in concentrations of 5-35%. PLSR models for adulterated garlic and ginger powders were developed and their performances evaluated using cross validation. The R 2 c and SEC of an optimal PLSR model were 0.99 and 2.16 for the garlic powder samples, and 0.99 and 0.84 for the ginger samples, respectively. The variable importance in projection (VIP) score is a useful and simple tool for the evaluation of the importance of each variable in a PLSR model. After the VIP scores were taken pre-selection, the Raman spectrum data was reduced by one third. New PLSR models, based on a reduced number of wavelengths selected by the VIP scores technique, gave good predictions for the adulterated garlic and ginger powder samples.
Matson, Johnny L.; Kozlowski, Alison M.
2010-01-01
Autistic regression is one of the many mysteries in the developmental course of autism and pervasive developmental disorders not otherwise specified (PDD-NOS). Various definitions of this phenomenon have been used, further clouding the study of the topic. Despite this problem, some efforts at establishing prevalence have been made. The purpose of…
Combining item response theory with multiple imputation to equate health assessment questionnaires.
Gu, Chenyang; Gutman, Roee
2017-09-01
The assessment of patients' functional status across the continuum of care requires a common patient assessment tool. However, assessment tools that are used in various health care settings differ and cannot be easily contrasted. For example, the Functional Independence Measure (FIM) is used to evaluate the functional status of patients who stay in inpatient rehabilitation facilities, the Minimum Data Set (MDS) is collected for all patients who stay in skilled nursing facilities, and the Outcome and Assessment Information Set (OASIS) is collected if they choose home health care provided by home health agencies. All three instruments or questionnaires include functional status items, but the specific items, rating scales, and instructions for scoring different activities vary between the different settings. We consider equating different health assessment questionnaires as a missing data problem, and propose a variant of predictive mean matching method that relies on Item Response Theory (IRT) models to impute unmeasured item responses. Using real data sets, we simulated missing measurements and compared our proposed approach to existing methods for missing data imputation. We show that, for all of the estimands considered, and in most of the experimental conditions that were examined, the proposed approach provides valid inferences, and generally has better coverages, relatively smaller biases, and shorter interval estimates. The proposed method is further illustrated using a real data set. © 2016, The International Biometric Society.
Multiple Imputation to Account for Measurement Error in Marginal Structural Models.
Edwards, Jessie K; Cole, Stephen R; Westreich, Daniel; Crane, Heidi; Eron, Joseph J; Mathews, W Christopher; Moore, Richard; Boswell, Stephen L; Lesko, Catherine R; Mugavero, Michael J
2015-09-01
Marginal structural models are an important tool for observational studies. These models typically assume that variables are measured without error. We describe a method to account for differential and nondifferential measurement error in a marginal structural model. We illustrate the method estimating the joint effects of antiretroviral therapy initiation and current smoking on all-cause mortality in a United States cohort of 12,290 patients with HIV followed for up to 5 years between 1998 and 2011. Smoking status was likely measured with error, but a subset of 3,686 patients who reported smoking status on separate questionnaires composed an internal validation subgroup. We compared a standard joint marginal structural model fit using inverse probability weights to a model that also accounted for misclassification of smoking status using multiple imputation. In the standard analysis, current smoking was not associated with increased risk of mortality. After accounting for misclassification, current smoking without therapy was associated with increased mortality (hazard ratio [HR]: 1.2 [95% confidence interval [CI] = 0.6, 2.3]). The HR for current smoking and therapy [0.4 (95% CI = 0.2, 0.7)] was similar to the HR for no smoking and therapy (0.4; 95% CI = 0.2, 0.6). Multiple imputation can be used to account for measurement error in concert with methods for causal inference to strengthen results from observational studies.
An Imputation Model for Dropouts in Unemployment Data
Directory of Open Access Journals (Sweden)
Nilsson Petra
2016-09-01
Full Text Available Incomplete unemployment data is a fundamental problem when evaluating labour market policies in several countries. Many unemployment spells end for unknown reasons; in the Swedish Public Employment Service’s register as many as 20 percent. This leads to an ambiguity regarding destination states (employment, unemployment, retired, etc.. According to complete combined administrative data, the employment rate among dropouts was close to 50 for the years 1992 to 2006, but from 2007 the employment rate has dropped to 40 or less. This article explores an imputation approach. We investigate imputation models estimated both on survey data from 2005/2006 and on complete combined administrative data from 2005/2006 and 2011/2012. The models are evaluated in terms of their ability to make correct predictions. The models have relatively high predictive power.
Hopke, P K; Liu, C; Rubin, D B
2001-03-01
Many chemical and environmental data sets are complicated by the existence of fully missing values or censored values known to lie below detection thresholds. For example, week-long samples of airborne particulate matter were obtained at Alert, NWT, Canada, between 1980 and 1991, where some of the concentrations of 24 particulate constituents were coarsened in the sense of being either fully missing or below detection limits. To facilitate scientific analysis, it is appealing to create complete data by filling in missing values so that standard complete-data methods can be applied. We briefly review commonly used strategies for handling missing values and focus on the multiple-imputation approach, which generally leads to valid inferences when faced with missing data. Three statistical models are developed for multiply imputing the missing values of airborne particulate matter. We expect that these models are useful for creating multiple imputations in a variety of incomplete multivariate time series data sets.
A regressive methodology for estimating missing data in rainfall daily time series
Barca, E.; Passarella, G.
2009-04-01
The "presence" of gaps in environmental data time series represents a very common, but extremely critical problem, since it can produce biased results (Rubin, 1976). Missing data plagues almost all surveys. The problem is how to deal with missing data once it has been deemed impossible to recover the actual missing values. Apart from the amount of missing data, another issue which plays an important role in the choice of any recovery approach is the evaluation of "missingness" mechanisms. When data missing is conditioned by some other variable observed in the data set (Schafer, 1997) the mechanism is called MAR (Missing at Random). Otherwise, when the missingness mechanism depends on the actual value of the missing data, it is called NCAR (Not Missing at Random). This last is the most difficult condition to model. In the last decade interest arose in the estimation of missing data by using regression (single imputation). More recently multiple imputation has become also available, which returns a distribution of estimated values (Scheffer, 2002). In this paper an automatic methodology for estimating missing data is presented. In practice, given a gauging station affected by missing data (target station), the methodology checks the randomness of the missing data and classifies the "similarity" between the target station and the other gauging stations spread over the study area. Among different methods useful for defining the similarity degree, whose effectiveness strongly depends on the data distribution, the Spearman correlation coefficient was chosen. Once defined the similarity matrix, a suitable, nonparametric, univariate, and regressive method was applied in order to estimate missing data in the target station: the Theil method (Theil, 1950). Even though the methodology revealed to be rather reliable an improvement of the missing data estimation can be achieved by a generalization. A first possible improvement consists in extending the univariate technique to
A spatial haplotype copying model with applications to genotype imputation.
Yang, Wen-Yun; Hormozdiari, Farhad; Eskin, Eleazar; Pasaniuc, Bogdan
2015-05-01
Ever since its introduction, the haplotype copy model has proven to be one of the most successful approaches for modeling genetic variation in human populations, with applications ranging from ancestry inference to genotype phasing and imputation. Motivated by coalescent theory, this approach assumes that any chromosome (haplotype) can be modeled as a mosaic of segments copied from a set of chromosomes sampled from the same population. At the core of the model is the assumption that any chromosome from the sample is equally likely to contribute a priori to the copying process. Motivated by recent works that model genetic variation in a geographic continuum, we propose a new spatial-aware haplotype copy model that jointly models geography and the haplotype copying process. We extend hidden Markov models of haplotype diversity such that at any given location, haplotypes that are closest in the genetic-geographic continuum map are a priori more likely to contribute to the copying process than distant ones. Through simulations starting from the 1000 Genomes data, we show that our model achieves superior accuracy in genotype imputation over the standard spatial-unaware haplotype copy model. In addition, we show the utility of our model in selecting a small personalized reference panel for imputation that leads to both improved accuracy as well as to a lower computational runtime than the standard approach. Finally, we show our proposed model can be used to localize individuals on the genetic-geographical map on the basis of their genotype data.
Gómez-Valent, Adrià; Amendola, Luca
2018-04-01
In this paper we present new constraints on the Hubble parameter H0 using: (i) the available data on H(z) obtained from cosmic chronometers (CCH); (ii) the Hubble rate data points extracted from the supernovae of Type Ia (SnIa) of the Pantheon compilation and the Hubble Space Telescope (HST) CANDELS and CLASH Multy-Cycle Treasury (MCT) programs; and (iii) the local HST measurement of H0 provided by Riess et al. (2018), H0HST=(73.45±1.66) km/s/Mpc. Various determinations of H0 using the Gaussian processes (GPs) method and the most updated list of CCH data have been recently provided by Yu, Ratra & Wang (2018). Using the Gaussian kernel they find H0=(67.42± 4.75) km/s/Mpc. Here we extend their analysis to also include the most released and complete set of SnIa data, which allows us to reduce the uncertainty by a factor ~ 3 with respect to the result found by only considering the CCH information. We obtain H0=(67.06± 1.68) km/s/Mpc, which favors again the lower range of values for H0 and is in tension with H0HST. The tension reaches the 2.71σ level. We round off the GPs determination too by taking also into account the error propagation of the kernel hyperparameters when the CCH with and without H0HST are used in the analysis. In addition, we present a novel method to reconstruct functions from data, which consists in a weighted sum of polynomial regressions (WPR). We apply it from a cosmographic perspective to reconstruct H(z) and estimate H0 from CCH and SnIa measurements. The result obtained with this method, H0=(68.90± 1.96) km/s/Mpc, is fully compatible with the GPs ones. Finally, a more conservative GPs+WPR value is also provided, H0=(68.45± 2.00) km/s/Mpc, which is still almost 2σ away from H0HST.
Directory of Open Access Journals (Sweden)
Assaf Gottlieb
2017-11-01
Full Text Available Abstract Background Genome-wide association studies are useful for discovering genotype–phenotype associations but are limited because they require large cohorts to identify a signal, which can be population-specific. Mapping genetic variation to genes improves power and allows the effects of both protein-coding variation as well as variation in expression to be combined into “gene level” effects. Methods Previous work has shown that warfarin dose can be predicted using information from genetic variation that affects protein-coding regions. Here, we introduce a method that improves dose prediction by integrating tissue-specific gene expression. In particular, we use drug pathways and expression quantitative trait loci knowledge to impute gene expression—on the assumption that differential expression of key pathway genes may impact dose requirement. We focus on 116 genes from the pharmacokinetic and pharmacodynamic pathways of warfarin within training and validation sets comprising both European and African-descent individuals. Results We build gene-tissue signatures associated with warfarin dose in a cohort-specific manner and identify a signature of 11 gene-tissue pairs that significantly augments the International Warfarin Pharmacogenetics Consortium dosage-prediction algorithm in both populations. Conclusions Our results demonstrate that imputed expression can improve dose prediction and bridge population-specific compositions. MATLAB code is available at https://github.com/assafgo/warfarin-cohort
Cao, M H; Adeola, O
2016-02-01
The energy values of poultry byproduct meal (PBM) and animal-vegetable oil blend (A-V blend) were determined in 2 experiments with 288 broiler chickens from d 19 to 25 post hatching. The birds were fed a starter diet from d 0 to 19 post hatching. In each experiment, 144 birds were grouped by weight into 8 replicates of cages with 6 birds per cage. There were 3 diets in each experiment consisting of one reference diet (RD) and 2 test diets (TD). The TD contained 2 levels of PBM (Exp. 1) or A-V blend (Exp. 2) that replaced the energy sources in the RD at 50 or 100 g/kg (Exp. 1) or 40 or 80 g/kg (Exp. 2) in such a way that the same ratio were maintained for energy ingredients across experimental diets. The ileal digestible energy (IDE), ME, and MEn of PBM and A-V blend were determined by the regression method. Dry matter of PBM and A-V blend were 984 and 999 g/kg; the gross energies were 5,284 and 9,604 kcal/kg of DM, respectively. Addition of PBM to the RD in Exp. 1 linearly decreased (P blend to the RD linearly increased (P blend as follows: IDE = 10,616x + 7.350, r(2) = 0.96; ME = 10,121x + 0.447, r(2) = 0.99; MEn = 10,124x + 2.425, r(2) = 0.99. These data indicate the respective IDE, ME, MEn values (kcal/kg of DM) of PBM evaluated to be 3,537, 3,805, and 3,278, and A-V blend evaluated to be 10,616, 10,121, and 10,124. © 2015 Poultry Science Association Inc.
Helms, Ronald W; Reece, Laura Helms; Helms, Russell W; Helms, Mary W
2011-03-01
Missing not at random (MNAR) post-dropout missing data from a longitudinal clinical trial result in the collection of "biased data," which leads to biased estimators and tests of corrupted hypotheses. In a full rank linear model analysis the model equation, E[Y] = Xβ, leads to the definition of the primary parameter β = (X'X)(-1)X'E[Y], and the definition of linear secondary parameters of the form θ = Lβ = L(X'X)(-1)X'E[Y], including, for example, a parameter representing a "treatment effect." These parameters depend explicitly on E[Y], which raises the questions: What is E[Y] when some elements of the incomplete random vector Y are not observed and MNAR, or when such a Y is "completed" via imputation? We develop a rigorous, readily interpretable definition of E[Y] in this context that leads directly to definitions of β, Bias(β) = E[β] - β, Bias(θ) = E[θ] - Lβ, and the extent of hypothesis corruption. These definitions provide a basis for evaluating, comparing, and removing biases induced by various linear imputation methods for MNAR incomplete data from longitudinal clinical trials. Linear imputation methods use earlier data from a subject to impute values for post-dropout missing values and include "Last Observation Carried Forward" (LOCF) and "Baseline Observation Carried Forward" (BOCF), among others. We illustrate the methods of evaluating, comparing, and removing biases and the effects of testing corresponding corrupted hypotheses via a hypothetical but very realistic longitudinal analgesic clinical trial.
Choi, Giehae; Bell, Michelle L.; Lee, Jong-Tae
2017-04-01
The land-use regression (LUR) approach to estimate the levels of ambient air pollutants is becoming popular due to its high validity in predicting small-area variations. However, only a few studies have been conducted in Asian countries, and much less research has been conducted on comparing the performances and applied estimates of different exposure assessments including LUR. The main objectives of the current study were to conduct nitrogen dioxide (NO2) exposure assessment with four methods including LUR in the Republic of Korea, to compare the model performances, and to estimate the empirical NO2 exposures of a cohort. The study population was defined as the year 2010 participants of a government-supported cohort established for bio-monitoring in Ulsan, Republic of Korea. The annual ambient NO2 exposures of the 969 study participants were estimated with LUR, nearest station, inverse distance weighting, and ordinary kriging. Modeling was based on the annual NO2 average, traffic-related data, land-use data, and altitude of the 13 regularly monitored stations. The final LUR model indicated that area of transportation, distance to residential area, and area of wetland were important predictors of NO2. The LUR model explained 85.8% of the variation observed in the 13 monitoring stations of the year 2009. The LUR model outperformed the others based on leave-one out cross-validation comparing the correlations and root-mean square error. All NO2 estimates ranged from 11.3-18.0 ppb, with that of LUR having the widest range. The NO2 exposure levels of the residents differed by demographics. However, the average was below the national annual guidelines of the Republic of Korea (30 ppb). The LUR models showed high performances in an industrial city in the Republic of Korea, despite the small sample size and limited data. Our findings suggest that the LUR method may be useful in similar settings in Asian countries where the target region is small and availability of data is
Accounting for one-channel depletion improves missing value imputation in 2-dye microarray data.
Ritz, Cecilia; Edén, Patrik
2008-01-19
For 2-dye microarray platforms, some missing values may arise from an un-measurably low RNA expression in one channel only. Information of such "one-channel depletion" is so far not included in algorithms for imputation of missing values. Calculating the mean deviation between imputed values and duplicate controls in five datasets, we show that KNN-based imputation gives a systematic bias of the imputed expression values of one-channel depleted spots. Evaluating the correction of this bias by cross-validation showed that the mean square deviation between imputed values and duplicates were reduced up to 51%, depending on dataset. By including more information in the imputation step, we more accurately estimate missing expression values.
Gottlieb, Assaf; Daneshjou, Roxana; DeGorter, Marianne; Bourgeois, Stephane; Svensson, Peter J; Wadelius, Mia; Deloukas, Panos; Montgomery, Stephen B; Altman, Russ B
2017-11-24
Genome-wide association studies are useful for discovering genotype-phenotype associations but are limited because they require large cohorts to identify a signal, which can be population-specific. Mapping genetic variation to genes improves power and allows the effects of both protein-coding variation as well as variation in expression to be combined into "gene level" effects. Previous work has shown that warfarin dose can be predicted using information from genetic variation that affects protein-coding regions. Here, we introduce a method that improves dose prediction by integrating tissue-specific gene expression. In particular, we use drug pathways and expression quantitative trait loci knowledge to impute gene expression-on the assumption that differential expression of key pathway genes may impact dose requirement. We focus on 116 genes from the pharmacokinetic and pharmacodynamic pathways of warfarin within training and validation sets comprising both European and African-descent individuals. We build gene-tissue signatures associated with warfarin dose in a cohort-specific manner and identify a signature of 11 gene-tissue pairs that significantly augments the International Warfarin Pharmacogenetics Consortium dosage-prediction algorithm in both populations. Our results demonstrate that imputed expression can improve dose prediction and bridge population-specific compositions. MATLAB code is available at https://github.com/assafgo/warfarin-cohort.
FCMPSO: An Imputation for Missing Data Features in Heart Disease Classification
Salleh, Mohd Najib Mohd; Ashikin Samat, Nurul
2017-08-01
The application of data mining and machine learning in directing clinical research into possible hidden knowledge is becoming greatly influential in medical areas. Heart Disease is a killer disease around the world, and early prevention through efficient methods can help to reduce the mortality number. Medical data may contain many uncertainties, as they are fuzzy and vague in nature. Nonetheless, imprecise features data such as no values and missing values can affect quality of classification results. Nevertheless, the other complete features are still capable to give information in certain features. Therefore, an imputation approach based on Fuzzy C-Means and Particle Swarm Optimization (FCMPSO) is developed in preprocessing stage to help fill in the missing values. Then, the complete dataset is trained in classification algorithm, Decision Tree. The experiment is trained with Heart Disease dataset and the performance is analysed using accuracy, precision, and ROC values. Results show that the performance of Decision Tree is increased after the application of FCMSPO for imputation.
Performance of genotype imputation for low frequency and rare variants from the 1000 genomes.
Zheng, Hou-Feng; Rong, Jing-Jing; Liu, Ming; Han, Fang; Zhang, Xing-Wei; Richards, J Brent; Wang, Li
2015-01-01
Genotype imputation is now routinely applied in genome-wide association studies (GWAS) and meta-analyses. However, most of the imputations have been run using HapMap samples as reference, imputation of low frequency and rare variants (minor allele frequency (MAF) 1000 Genomes panel) are available to facilitate imputation of these variants. Therefore, in order to estimate the performance of low frequency and rare variants imputation, we imputed 153 individuals, each of whom had 3 different genotype array data including 317k, 610k and 1 million SNPs, to three different reference panels: the 1000 Genomes pilot March 2010 release (1KGpilot), the 1000 Genomes interim August 2010 release (1KGinterim), and the 1000 Genomes phase1 November 2010 and May 2011 release (1KGphase1) by using IMPUTE version 2. The differences between these three releases of the 1000 Genomes data are the sample size, ancestry diversity, number of variants and their frequency spectrum. We found that both reference panel and GWAS chip density affect the imputation of low frequency and rare variants. 1KGphase1 outperformed the other 2 panels, at higher concordance rate, higher proportion of well-imputed variants (info>0.4) and higher mean info score in each MAF bin. Similarly, 1M chip array outperformed 610K and 317K. However for very rare variants (MAF ≤ 0.3%), only 0-1% of the variants were well imputed. We conclude that the imputation of low frequency and rare variants improves with larger reference panels and higher density of genome-wide genotyping arrays. Yet, despite a large reference panel size and dense genotyping density, very rare variants remain difficult to impute.
Wan, Jian; Chen, Yi-Chieh; Morris, A Julian; Thennadil, Suresh N
2017-07-01
Near-infrared (NIR) spectroscopy is being widely used in various fields ranging from pharmaceutics to the food industry for analyzing chemical and physical properties of the substances concerned. Its advantages over other analytical techniques include available physical interpretation of spectral data, nondestructive nature and high speed of measurements, and little or no need for sample preparation. The successful application of NIR spectroscopy relies on three main aspects: pre-processing of spectral data to eliminate nonlinear variations due to temperature, light scattering effects and many others, selection of those wavelengths that contribute useful information, and identification of suitable calibration models using linear/nonlinear regression . Several methods have been developed for each of these three aspects and many comparative studies of different methods exist for an individual aspect or some combinations. However, there is still a lack of comparative studies for the interactions among these three aspects, which can shed light on what role each aspect plays in the calibration and how to combine various methods of each aspect together to obtain the best calibration model. This paper aims to provide such a comparative study based on four benchmark data sets using three typical pre-processing methods, namely, orthogonal signal correction (OSC), extended multiplicative signal correction (EMSC) and optical path-length estimation and correction (OPLEC); two existing wavelength selection methods, namely, stepwise forward selection (SFS) and genetic algorithm optimization combined with partial least squares regression for spectral data (GAPLSSP); four popular regression methods, namely, partial least squares (PLS), least absolute shrinkage and selection operator (LASSO), least squares support vector machine (LS-SVM), and Gaussian process regression (GPR). The comparative study indicates that, in general, pre-processing of spectral data can play a significant
Highly accurate sequence imputation enables precise QTL mapping in Brown Swiss cattle.
Frischknecht, Mirjam; Pausch, Hubert; Bapst, Beat; Signer-Hasler, Heidi; Flury, Christine; Garrick, Dorian; Stricker, Christian; Fries, Ruedi; Gredler-Grandl, Birgit
2017-12-29
Within the last few years a large amount of genomic information has become available in cattle. Densities of genomic information vary from a few thousand variants up to whole genome sequence information. In order to combine genomic information from different sources and infer genotypes for a common set of variants, genotype imputation is required. In this study we evaluated the accuracy of imputation from high density chips to whole genome sequence data in Brown Swiss cattle. Using four popular imputation programs (Beagle, FImpute, Impute2, Minimac) and various compositions of reference panels, the accuracy of the imputed sequence variant genotypes was high and differences between the programs and scenarios were small. We imputed sequence variant genotypes for more than 1600 Brown Swiss bulls and performed genome-wide association studies for milk fat percentage at two stages of lactation. We found one and three quantitative trait loci for early and late lactation fat content, respectively. Known causal variants that were imputed from the sequenced reference panel were among the most significantly associated variants of the genome-wide association study. Our study demonstrates that whole-genome sequence information can be imputed at high accuracy in cattle populations. Using imputed sequence variant genotypes in genome-wide association studies may facilitate causal variant detection.
Imputation of genotypes in Danish two-way crossbred pigs using low density panels
DEFF Research Database (Denmark)
Xiang, Tao; Christensen, Ole Fredslund; Legarra, Andres
Genotype imputation is commonly used as an initial step of genomic selection. Studies on humans, plants and ruminants suggested many factors would affect the performance of imputation. However, studies rarely investigated pigs, especially crossbred pigs. In this study, different scenarios...... of imputation from 5K SNPs to 7K SNPs on Danish Landrace, Yorkshire, and crossbred Landrace-Yorkshire were compared. In conclusion, genotype imputation on crossbreds performs equally well as in purebreds, when parental breeds are used as the reference panel. When the size of reference is considerably large...... SNPs. This dataset will be analyzed for genomic selection in a future study...
Imputation and quality control steps for combining multiple genome-wide datasets
Directory of Open Access Journals (Sweden)
Shefali S Verma
2014-12-01
Full Text Available The electronic MEdical Records and GEnomics (eMERGE network brings together DNA biobanks linked to electronic health records (EHRs from multiple institutions. Approximately 52,000 DNA samples from distinct individuals have been genotyped using genome-wide SNP arrays across the nine sites of the network. The eMERGE Coordinating Center and the Genomics Workgroup developed a pipeline to impute and merge genomic data across the different SNP arrays to maximize sample size and power to detect associations with a variety of clinical endpoints. The 1000 Genomes cosmopolitan reference panel was used for imputation. Imputation results were evaluated using the following metrics: accuracy of imputation, allelic R2 (estimated correlation between the imputed and true genotypes, and the relationship between allelic R2 and minor allele frequency. Computation time and memory resources required by two different software packages (BEAGLE and IMPUTE2 were also evaluated. A number of challenges were encountered due to the complexity of using two different imputation software packages, multiple ancestral populations, and many different genotyping platforms. We present lessons learned and describe the pipeline implemented here to impute and merge genomic data sets. The eMERGE imputed dataset will serve as a valuable resource for discovery, leveraging the clinical data that can be mined from the EHR.
Bolarinwa, O A; Adeola, O
2016-02-01
Direct or indirect methods can be used to determine the DE and ME of feed ingredients for pigs. In situations when only the indirect approach is suitable, the regression method presents a robust indirect approach. Three experiments were conducted to compare the direct and regression methods for determining the DE and ME values of barley, sorghum, and wheat for pigs. In each experiment, 24 barrows with an average initial BW of 31, 32, and 33 kg were assigned to 4 diets in a randomized complete block design. The 4 diets consisted of 969 g barley, sorghum, or wheat/kg plus minerals and vitamins for the direct method; a corn-soybean meal reference diet (RD); the RD + 300 g barley, sorghum, or wheat/kg; and the RD + 600 g barley, sorghum, or wheat/kg. The 3 corn-soybean meal diets were used for the regression method. Each diet was fed to 6 barrows in individual metabolism crates for a 5-d acclimation followed by a 5-d period of total but separate collection of feces and urine in each experiment. Graded substitution of barley or wheat, but not sorghum, into the RD linearly reduced ( direct method-derived DE and ME for barley were 3,669 and 3,593 kcal/kg DM, respectively. The regressions of barley contribution to DE and ME in kilocalories against the quantity of barley DMI in kilograms generated 3,746 kcal DE/kg DM and 3,647 kcal ME/kg DM. The DE and ME for sorghum by the direct method were 4,097 and 4,042 kcal/kg DM, respectively; the corresponding regression-derived estimates were 4,145 and 4,066 kcal/kg DM. Using the direct method, energy values for wheat were 3,953 kcal DE/kg DM and 3,889 kcal ME/kg DM. The regressions of wheat contribution to DE and ME in kilocalories against the quantity of wheat DMI in kilograms generated 3,960 kcal DE/kg DM and 3,874 kcal ME/kg DM. The DE and ME of barley using the direct method were not different (0.3 direct method-derived DE and ME of sorghum were not different (0.5 direct method- and regression method-derived DE (3,953 and 3
Directory of Open Access Journals (Sweden)
Patrice L. Capers
2015-03-01
Full Text Available BACKGROUND: Meta-research can involve manual retrieval and evaluation of research, which is resource intensive. Creation of high throughput methods (e.g., search heuristics, crowdsourcing has improved feasibility of large meta-research questions, but possibly at the cost of accuracy. OBJECTIVE: To evaluate the use of double sampling combined with multiple imputation (DS+MI to address meta-research questions, using as an example adherence of PubMed entries to two simple Consolidated Standards of Reporting Trials (CONSORT guidelines for titles and abstracts. METHODS: For the DS large sample, we retrieved all PubMed entries satisfying the filters: RCT; human; abstract available; and English language (n=322,107. For the DS subsample, we randomly sampled 500 entries from the large sample. The large sample was evaluated with a lower rigor, higher throughput (RLOTHI method using search heuristics, while the subsample was evaluated using a higher rigor, lower throughput (RHITLO human rating method. Multiple imputation of the missing-completely-at-random RHITLO data for the large sample was informed by: RHITLO data from the subsample; RLOTHI data from the large sample; whether a study was an RCT; and country and year of publication. RESULTS: The RHITLO and RLOTHI methods in the subsample largely agreed (phi coefficients: title=1.00, abstract=0.92. Compliance with abstract and title criteria has increased over time, with non-US countries improving more rapidly. DS+MI logistic regression estimates were more precise than subsample estimates (e.g., 95% CI for change in title and abstract compliance by Year: subsample RHITLO 1.050-1.174 vs. DS+MI 1.082-1.151. As evidence of improved accuracy, DS+MI coefficient estimates were closer to RHITLO than the large sample RLOTHI. CONCLUSIONS: Our results support our hypothesis that DS+MI would result in improved precision and accuracy. This method is flexible and may provide a practical way to examine large corpora of
Mansouri, Edris; Feizi, Faranak; Jafari Rad, Alireza; Arian, Mehran
2018-03-01
This paper uses multivariate regression to create a mathematical model for iron skarn exploration in the Sarvian area, central Iran, using multivariate regression for mineral prospectivity mapping (MPM). The main target of this paper is to apply multivariate regression analysis (as an MPM method) to map iron outcrops in the northeastern part of the study area in order to discover new iron deposits in other parts of the study area. Two types of multivariate regression models using two linear equations were employed to discover new mineral deposits. This method is one of the reliable methods for processing satellite images. ASTER satellite images (14 bands) were used as unique independent variables (UIVs), and iron outcrops were mapped as dependent variables for MPM. According to the results of the probability value (p value), coefficient of determination value (R2) and adjusted determination coefficient (Radj2), the second regression model (which consistent of multiple UIVs) fitted better than other models. The accuracy of the model was confirmed by iron outcrops map and geological observation. Based on field observation, iron mineralization occurs at the contact of limestone and intrusive rocks (skarn type).
Misyura, Maksym; Sukhai, Mahadeo A; Kulasignam, Vathany; Zhang, Tong; Kamel-Reid, Suzanne; Stockley, Tracy L
2018-02-01
A standard approach in test evaluation is to compare results of the assay in validation to results from previously validated methods. For quantitative molecular diagnostic assays, comparison of test values is often performed using simple linear regression and the coefficient of determination (R 2 ), using R 2 as the primary metric of assay agreement. However, the use of R 2 alone does not adequately quantify constant or proportional errors required for optimal test evaluation. More extensive statistical approaches, such as Bland-Altman and expanded interpretation of linear regression methods, can be used to more thoroughly compare data from quantitative molecular assays. We present the application of Bland-Altman and linear regression statistical methods to evaluate quantitative outputs from next-generation sequencing assays (NGS). NGS-derived data sets from assay validation experiments were used to demonstrate the utility of the statistical methods. Both Bland-Altman and linear regression were able to detect the presence and magnitude of constant and proportional error in quantitative values of NGS data. Deming linear regression was used in the context of assay comparison studies, while simple linear regression was used to analyse serial dilution data. Bland-Altman statistical approach was also adapted to quantify assay accuracy, including constant and proportional errors, and precision where theoretical and empirical values were known. The complementary application of the statistical methods described in this manuscript enables more extensive evaluation of performance characteristics of quantitative molecular assays, prior to implementation in the clinical molecular laboratory. © Article author(s) (or their employer(s) unless otherwise stated in the text of the article) 2018. All rights reserved. No commercial use is permitted unless otherwise expressly granted.
Understanding poisson regression.
Hayat, Matthew J; Higgins, Melinda
2014-04-01
Nurse investigators often collect study data in the form of counts. Traditional methods of data analysis have historically approached analysis of count data either as if the count data were continuous and normally distributed or with dichotomization of the counts into the categories of occurred or did not occur. These outdated methods for analyzing count data have been replaced with more appropriate statistical methods that make use of the Poisson probability distribution, which is useful for analyzing count data. The purpose of this article is to provide an overview of the Poisson distribution and its use in Poisson regression. Assumption violations for the standard Poisson regression model are addressed with alternative approaches, including addition of an overdispersion parameter or negative binomial regression. An illustrative example is presented with an application from the ENSPIRE study, and regression modeling of comorbidity data is included for illustrative purposes. Copyright 2014, SLACK Incorporated.
Antropov, K M; Varaksin, A N
2013-01-01
This paper provides the description of Land Use Regression (LUR) modeling and the result of its application in the study of nitrogen dioxide air pollution in Ekaterinburg. The paper describes the difficulties of the modeling for air pollution caused by motor vehicles exhaust, and the ways to address these challenges. To create LUR model of the NO2 air pollution in Ekaterinburg, concentrations of NO2 were measured, data on factors affecting air pollution were collected, a statistical analysis of the data were held. A statistical model of NO2 air pollution (coefficient of determination R2 = 0.70) and a map of pollution were created.
Ali, M Sanni; Groenwold, Rolf H H; Belitser, Svetlana V; Souverein, Patrick C; Martín, Elisa; Gatto, Nicolle M; Huerta, Consuelo; Gardarsdottir, Helga; Roes, Kit C B; Hoes, Arno W; de Boer, Antonius; Klungel, Olaf H
2016-01-01
BACKGROUND: Observational studies including time-varying treatments are prone to confounding. We compared time-varying Cox regression analysis, propensity score (PS) methods, and marginal structural models (MSMs) in a study of antidepressant [selective serotonin reuptake inhibitors (SSRIs)] use and
Freund, Rudolf J; Sa, Ping
2006-01-01
The book provides complete coverage of the classical methods of statistical analysis. It is designed to give students an understanding of the purpose of statistical analyses, to allow the student to determine, at least to some degree, the correct type of statistical analyses to be performed in a given situation, and have some appreciation of what constitutes good experimental design
Whole-Genome Sequencing Coupled to Imputation Discovers Genetic Signals for Anthropometric Traits
I. Tachmazidou (Ioanna); Süveges, D. (Dániel); J. Min (Josine); G.R.S. Ritchie (Graham R.S.); Steinberg, J. (Julia); K. Walter (Klaudia); V. Iotchkova (Valentina); J.A. Schwartzentruber (Jeremy); J. Huang (Jian); Y. Memari (Yasin); McCarthy, S. (Shane); Crawford, A.A. (Andrew A.); C. Bombieri (Cristina); M. Cocca (Massimiliano); A.-E. Farmaki (Aliki-Eleni); T.R. Gaunt (Tom); P. Jousilahti (Pekka); M.N. Kooijman (Marjolein ); Lehne, B. (Benjamin); G. Malerba (Giovanni); S. Männistö (Satu); A. Matchan (Angela); M.C. Medina-Gomez (Carolina); S. Metrustry (Sarah); A. Nag (Abhishek); I. Ntalla (Ioanna); L. Paternoster (Lavinia); N.W. Rayner (Nigel William); C. Sala (Cinzia); W.R. Scott (William R.); H.A. Shihab (Hashem A.); L. Southam (Lorraine); B. St Pourcain (Beate); M. Traglia (Michela); K. Trajanoska (Katerina); Zaza, G. (Gialuigi); W. Zhang (Weihua); M.S. Artigas; Bansal, N. (Narinder); M. Benn (Marianne); Chen, Z. (Zhongsheng); P. Danecek (Petr); Lin, W.-Y. (Wei-Yu); A. Locke (Adam); J. Luan (Jian'An); A.K. Manning (Alisa); Mulas, A. (Antonella); C. Sidore (Carlo); A. Tybjaerg-Hansen; A. Varbo (Anette); M. Zoledziewska (Magdalena); C. Finan (Chris); Hatzikotoulas, K. (Konstantinos); A.E. Hendricks (Audrey E.); J.P. Kemp (John); A. Moayyeri (Alireza); Panoutsopoulou, K. (Kalliope); Szpak, M. (Michal); S.G. Wilson (Scott); M. Boehnke (Michael); F. Cucca (Francesco); Di Angelantonio, E. (Emanuele); C. Langenberg (Claudia); C.M. Lindgren (Cecilia M.); McCarthy, M.I. (Mark I.); A.P. Morris (Andrew); B.G. Nordestgaard (Børge); R.A. Scott (Robert); M.D. Tobin (Martin); N.J. Wareham (Nick); P.R. Burton (Paul); J.C. Chambers (John); Smith, G.D. (George Davey); G.V. Dedoussis (George); J.F. Felix (Janine); O.H. Franco (Oscar); Gambaro, G. (Giovanni); P. Gasparini (Paolo); C.J. Hammond (Christopher J.); A. Hofman (Albert); V.W.V. Jaddoe (Vincent); M.E. Kleber (Marcus); J.S. Kooner (Jaspal S.); M. Perola (Markus); C.L. Relton (Caroline); S.M. Ring (Susan); F. Rivadeneira Ramirez (Fernando); V. Salomaa (Veikko); T.D. Spector (Timothy); O. Stegle (Oliver); D. Toniolo (Daniela); A.G. Uitterlinden (André); I.E. Barroso (Inês); C.M.T. Greenwood (Celia); Perry, J.R.B. (John R.B.); Walker, B.R. (Brian R.); A.S. Butterworth (Adam); Y. Xue (Yali); R. Durbin (Richard); K.S. Small (Kerrin); N. Soranzo (Nicole); N.J. Timpson (Nicholas); E. Zeggini (Eleftheria)
2016-01-01
textabstractDeep sequence-based imputation can enhance the discovery power of genome-wide association studies by assessing previously unexplored variation across the common- and low-frequency spectra. We applied a hybrid whole-genome sequencing (WGS) and deep imputation approach to examine the
Whole-Genome Sequencing Coupled to Imputation Discovers Genetic Signals for Anthropometric Traits
DEFF Research Database (Denmark)
Tachmazidou, Ioanna; Süveges, Dániel; Min, Josine L
2017-01-01
Deep sequence-based imputation can enhance the discovery power of genome-wide association studies by assessing previously unexplored variation across the common- and low-frequency spectra. We applied a hybrid whole-genome sequencing (WGS) and deep imputation approach to examine the broader alleli...
48 CFR 1830.7002-4 - Determining imputed cost of money.
2010-10-01
... money. 1830.7002-4 Section 1830.7002-4 Federal Acquisition Regulations System NATIONAL AERONAUTICS AND... Determining imputed cost of money. (a) Determine the imputed cost of money for an asset under construction, fabrication, or development by applying a cost of money rate (see 1830.7002-2) to the representative...
[Imputing missing data in public health: general concepts and application to dichotomous variables].
Hernández, Gilma; Moriña, David; Navarro, Albert
The presence of missing data in collected variables is common in health surveys, but the subsequent imputation thereof at the time of analysis is not. Working with imputed data may have certain benefits regarding the precision of the estimators and the unbiased identification of associations between variables. The imputation process is probably still little understood by many non-statisticians, who view this process as highly complex and with an uncertain goal. To clarify these questions, this note aims to provide a straightforward, non-exhaustive overview of the imputation process to enable public health researchers ascertain its strengths. All this in the context of dichotomous variables which are commonplace in public health. To illustrate these concepts, an example in which missing data is handled by means of simple and multiple imputation is introduced. Copyright © 2017 SESPAS. Publicado por Elsevier España, S.L.U. All rights reserved.
Imputing data that are missing at high rates using a boosting algorithm
Energy Technology Data Exchange (ETDEWEB)
Cauthen, Katherine Regina [Sandia National Lab. (SNL-NM), Albuquerque, NM (United States); Lambert, Gregory [Apple Inc., Cupertino, CA (United States); Ray, Jaideep [Sandia National Lab. (SNL-CA), Livermore, CA (United States); Lefantzi, Sophia [Sandia National Lab. (SNL-CA), Livermore, CA (United States)
2016-09-01
Traditional multiple imputation approaches may perform poorly for datasets with high rates of missingness unless many m imputations are used. This paper implements an alternative machine learning-based approach to imputing data that are missing at high rates. Here, we use boosting to create a strong learner from a weak learner fitted to a dataset missing many observations. This approach may be applied to a variety of types of learners (models). The approach is demonstrated by application to a spatiotemporal dataset for predicting dengue outbreaks in India from meteorological covariates. A Bayesian spatiotemporal CAR model is boosted to produce imputations, and the overall RMSE from a k-fold cross-validation is used to assess imputation accuracy.
Lee, Mi Hee; Lee, Soo Bong; Eo, Yang Dam; Kim, Sun Woong; Woo, Jung-Hun; Han, Soo Hee
2017-07-01
Landsat optical images have enough spatial and spectral resolution to analyze vegetation growth characteristics. But, the clouds and water vapor degrade the image quality quite often, which limits the availability of usable images for the time series vegetation vitality measurement. To overcome this shortcoming, simulated images are used as an alternative. In this study, weighted average method, spatial and temporal adaptive reflectance fusion model (STARFM) method, and multilinear regression analysis method have been tested to produce simulated Landsat normalized difference vegetation index (NDVI) images of the Korean Peninsula. The test results showed that the weighted average method produced the images most similar to the actual images, provided that the images were available within 1 month before and after the target date. The STARFM method gives good results when the input image date is close to the target date. Careful regional and seasonal consideration is required in selecting input images. During summer season, due to clouds, it is very difficult to get the images close enough to the target date. Multilinear regression analysis gives meaningful results even when the input image date is not so close to the target date. Average R 2 values for weighted average method, STARFM, and multilinear regression analysis were 0.741, 0.70, and 0.61, respectively.
Energy Technology Data Exchange (ETDEWEB)
Zhang, Yan-Feng; Dai, Shu-Gui [College of Environmental Science and Engineering, Nankai University, Key Laboratory for Pollution Process and Environmental Criteria of Ministry of Education, Tianjin (China); Ma, Yi [College of Chemistry, Nankai University, Institute of Elemento-Organic Chemistry, Tianjin (China); Gao, Zhi-Xian [Institute of Hygiene and Environmental Medicine, Tianjin (China)
2010-07-15
Immunoassays have been regarded as a possible alternative or supplement for measuring polycyclic aromatic hydrocarbons (PAHs) in the environment. Since there are too many potential cross-reactants for PAH immunoassays, it is difficult to determine all the cross-reactivities (CRs) by experimental tests. The relationship between CR and the physical-chemical properties of PAHs and related compounds was investigated using the CR data from a commercial enzyme-linked immunosorbent assay (ELISA) kit test. Two quantitative structure-activity relationship (QSAR) techniques, regression analysis and comparative molecular field analysis (CoMFA), were applied for predicting the CR of PAHs in this ELISA kit. Parabolic regression indicates that the CRs are significantly correlated with the logarithm of the partition coefficient for the octanol-water system (log K{sub ow}) (r{sup 2}=0.643, n=23, P<0.0001), suggesting that hydrophobic interactions play an important role in the antigen-antibody binding and the cross-reactions in this ELISA test. The CoMFA model obtained shows that the CRs of the PAHs are correlated with the 3D structure of the molecules (r{sub cv}{sup 2}=0.663, r{sup 2}=0.873, F{sub 4,32}=55.086). The contributions of the steric and electrostatic fields to CR were 40.4 and 59.6%, respectively. Both of the QSAR models satisfactorily predict the CR in this PAH immunoassay kit, and help in understanding the mechanisms of antigen-antibody interaction. (orig.)
Directory of Open Access Journals (Sweden)
Mehmet Das
2018-01-01
Full Text Available In this study, an air heated solar collector (AHSC dryer was designed to determine the drying characteristics of the pear. Flat pear slices of 10 mm thickness were used in the experiments. The pears were dried both in the AHSC dryer and under the sun. Panel glass temperature, panel floor temperature, panel inlet temperature, panel outlet temperature, drying cabinet inlet temperature, drying cabinet outlet temperature, drying cabinet temperature, drying cabinet moisture, solar radiation, pear internal temperature, air velocity and mass loss of pear were measured at 30 min intervals. Experiments were carried out during the periods of June 2017 in Elazig, Turkey. The experiments started at 8:00 a.m. and continued till 18:00. The experiments were continued until the weight changes in the pear slices stopped. Wet basis moisture content (MCw, dry basis moisture content (MCd, adjustable moisture ratio (MR, drying rate (DR, and convective heat transfer coefficient (hc were calculated with both in the AHSC dryer and the open sun drying experiment data. It was found that the values of hc in both drying systems with a range 12.4 and 20.8 W/m2 °C. Three different kernel models were used in the support vector machine (SVM regression to construct the predictive model of the calculated hc values for both systems. The mean absolute error (MAE, root mean squared error (RMSE, relative absolute error (RAE and root relative absolute error (RRAE analysis were performed to indicate the predictive model’s accuracy. As a result, the rate of drying of the pear was examined for both systems and it was observed that the pear had dried earlier in the AHSC drying system. A predictive model was obtained using the SVM regression for the calculated hc values for the pear in the AHSC drying system. The normalized polynomial kernel was determined as the best kernel model in SVM for estimating the hc values.
Gross, Samuel M; Tibshirani, Robert
2015-04-01
We consider the scenario where one observes an outcome variable and sets of features from multiple assays, all measured on the same set of samples. One approach that has been proposed for dealing with these type of data is "sparse multiple canonical correlation analysis" (sparse mCCA). All of the current sparse mCCA techniques are biconvex and thus have no guarantees about reaching a global optimum. We propose a method for performing sparse supervised canonical correlation analysis (sparse sCCA), a specific case of sparse mCCA when one of the datasets is a vector. Our proposal for sparse sCCA is convex and thus does not face the same difficulties as the other methods. We derive efficient algorithms for this problem that can be implemented with off the shelf solvers, and illustrate their use on simulated and real data. © The Author 2014. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com.
Tokunaga, Makoto; Watanabe, Susumu; Sonoda, Shigeru
2017-09-01
Multiple linear regression analysis is often used to predict the outcome of stroke rehabilitation. However, the predictive accuracy may not be satisfactory. The objective of this study was to elucidate the predictive accuracy of a method of calculating motor Functional Independence Measure (mFIM) at discharge from mFIM effectiveness predicted by multiple regression analysis. The subjects were 505 patients with stroke who were hospitalized in a convalescent rehabilitation hospital. The formula "mFIM at discharge = mFIM effectiveness × (91 points - mFIM at admission) + mFIM at admission" was used. By including the predicted mFIM effectiveness obtained through multiple regression analysis in this formula, we obtained the predicted mFIM at discharge (A). We also used multiple regression analysis to directly predict mFIM at discharge (B). The correlation between the predicted and the measured values of mFIM at discharge was compared between A and B. The correlation coefficients were .916 for A and .878 for B. Calculating mFIM at discharge from mFIM effectiveness predicted by multiple regression analysis had a higher degree of predictive accuracy of mFIM at discharge than that directly predicted. Copyright © 2017 National Stroke Association. Published by Elsevier Inc. All rights reserved.
Jacobsen, R. T.; Stewart, R. B.; Crain, R. W., Jr.; Rose, G. L.; Myers, A. F.
1976-01-01
A method was developed for establishing a rational choice of the terms to be included in an equation of state with a large number of adjustable coefficients. The methods presented were developed for use in the determination of an equation of state for oxygen and nitrogen. However, a general application of the methods is possible in studies involving the determination of an optimum polynomial equation for fitting a large number of data points. The data considered in the least squares problem are experimental thermodynamic pressure-density-temperature data. Attention is given to a description of stepwise multiple regression and the use of stepwise regression in the determination of an equation of state for oxygen and nitrogen.
Jović, Ozren; Smrečki, Neven; Popović, Zora
2016-04-01
A novel quantitative prediction and variable selection method called interval ridge regression (iRR) is studied in this work. The method is performed on six data sets of FTIR, two data sets of UV-vis and one data set of DSC. The obtained results show that models built with ridge regression on optimal variables selected with iRR significantly outperfom models built with ridge regression on all variables in both calibration (6 out of 9 cases) and validation (2 out of 9 cases). In this study, iRR is also compared with interval partial least squares regression (iPLS). iRR outperfomed iPLS in validation (insignificantly in 6 out of 9 cases and significantly in one out of 9 cases for poil, a well known health beneficial nutrient, is studied in this work by mixing it with cheap and widely used oils such as soybean (So) oil, rapeseed (R) oil and sunflower (Su) oil. Binary mixture sets of hempseed oil with these three oils (HSo, HR and HSu) and a ternary mixture set of H oil, R oil and Su oil (HRSu) were considered. The obtained accuracy indicates that using iRR on FTIR and UV-vis data, each particular oil can be very successfully quantified (in all 8 cases RMSEPoil (R(2)>0.99). Copyright © 2015 Elsevier B.V. All rights reserved.
Setiawan, Suhartono, Ahmad, Imam Safawi; Rahmawati, Noorgam Ika
2015-12-01
Bank Indonesia (BI) as the central bank of Republic Indonesiahas a single overarching objective to establish and maintain rupiah stability. This objective could be achieved by monitoring traffic of inflow and outflow money currency. Inflow and outflow are related to stock and distribution of money currency around Indonesia territory. It will effect of economic activities. Economic activities of Indonesia,as one of Moslem country, absolutely related to Islamic Calendar (lunar calendar), that different with Gregorian calendar. This research aims to forecast the inflow and outflow money currency of Representative Office (RO) of BI Semarang Central Java region. The results of the analysis shows that the characteristics of inflow and outflow money currency influenced by the effects of the calendar variations, that is the day of Eid al-Fitr (moslem holyday) as well as seasonal patterns. In addition, the period of a certain week during Eid al-Fitr also affect the increase of inflow and outflow money currency. The best model based on the value of the smallestRoot Mean Square Error (RMSE) for inflow data is ARIMA model. While the best model for predicting the outflow data in RO of BI Semarang is ARIMAX model or Time Series Regression, because both of them have the same model. The results forecast in a period of 2015 shows an increase of inflow money currency happened in August, while the increase in outflow money currency happened in July.
Directory of Open Access Journals (Sweden)
Matthias Schmid
Full Text Available Regression analysis with a bounded outcome is a common problem in applied statistics. Typical examples include regression models for percentage outcomes and the analysis of ratings that are measured on a bounded scale. In this paper, we consider beta regression, which is a generalization of logit models to situations where the response is continuous on the interval (0,1. Consequently, beta regression is a convenient tool for analyzing percentage responses. The classical approach to fit a beta regression model is to use maximum likelihood estimation with subsequent AIC-based variable selection. As an alternative to this established - yet unstable - approach, we propose a new estimation technique called boosted beta regression. With boosted beta regression estimation and variable selection can be carried out simultaneously in a highly efficient way. Additionally, both the mean and the variance of a percentage response can be modeled using flexible nonlinear covariate effects. As a consequence, the new method accounts for common problems such as overdispersion and non-binomial variance structures.
Doran, Kara S.; Howd, Peter A.; Sallenger,, Asbury H.
2016-01-04
This report documents the development of statistical tools used to quantify the hazard presented by the response of sea-level elevation to natural or anthropogenic changes in climate and ocean circulation. A hazard is a physical process (or processes) that, when combined with vulnerability (or susceptibility to the hazard), results in risk. This study presents the development and comparison of new and existing sea-level analysis methods, exploration of the strengths and weaknesses of the methods using synthetic time series, and when appropriate, synthesis of the application of the method to observed sea-level time series. These reports are intended to enhance material presented in peer-reviewed journal articles where it is not always possible to provide the level of detail that might be necessary to fully support or recreate published results.
International Nuclear Information System (INIS)
Hu, Chao; Jain, Gaurav; Zhang, Puqiang; Schmidt, Craig; Gomadam, Parthasarathy; Gorka, Tom
2014-01-01
Highlights: • We develop a data-driven method for the battery capacity estimation. • Five charge-related features that are indicative of the capacity are defined. • The kNN regression model captures the dependency of the capacity on the features. • Results with 10 years’ continuous cycling data verify the effectiveness of the method. - Abstract: Reliability of lithium-ion (Li-ion) rechargeable batteries used in implantable medical devices has been recognized as of high importance from a broad range of stakeholders, including medical device manufacturers, regulatory agencies, physicians, and patients. To ensure Li-ion batteries in these devices operate reliably, it is important to be able to assess the battery health condition by estimating the battery capacity over the life-time. This paper presents a data-driven method for estimating the capacity of Li-ion battery based on the charge voltage and current curves. The contributions of this paper are three-fold: (i) the definition of five characteristic features of the charge curves that are indicative of the capacity, (ii) the development of a non-linear kernel regression model, based on the k-nearest neighbor (kNN) regression, that captures the complex dependency of the capacity on the five features, and (iii) the adaptation of particle swarm optimization (PSO) to finding the optimal combination of feature weights for creating a kNN regression model that minimizes the cross validation (CV) error in the capacity estimation. Verification with 10 years’ continuous cycling data suggests that the proposed method is able to accurately estimate the capacity of Li-ion battery throughout the whole life-time
International Nuclear Information System (INIS)
Briggs, D.J.; De Hoogh, C.; Elliot, P.; Gulliver, J.; Wills, J.; Kingham, S.; Smallbone, K.
2000-01-01
Accurate, high-resolution maps of traffic-related air pollution are needed both as a basis for assessing exposures as part of epidemiological studies, and to inform urban air-quality policy and traffic management. This paper assesses the use of a GIS-based, regression mapping technique to model spatial patterns of traffic-related air pollution. The model - developed using data from 80 passive sampler sites in Huddersfield, as part of the SAVIAH (Small Area Variations in Air Quality and Health) project - uses data on traffic flows and land cover in the 300-m buffer zone around each site, and altitude of the site, as predictors of NO 2 concentrations. It was tested here by application in four urban areas in the UK: Huddersfield (for the year following that used for initial model development), Sheffield, Northampton, and part of London. In each case, a GIS was built in ArcInfo, integrating relevant data on road traffic, urban land use and topography. Monitoring of NO 2 was undertaken using replicate passive samplers (in London, data were obtained from surveys carried out as part of the London network). In Huddersfield, Sheffield and Northampton, the model was first calibrated by comparing modelled results with monitored NO 2 concentrations at 10 randomly selected sites; the calibrated model was then validated against data from a further 10-28 sites. In London, where data for only 11 sites were available, validation was not undertaken. Results showed that the model performed well in all cases. After local calibration, the model gave estimates of mean annual NO 2 concentrations within a factor of 1.5 of the actual mean (approx. 70-90%) of the time and within a factor of 2 between 70 and 100% of the time. r 2 values between modelled and observed concentrations are in the range of 0.58-0.76. These results are comparable to those achieved by more sophisticated dispersion models. The model also has several advantages over dispersion modelling. It is able, for example, to
Genotype Imputation for Latinos Using the HapMap and 1000 Genomes Project Reference Panels
Directory of Open Access Journals (Sweden)
Xiaoyi eGao
2012-06-01
Full Text Available Genotype imputation is a vital tool in genome-wide association studies (GWAS and meta-analyses of multiple GWAS results. Imputation enables researchers to increase genomic coverage and to pool data generated using different genotyping platforms. HapMap samples are often employed as the reference panel. More recently, the 1000 Genomes Project resource is becoming the primary source for reference panels. Multiple GWAS and meta-analyses are targeting Latinos, the most populous and fastest growing minority group in the US. However, genotype imputation resources for Latinos are rather limited compared to individuals of European ancestry at present, largely because of the lack of good reference data. One choice of reference panel for Latinos is one derived from the population of Mexican individuals in Los Angeles contained in the HapMap Phase 3 project and the 1000 Genomes Project. However, a detailed evaluation of the quality of the imputed genotypes derived from the public reference panels has not yet been reported. Using simulation studies, the Illumina OmniExpress GWAS data from the Los Angles Latino Eye Study and the MACH software package, we evaluated the accuracy of genotype imputation in Latinos. Our results show that the 1000 Genomes Project AMR+CEU+YRI reference panel provides the highest imputation accuracy for Latinos, and that also including Asian samples in the panel can reduce imputation accuracy. We also provide the imputation accuracy for each autosomal chromosome using the 1000 Genomes Project panel for Latinos. Our results serve as a guide to future imputation-based analysis in Latinos.
Silva, João Paulo Santos; Mônaco, Luciana da Mata; Paschoal, André Monteiro; Oliveira, Ícaro Agenor Ferreira de; Leoni, Renata Ferranti
2018-05-16
Arterial spin labeling (ASL) is an established magnetic resonance imaging (MRI) technique that is finding broader applications in functional studies of the healthy and diseased brain. To promote improvement in cerebral blood flow (CBF) signal specificity, many algorithms and imaging procedures, such as subtraction methods, were proposed to eliminate or, at least, minimize noise sources. Therefore, this study addressed the main considerations of how CBF functional connectivity (FC) is changed, regarding resting brain network (RBN) identification and correlations between regions of interest (ROI), by different subtraction methods and removal of residual motion artifacts and global signal fluctuations (RMAGSF). Twenty young healthy participants (13 M/7F, mean age = 25 ± 3 years) underwent an MRI protocol with a pseudo-continuous ASL (pCASL) sequence. Perfusion-based images were obtained using simple, sinc and running subtraction. RMAGSF removal was applied to all CBF time series. Independent Component Analysis (ICA) was used for RBN identification, while Pearson' correlation was performed for ROI-based FC analysis. Temporal signal-to-noise ratio (tSNR) was higher in CBF maps obtained by sinc subtraction, although RMAGSF removal had a significant effect on maps obtained with simple and running subtractions. Neither the subtraction method nor the RMAGSF removal directly affected the identification of RBNs. However, the number of correlated and anti-correlated voxels varied for different subtraction and filtering methods. In an ROI-to-ROI level, changes were prominent in FC values and their statistical significance. Our study showed that both RMAGSF filtering and subtraction method might influence resting-state FC results, especially in an ROI level, consequently affecting FC analysis and its interpretation. Taking our results and the whole discussion together, we understand that for an exploratory assessment of the brain, one could avoid removing RMAGSF to
Al-Harrasi, Ahmed; Rehman, Najeeb Ur; Mabood, Fazal; Albroumi, Muhammaed; Ali, Liaqat; Hussain, Javid; Hussain, Hidayat; Csuk, René; Khan, Abdul Latif; Alam, Tanveer; Alameri, Saif
2017-09-01
In the present study, for the first time, NIR spectroscopy coupled with PLS regression as a rapid and alternative method was developed to quantify the amount of Keto-β-Boswellic Acid (KBA) in different plant parts of Boswellia sacra and the resin exudates of the trunk. NIR spectroscopy was used for the measurement of KBA standards and B. sacra samples in absorption mode in the wavelength range from 700-2500 nm. PLS regression model was built from the obtained spectral data using 70% of KBA standards (training set) in the range from 0.1 ppm to 100 ppm. The PLS regression model obtained was having R-square value of 98% with 0.99 corelationship value and having good prediction with RMSEP value 3.2 and correlation of 0.99. It was then used to quantify the amount of KBA in the samples of B. sacra. The results indicated that the MeOH extract of resin has the highest concentration of KBA (0.6%) followed by essential oil (0.1%). However, no KBA was found in the aqueous extract. The MeOH extract of the resin was subjected to column chromatography to get various sub-fractions at different polarity of organic solvents. The sub-fraction at 4% MeOH/CHCl3 (4.1% of KBA) was found to contain the highest percentage of KBA followed by another sub-fraction at 2% MeOH/CHCl3 (2.2% of KBA). The present results also indicated that KBA is only present in the gum-resin of the trunk and not in all parts of the plant. These results were further confirmed through HPLC analysis and therefore it is concluded that NIRS coupled with PLS regression is a rapid and alternate method for quantification of KBA in Boswellia sacra. It is non-destructive, rapid, sensitive and uses simple methods of sample preparation.
DEFF Research Database (Denmark)
Ma, Peipei; Lund, Mogens Sandø; Ding, X
2015-01-01
This study investigated the effect of including Nordic Holsteins in the reference population on the imputation accuracy and prediction accuracy for Chinese Holsteins. The data used in this study include 85 Chinese Holstein bulls genotyped with both 54K chip and 777K (HD) chip, 2862 Chinese cows...... was improved slightly when using the marker data imputed based on the combined HD reference data, compared with using the marker data imputed based on the Chinese HD reference data only. On the other hand, when using the combined reference population including 4398 Nordic Holstein bulls, the accuracy...... to increase reference population rather than increasing marker density...
Polynomial regression analysis and significance test of the regression function
International Nuclear Information System (INIS)
Gao Zhengming; Zhao Juan; He Shengping
2012-01-01
In order to analyze the decay heating power of a certain radioactive isotope per kilogram with polynomial regression method, the paper firstly demonstrated the broad usage of polynomial function and deduced its parameters with ordinary least squares estimate. Then significance test method of polynomial regression function is derived considering the similarity between the polynomial regression model and the multivariable linear regression model. Finally, polynomial regression analysis and significance test of the polynomial function are done to the decay heating power of the iso tope per kilogram in accord with the authors' real work. (authors)
Multiple imputation of rainfall missing data in the Iberian Mediterranean context
Miró, Juan Javier; Caselles, Vicente; Estrela, María José
2017-11-01
Given the increasing need for complete rainfall data networks, in recent years have been proposed diverse methods for filling gaps in observed precipitation series, progressively more advanced that traditional approaches to overcome the problem. The present study has consisted in validate 10 methods (6 linear, 2 non-linear and 2 hybrid) that allow multiple imputation, i.e., fill at the same time missing data of multiple incomplete series in a dense network of neighboring stations. These were applied for daily and monthly rainfall in two sectors in the Júcar River Basin Authority (east Iberian Peninsula), which is characterized by a high spatial irregularity and difficulty of rainfall estimation. A classification of precipitation according to their genetic origin was applied as pre-processing, and a quantile-mapping adjusting as post-processing technique. The results showed in general a better performance for the non-linear and hybrid methods, highlighting that the non-linear PCA (NLPCA) method outperforms considerably the Self Organizing Maps (SOM) method within non-linear approaches. On linear methods, the Regularized Expectation Maximization method (RegEM) was the best, but far from NLPCA. Applying EOF filtering as post-processing of NLPCA (hybrid approach) yielded the best results.
Seber, George A F
2012-01-01
Concise, mathematically clear, and comprehensive treatment of the subject.* Expanded coverage of diagnostics and methods of model fitting.* Requires no specialized knowledge beyond a good grasp of matrix algebra and some acquaintance with straight-line regression and simple analysis of variance models.* More than 200 problems throughout the book plus outline solutions for the exercises.* This revision has been extensively class-tested.
Balabin, Roman M; Lomakina, Ekaterina I
2011-04-21
In this study, we make a general comparison of the accuracy and robustness of five multivariate calibration models: partial least squares (PLS) regression or projection to latent structures, polynomial partial least squares (Poly-PLS) regression, artificial neural networks (ANNs), and two novel techniques based on support vector machines (SVMs) for multivariate data analysis: support vector regression (SVR) and least-squares support vector machines (LS-SVMs). The comparison is based on fourteen (14) different datasets: seven sets of gasoline data (density, benzene content, and fractional composition/boiling points), two sets of ethanol gasoline fuel data (density and ethanol content), one set of diesel fuel data (total sulfur content), three sets of petroleum (crude oil) macromolecules data (weight percentages of asphaltenes, resins, and paraffins), and one set of petroleum resins data (resins content). Vibrational (near-infrared, NIR) spectroscopic data are used to predict the properties and quality coefficients of gasoline, biofuel/biodiesel, diesel fuel, and other samples of interest. The four systems presented here range greatly in composition, properties, strength of intermolecular interactions (e.g., van der Waals forces, H-bonds), colloid structure, and phase behavior. Due to the high diversity of chemical systems studied, general conclusions about SVM regression methods can be made. We try to answer the following question: to what extent can SVM-based techniques replace ANN-based approaches in real-world (industrial/scientific) applications? The results show that both SVR and LS-SVM methods are comparable to ANNs in accuracy. Due to the much higher robustness of the former, the SVM-based approaches are recommended for practical (industrial) application. This has been shown to be especially true for complicated, highly nonlinear objects.
Farhadian, Maryam; Aliabadi, Mohsen; Darvishi, Ebrahim
2015-01-01
Prediction models are used in a variety of medical domains, and they are frequently built from experience which constitutes data acquired from actual cases. This study aimed to analyze the potential of artificial neural networks and logistic regression techniques for estimation of hearing impairment among industrial workers. A total of 210 workers employed in a steel factory (in West of Iran) were selected, and their occupational exposure histories were analyzed. The hearing loss thresholds of the studied workers were determined using a calibrated audiometer. The personal noise exposures were also measured using a noise dosimeter in the workstations. Data obtained from five variables, which can influence the hearing loss, were used as input features, and the hearing loss thresholds were considered as target feature of the prediction methods. Multilayer feedforward neural networks and logistic regression were developed using MATLAB R2011a software. Based on the World Health Organization classification for the grades of hearing loss, 74.2% of the studied workers have normal hearing thresholds, 23.4% have slight hearing loss, and 2.4% have moderate hearing loss. The accuracy and kappa coefficient of the best developed neural networks for prediction of the grades of hearing loss were 88.6 and 66.30, respectively. The accuracy and kappa coefficient of the logistic regression were also 84.28 and 51.30, respectively. Neural networks could provide more accurate predictions of the hearing loss than logistic regression. The prediction method can provide reliable and comprehensible information for occupational health and medicine experts.
Donnelly, Aoife; Misstear, Bruce; Broderick, Brian
2011-02-15
Background concentrations of nitrogen dioxide (NO(2)) are not constant but vary temporally and spatially. The current paper presents a powerful tool for the quantification of the effects of wind direction and wind speed on background NO(2) concentrations, particularly in cases where monitoring data are limited. In contrast to previous studies which applied similar methods to sites directly affected by local pollution sources, the current study focuses on background sites with the aim of improving methods for predicting background concentrations adopted in air quality modelling studies. The relationship between measured NO(2) concentration in air at three such sites in Ireland and locally measured wind direction has been quantified using nonparametric regression methods. The major aim was to analyse a method for quantifying the effects of local wind direction on background levels of NO(2) in Ireland. The method was expanded to include wind speed as an added predictor variable. A Gaussian kernel function is used in the analysis and circular statistics employed for the wind direction variable. Wind direction and wind speed were both found to have a statistically significant effect on background levels of NO(2) at all three sites. Frequently environmental impact assessments are based on short term baseline monitoring producing a limited dataset. The presented non-parametric regression methods, in contrast to the frequently used methods such as binning of the data, allow concentrations for missing data pairs to be estimated and distinction between spurious and true peaks in concentrations to be made. The methods were found to provide a realistic estimation of long term concentration variation with wind direction and speed, even for cases where the data set is limited. Accurate identification of the actual variation at each location and causative factors could be made, thus supporting the improved definition of background concentrations for use in air quality modelling
Linear regression in astronomy. I
Isobe, Takashi; Feigelson, Eric D.; Akritas, Michael G.; Babu, Gutti Jogesh
1990-01-01
Five methods for obtaining linear regression fits to bivariate data with unknown or insignificant measurement errors are discussed: ordinary least-squares (OLS) regression of Y on X, OLS regression of X on Y, the bisector of the two OLS lines, orthogonal regression, and 'reduced major-axis' regression. These methods have been used by various researchers in observational astronomy, most importantly in cosmic distance scale applications. Formulas for calculating the slope and intercept coefficients and their uncertainties are given for all the methods, including a new general form of the OLS variance estimates. The accuracy of the formulas was confirmed using numerical simulations. The applicability of the procedures is discussed with respect to their mathematical properties, the nature of the astronomical data under consideration, and the scientific purpose of the regression. It is found that, for problems needing symmetrical treatment of the variables, the OLS bisector performs significantly better than orthogonal or reduced major-axis regression.
Classification and regression trees
Breiman, Leo; Olshen, Richard A; Stone, Charles J
1984-01-01
The methodology used to construct tree structured rules is the focus of this monograph. Unlike many other statistical procedures, which moved from pencil and paper to calculators, this text's use of trees was unthinkable before computers. Both the practical and theoretical sides have been developed in the authors' study of tree methods. Classification and Regression Trees reflects these two sides, covering the use of trees as a data analysis method, and in a more mathematical framework, proving some of their fundamental properties.
RIDDLE: Race and ethnicity Imputation from Disease history with Deep LEarning
Kim, Ji-Sung; Gao, Xin; Rzhetsky, Andrey
2018-01-01
are predictive of race and ethnicity. We used these characterizations of informative features to perform a systematic comparison of differential disease patterns by race and ethnicity. The fact that clinical histories are informative for imputing race
Barnwell-Ménard, Jean-Louis; Li, Qing; Cohen, Alan A
2015-03-15
The loss of signal associated with categorizing a continuous variable is well known, and previous studies have demonstrated that this can lead to an inflation of Type-I error when the categorized variable is a confounder in a regression analysis estimating the effect of an exposure on an outcome. However, it is not known how the Type-I error may vary under different circumstances, including logistic versus linear regression, different distributions of the confounder, and different categorization methods. Here, we analytically quantified the effect of categorization and then performed a series of 9600 Monte Carlo simulations to estimate the Type-I error inflation associated with categorization of a confounder under different regression scenarios. We show that Type-I error is unacceptably high (>10% in most scenarios and often 100%). The only exception was when the variable categorized was a continuous mixture proxy for a genuinely dichotomous latent variable, where both the continuous proxy and the categorized variable are error-ridden proxies for the dichotomous latent variable. As expected, error inflation was also higher with larger sample size, fewer categories, and stronger associations between the confounder and the exposure or outcome. We provide online tools that can help researchers estimate the potential error inflation and understand how serious a problem this is. Copyright © 2014 John Wiley & Sons, Ltd.
Directory of Open Access Journals (Sweden)
Mok Tik
2014-06-01
Full Text Available This study formulates regression of vector data that will enable statistical analysis of various geodetic phenomena such as, polar motion, ocean currents, typhoon/hurricane tracking, crustal deformations, and precursory earthquake signals. The observed vector variable of an event (dependent vector variable is expressed as a function of a number of hypothesized phenomena realized also as vector variables (independent vector variables and/or scalar variables that are likely to impact the dependent vector variable. The proposed representation has the unique property of solving the coefficients of independent vector variables (explanatory variables also as vectors, hence it supersedes multivariate multiple regression models, in which the unknown coefficients are scalar quantities. For the solution, complex numbers are used to rep- resent vector information, and the method of least squares is deployed to estimate the vector model parameters after transforming the complex vector regression model into a real vector regression model through isomorphism. Various operational statistics for testing the predictive significance of the estimated vector parameter coefficients are also derived. A simple numerical example demonstrates the use of the proposed vector regression analysis in modeling typhoon paths.
Simple nuclear norm based algorithms for imputing missing data and forecasting in time series
Butcher, Holly Louise; Gillard, Jonathan William
2017-01-01
There has been much recent progress on the use of the nuclear norm for the so-called matrix completion problem (the problem of imputing missing values of a matrix). In this paper we investigate the use of the nuclear norm for modelling time series, with particular attention to imputing missing data and forecasting. We introduce a simple alternating projections type algorithm based on the nuclear norm for these tasks, and consider a number of practical examples.
Differentiating regressed melanoma from regressed lichenoid keratosis.
Chan, Aegean H; Shulman, Kenneth J; Lee, Bonnie A
2017-04-01
Distinguishing regressed lichen planus-like keratosis (LPLK) from regressed melanoma can be difficult on histopathologic examination, potentially resulting in mismanagement of patients. We aimed to identify histopathologic features by which regressed melanoma can be differentiated from regressed LPLK. Twenty actively inflamed LPLK, 12 LPLK with regression and 15 melanomas with regression were compared and evaluated by hematoxylin and eosin staining as well as Melan-A, microphthalmia transcription factor (MiTF) and cytokeratin (AE1/AE3) immunostaining. (1) A total of 40% of regressed melanomas showed complete or near complete loss of melanocytes within the epidermis with Melan-A and MiTF immunostaining, while 8% of regressed LPLK exhibited this finding. (2) Necrotic keratinocytes were seen in the epidermis in 33% regressed melanomas as opposed to all of the regressed LPLK. (3) A dense infiltrate of melanophages in the papillary dermis was seen in 40% of regressed melanomas, a feature not seen in regressed LPLK. In summary, our findings suggest that a complete or near complete loss of melanocytes within the epidermis strongly favors a regressed melanoma over a regressed LPLK. In addition, necrotic epidermal keratinocytes and the presence of a dense band-like distribution of dermal melanophages can be helpful in differentiating these lesions. © 2016 John Wiley & Sons A/S. Published by John Wiley & Sons Ltd.
Fridgeirsdottir, Gudrun A; Harris, Robert J; Dryden, Ian L; Fischer, Peter M; Roberts, Clive J
2018-03-29
Solid dispersions can be a successful way to enhance the bioavailability of poorly soluble drugs. Here 60 solid dispersion formulations were produced using ten chemically diverse, neutral, poorly soluble drugs, three commonly used polymers, and two manufacturing techniques, spray-drying and melt extrusion. Each formulation underwent a six-month stability study at accelerated conditions, 40 °C and 75% relative humidity (RH). Significant differences in times to crystallization (onset of crystallization) were observed between both the different polymers and the two processing methods. Stability from zero days to over one year was observed. The extensive experimental data set obtained from this stability study was used to build multiple linear regression models to correlate physicochemical properties of the active pharmaceutical ingredients (API) with the stability data. The purpose of these models is to indicate which combination of processing method and polymer carrier is most likely to give a stable solid dispersion. Six quantitative mathematical multiple linear regression-based models were produced based on selection of the most influential independent physical and chemical parameters from a set of 33 possible factors, one model for each combination of polymer and processing method, with good predictability of stability. Three general rules are proposed from these models for the formulation development of suitably stable solid dispersions. Namely, increased stability is correlated with increased glass transition temperature ( T g ) of solid dispersions, as well as decreased number of H-bond donors and increased molecular flexibility (such as rotatable bonds and ring count) of the drug molecule.
Missing Value Imputation Based on Gaussian Mixture Model for the Internet of Things
Directory of Open Access Journals (Sweden)
Xiaobo Yan
2015-01-01
Full Text Available This paper addresses missing value imputation for the Internet of Things (IoT. Nowadays, the IoT has been used widely and commonly by a variety of domains, such as transportation and logistics domain and healthcare domain. However, missing values are very common in the IoT for a variety of reasons, which results in the fact that the experimental data are incomplete. As a result of this, some work, which is related to the data of the IoT, can’t be carried out normally. And it leads to the reduction in the accuracy and reliability of the data analysis results. This paper, for the characteristics of the data itself and the features of missing data in IoT, divides the missing data into three types and defines three corresponding missing value imputation problems. Then, we propose three new models to solve the corresponding problems, and they are model of missing value imputation based on context and linear mean (MCL, model of missing value imputation based on binary search (MBS, and model of missing value imputation based on Gaussian mixture model (MGI. Experimental results showed that the three models can improve the accuracy, reliability, and stability of missing value imputation greatly and effectively.
Rossi, M.; Apuani, T.; Felletti, F.
2009-04-01
The aim of this paper is to compare the results of two statistical methods for landslide susceptibility analysis: 1) univariate probabilistic method based on landslide susceptibility index, 2) multivariate method (logistic regression). The study area is the Febbraro valley, located in the central Italian Alps, where different types of metamorphic rocks croup out. On the eastern part of the studied basin a quaternary cover represented by colluvial and secondarily, by glacial deposits, is dominant. In this study 110 earth flows, mainly located toward NE portion of the catchment, were analyzed. They involve only the colluvial deposits and their extension mainly ranges from 36 to 3173 m2. Both statistical methods require to establish a spatial database, in which each landslide is described by several parameters that can be assigned using a main scarp central point of landslide. The spatial database is constructed using a Geographical Information System (GIS). Each landslide is described by several parameters corresponding to the value of main scarp central point of the landslide. Based on bibliographic review a total of 15 predisposing factors were utilized. The width of the intervals, in which the maps of the predisposing factors have to be reclassified, has been defined assuming constant intervals to: elevation (100 m), slope (5 °), solar radiation (0.1 MJ/cm2/year), profile curvature (1.2 1/m), tangential curvature (2.2 1/m), drainage density (0.5), lineament density (0.00126). For the other parameters have been used the results of the probability-probability plots analysis and the statistical indexes of landslides site. In particular slope length (0 ÷ 2, 2 ÷ 5, 5 ÷ 10, 10 ÷ 20, 20 ÷ 35, 35 ÷ 260), accumulation flow (0 ÷ 1, 1 ÷ 2, 2 ÷ 5, 5 ÷ 12, 12 ÷ 60, 60 ÷27265), Topographic Wetness Index 0 ÷ 0.74, 0.74 ÷ 1.94, 1.94 ÷ 2.62, 2.62 ÷ 3.48, 3.48 ÷ 6,00, 6.00 ÷ 9.44), Stream Power Index (0 ÷ 0.64, 0.64 ÷ 1.28, 1.28 ÷ 1.81, 1.81 ÷ 4.20, 4.20 ÷ 9
Ozdemir, Adnan
2011-07-01
SummaryThe purpose of this study is to produce a groundwater spring potential map of the Sultan Mountains in central Turkey, based on a logistic regression method within a Geographic Information System (GIS) environment. Using field surveys, the locations of the springs (440 springs) were determined in the study area. In this study, 17 spring-related factors were used in the analysis: geology, relative permeability, land use/land cover, precipitation, elevation, slope, aspect, total curvature, plan curvature, profile curvature, wetness index, stream power index, sediment transport capacity index, distance to drainage, distance to fault, drainage density, and fault density map. The coefficients of the predictor variables were estimated using binary logistic regression analysis and were used to calculate the groundwater spring potential for the entire study area. The accuracy of the final spring potential map was evaluated based on the observed springs. The accuracy of the model was evaluated by calculating the relative operating characteristics. The area value of the relative operating characteristic curve model was found to be 0.82. These results indicate that the model is a good estimator of the spring potential in the study area. The spring potential map shows that the areas of very low, low, moderate and high groundwater spring potential classes are 105.586 km 2 (28.99%), 74.271 km 2 (19.906%), 101.203 km 2 (27.14%), and 90.05 km 2 (24.671%), respectively. The interpretations of the potential map showed that stream power index, relative permeability of lithologies, geology, elevation, aspect, wetness index, plan curvature, and drainage density play major roles in spring occurrence and distribution in the Sultan Mountains. The logistic regression approach has not yet been used to delineate groundwater potential zones. In this study, the logistic regression method was used to locate potential zones for groundwater springs in the Sultan Mountains. The evolved model
Pedrini, D. T.; Pedrini, Bonnie C.
Regression, another mechanism studied by Sigmund Freud, has had much research, e.g., hypnotic regression, frustration regression, schizophrenic regression, and infra-human-animal regression (often directly related to fixation). Many investigators worked with hypnotic age regression, which has a long history, going back to Russian reflexologists.…
Ozdemir, Adnan; Altural, Tolga
2013-03-01
This study evaluated and compared landslide susceptibility maps produced with three different methods, frequency ratio, weights of evidence, and logistic regression, by using validation datasets. The field surveys performed as part of this investigation mapped the locations of 90 landslides that had been identified in the Sultan Mountains of south-western Turkey. The landslide influence parameters used for this study are geology, relative permeability, land use/land cover, precipitation, elevation, slope, aspect, total curvature, plan curvature, profile curvature, wetness index, stream power index, sediment transportation capacity index, distance to drainage, distance to fault, drainage density, fault density, and spring density maps. The relationships between landslide distributions and these parameters were analysed using the three methods, and the results of these methods were then used to calculate the landslide susceptibility of the entire study area. The accuracy of the final landslide susceptibility maps was evaluated based on the landslides observed during the fieldwork, and the accuracy of the models was evaluated by calculating each model's relative operating characteristic curve. The predictive capability of each model was determined from the area under the relative operating characteristic curve and the areas under the curves obtained using the frequency ratio, logistic regression, and weights of evidence methods are 0.976, 0.952, and 0.937, respectively. These results indicate that the frequency ratio and weights of evidence models are relatively good estimators of landslide susceptibility in the study area. Specifically, the results of the correlation analysis show a high correlation between the frequency ratio and weights of evidence results, and the frequency ratio and logistic regression methods exhibit correlation coefficients of 0.771 and 0.727, respectively. The frequency ratio model is simple, and its input, calculation and output processes are
Toghiani, S; Aggrey, S E; Rekaya, R
2016-07-01
Availability of high-density single nucleotide polymorphism (SNP) genotyping platforms provided unprecedented opportunities to enhance breeding programmes in livestock, poultry and plant species, and to better understand the genetic basis of complex traits. Using this genomic information, genomic breeding values (GEBVs), which are more accurate than conventional breeding values. The superiority of genomic selection is possible only when high-density SNP panels are used to track genes and QTLs affecting the trait. Unfortunately, even with the continuous decrease in genotyping costs, only a small fraction of the population has been genotyped with these high-density panels. It is often the case that a larger portion of the population is genotyped with low-density and low-cost SNP panels and then imputed to a higher density. Accuracy of SNP genotype imputation tends to be high when minimum requirements are met. Nevertheless, a certain rate of genotype imputation errors is unavoidable. Thus, it is reasonable to assume that the accuracy of GEBVs will be affected by imputation errors; especially, their cumulative effects over time. To evaluate the impact of multi-generational selection on the accuracy of SNP genotypes imputation and the reliability of resulting GEBVs, a simulation was carried out under varying updating of the reference population, distance between the reference and testing sets, and the approach used for the estimation of GEBVs. Using fixed reference populations, imputation accuracy decayed by about 0.5% per generation. In fact, after 25 generations, the accuracy was only 7% lower than the first generation. When the reference population was updated by either 1% or 5% of the top animals in the previous generations, decay of imputation accuracy was substantially reduced. These results indicate that low-density panels are useful, especially when the generational interval between reference and testing population is small. As the generational interval
Subset selection in regression
Miller, Alan
2002-01-01
Originally published in 1990, the first edition of Subset Selection in Regression filled a significant gap in the literature, and its critical and popular success has continued for more than a decade. Thoroughly revised to reflect progress in theory, methods, and computing power, the second edition promises to continue that tradition. The author has thoroughly updated each chapter, incorporated new material on recent developments, and included more examples and references. New in the Second Edition:A separate chapter on Bayesian methodsComplete revision of the chapter on estimationA major example from the field of near infrared spectroscopyMore emphasis on cross-validationGreater focus on bootstrappingStochastic algorithms for finding good subsets from large numbers of predictors when an exhaustive search is not feasible Software available on the Internet for implementing many of the algorithms presentedMore examplesSubset Selection in Regression, Second Edition remains dedicated to the techniques for fitting...
Sheikh, Mashhood Ahmed; Abelsen, Birgit; Olsen, Jan Abel
2017-11-01
Previous methods for assessing mediation assume no multiplicative interactions. The inverse odds weighting (IOW) approach has been presented as a method that can be used even when interactions exist. The substantive aim of this study was to assess the indirect effect of education on health and well-being via four indicators of adult socioeconomic status (SES): income, management position, occupational hierarchy position and subjective social status. 8516 men and women from the Tromsø Study (Norway) were followed for 17 years. Education was measured at age 25-74 years, while SES and health and well-being were measured at age 42-91 years. Natural direct and indirect effects (NIE) were estimated using weighted Poisson regression models with IOW. Stata code is provided that makes it easy to assess mediation in any multiple imputed dataset with multiple mediators and interactions. Low education was associated with lower SES. Consequently, low SES was associated with being unhealthy and having a low level of well-being. The effect (NIE) of education on health and well-being is mediated by income, management position, occupational hierarchy position and subjective social status. This study contributes to the literature on mediation analysis, as well as the literature on the importance of education for health-related quality of life and subjective well-being. The influence of education on health and well-being had different pathways in this Norwegian sample. © Article author(s) (or their employer(s) unless otherwise stated in the text of the article) 2017. All rights reserved. No commercial use is permitted unless otherwise expressly granted.
Nearest neighbor imputation using spatial-temporal correlations in wireless sensor networks.
Li, YuanYuan; Parker, Lynne E
2014-01-01
Missing data is common in Wireless Sensor Networks (WSNs), especially with multi-hop communications. There are many reasons for this phenomenon, such as unstable wireless communications, synchronization issues, and unreliable sensors. Unfortunately, missing data creates a number of problems for WSNs. First, since most sensor nodes in the network are battery-powered, it is too expensive to have the nodes retransmit missing data across the network. Data re-transmission may also cause time delays when detecting abnormal changes in an environment. Furthermore, localized reasoning techniques on sensor nodes (such as machine learning algorithms to classify states of the environment) are generally not robust enough to handle missing data. Since sensor data collected by a WSN is generally correlated in time and space, we illustrate how replacing missing sensor values with spatially and temporally correlated sensor values can significantly improve the network's performance. However, our studies show that it is important to determine which nodes are spatially and temporally correlated with each other. Simple techniques based on Euclidean distance are not sufficient for complex environmental deployments. Thus, we have developed a novel Nearest Neighbor (NN) imputation method that estimates missing data in WSNs by learning spatial and temporal correlations between sensor nodes. To improve the search time, we utilize a k d-tree data structure, which is a non-parametric, data-driven binary search tree. Instead of using traditional mean and variance of each dimension for k d-tree construction, and Euclidean distance for k d-tree search, we use weighted variances and weighted Euclidean distances based on measured percentages of missing data. We have evaluated this approach through experiments on sensor data from a volcano dataset collected by a network of Crossbow motes, as well as experiments using sensor data from a highway traffic monitoring application. Our experimental
Batistatou, Evridiki; McNamee, Roseanne
2012-12-10
It is known that measurement error leads to bias in assessing exposure effects, which can however, be corrected if independent replicates are available. For expensive replicates, two-stage (2S) studies that produce data 'missing by design', may be preferred over a single-stage (1S) study, because in the second stage, measurement of replicates is restricted to a sample of first-stage subjects. Motivated by an occupational study on the acute effect of carbon black exposure on respiratory morbidity, we compare the performance of several bias-correction methods for both designs in a simulation study: an instrumental variable method (EVROS IV) based on grouping strategies, which had been recommended especially when measurement error is large, the regression calibration and the simulation extrapolation methods. For the 2S design, either the problem of 'missing' data was ignored or the 'missing' data were imputed using multiple imputations. Both in 1S and 2S designs, in the case of small or moderate measurement error, regression calibration was shown to be the preferred approach in terms of root mean square error. For 2S designs, regression calibration as implemented by Stata software is not recommended in contrast to our implementation of this method; the 'problematic' implementation of regression calibration although substantially improved with use of multiple imputations. The EVROS IV method, under a good/fairly good grouping, outperforms the regression calibration approach in both design scenarios when exposure mismeasurement is severe. Both in 1S and 2S designs with moderate or large measurement error, simulation extrapolation severely failed to correct for bias. Copyright © 2012 John Wiley & Sons, Ltd.
Directory of Open Access Journals (Sweden)
Boeschoten Laura
2017-12-01
Full Text Available Both registers and surveys can contain classification errors. These errors can be estimated by making use of a composite data set. We propose a new method based on latent class modelling to estimate the number of classification errors across several sources while taking into account impossible combinations with scores on other variables. Furthermore, the latent class model, by multiply imputing a new variable, enhances the quality of statistics based on the composite data set. The performance of this method is investigated by a simulation study, which shows that whether or not the method can be applied depends on the entropy R2 of the latent class model and the type of analysis a researcher is planning to do. Finally, the method is applied to public data from Statistics Netherlands.
Mansilha, C; Melo, A; Rebelo, H; Ferreira, I M P L V O; Pinho, O; Domingues, V; Pinho, C; Gameiro, P
2010-10-22
A multi-residue methodology based on a solid phase extraction followed by gas chromatography-tandem mass spectrometry was developed for trace analysis of 32 compounds in water matrices, including estrogens and several pesticides from different chemical families, some of them with endocrine disrupting properties. Matrix standard calibration solutions were prepared by adding known amounts of the analytes to a residue-free sample to compensate matrix-induced chromatographic response enhancement observed for certain pesticides. Validation was done mainly according to the International Conference on Harmonisation recommendations, as well as some European and American validation guidelines with specifications for pesticides analysis and/or GC-MS methodology. As the assumption of homoscedasticity was not met for analytical data, weighted least squares linear regression procedure was applied as a simple and effective way to counteract the greater influence of the greater concentrations on the fitted regression line, improving accuracy at the lower end of the calibration curve. The method was considered validated for 31 compounds after consistent evaluation of the key analytical parameters: specificity, linearity, limit of detection and quantification, range, precision, accuracy, extraction efficiency, stability and robustness. Copyright © 2010 Elsevier B.V. All rights reserved.
Machado, Fabiana Andrade; Nakamura, Fábio Yuzo; Moraes, Solange Marta Franzói De
2012-01-01
This study examined the influence of the regression model and initial intensity of an incremental test on the relationship between the lactate threshold estimated by the maximal-deviation method and the endurance performance. Sixteen non-competitive, recreational female runners performed a discontinuous incremental treadmill test. The initial speed was set at 7 km · h⁻¹, and increased every 3 min by 1 km · h⁻¹ with a 30-s rest between the stages used for earlobe capillary blood sample collection. Lactate-speed data were fitted by an exponential-plus-constant and a third-order polynomial equation. The lactate threshold was determined for both regression equations, using all the coordinates, excluding the first and excluding the first and second initial points. Mean speed of a 10-km road race was the performance index (3.04 ± 0.22 m · s⁻¹). The exponentially-derived lactate threshold had a higher correlation (0.98 ≤ r ≤ 0.99) and smaller standard error of estimate (SEE) (0.04 ≤ SEE ≤ 0.05 m · s⁻¹) with performance than the polynomially-derived equivalent (0.83 ≤ r ≤ 0.89; 0.10 ≤ SEE ≤ 0.13 m · s⁻¹). The exponential lactate threshold was greater than the polynomial equivalent (P performance index that is independent of the initial intensity of the incremental test and better than the polynomial equivalent.
Multiple imputation for estimating the risk of developing dementia and its impact on survival.
Yu, Binbing; Saczynski, Jane S; Launer, Lenore
2010-10-01
Dementia, Alzheimer's disease in particular, is one of the major causes of disability and decreased quality of life among the elderly and a leading obstacle to successful aging. Given the profound impact on public health, much research has focused on the age-specific risk of developing dementia and the impact on survival. Early work has discussed various methods of estimating age-specific incidence of dementia, among which the illness-death model is popular for modeling disease progression. In this article we use multiple imputation to fit multi-state models for survival data with interval censoring and left truncation. This approach allows semi-Markov models in which survival after dementia depends on onset age. Such models can be used to estimate the cumulative risk of developing dementia in the presence of the competing risk of dementia-free death. Simulations are carried out to examine the performance of the proposed method. Data from the Honolulu Asia Aging Study are analyzed to estimate the age-specific and cumulative risks of dementia and to examine the effect of major risk factors on dementia onset and death.
Yulia, M.; Suhandy, D.
2018-03-01
NIR spectra obtained from spectral data acquisition system contains both chemical information of samples as well as physical information of the samples, such as particle size and bulk density. Several methods have been established for developing calibration models that can compensate for sample physical information variations. One common approach is to include physical information variation in the calibration model both explicitly and implicitly. The objective of this study was to evaluate the feasibility of using explicit method to compensate the influence of different particle size of coffee powder in NIR calibration model performance. A number of 220 coffee powder samples with two different types of coffee (civet and non-civet) and two different particle sizes (212 and 500 µm) were prepared. Spectral data was acquired using NIR spectrometer equipped with an integrating sphere for diffuse reflectance measurement. A discrimination method based on PLS-DA was conducted and the influence of different particle size on the performance of PLS-DA was investigated. In explicit method, we add directly the particle size as predicted variable results in an X block containing only the NIR spectra and a Y block containing the particle size and type of coffee. The explicit inclusion of the particle size into the calibration model is expected to improve the accuracy of type of coffee determination. The result shows that using explicit method the quality of the developed calibration model for type of coffee determination is a little bit superior with coefficient of determination (R2) = 0.99 and root mean square error of cross-validation (RMSECV) = 0.041. The performance of the PLS2 calibration model for type of coffee determination with particle size compensation was quite good and able to predict the type of coffee in two different particle sizes with relatively high R2 pred values. The prediction also resulted in low bias and RMSEP values.
Steganalysis using logistic regression
Lubenko, Ivans; Ker, Andrew D.
2011-02-01
We advocate Logistic Regression (LR) as an alternative to the Support Vector Machine (SVM) classifiers commonly used in steganalysis. LR offers more information than traditional SVM methods - it estimates class probabilities as well as providing a simple classification - and can be adapted more easily and efficiently for multiclass problems. Like SVM, LR can be kernelised for nonlinear classification, and it shows comparable classification accuracy to SVM methods. This work is a case study, comparing accuracy and speed of SVM and LR classifiers in detection of LSB Matching and other related spatial-domain image steganography, through the state-of-art 686-dimensional SPAM feature set, in three image sets.
Santos-Concejero, Jordan; Tucker, Ross; Granados, Cristina; Irazusta, Jon; Bidaurrazaga-Letona, Iraia; Zabala-Lili, Jon; Gil, Susana María
2014-01-01
This study investigated the influence of the regression model and initial intensity during an incremental test on the relationship between the lactate threshold estimated by the maximal-deviation method and performance in elite-standard runners. Twenty-three well-trained runners completed a discontinuous incremental running test on a treadmill. Speed started at 9 km · h(-1) and increased by 1.5 km · h(-1) every 4 min until exhaustion, with a minute of recovery for blood collection. Lactate-speed data were fitted by exponential and polynomial models. The lactate threshold was determined for both models, using all the co-ordinates, excluding the first and excluding the first and second points. The exponential lactate threshold was greater than the polynomial equivalent in any co-ordinate condition (P performance and is independent of the initial intensity of the test.
Hwang, Jae Joon; Kim, Kee-Deog; Park, Hyok; Park, Chang Seo; Jeong, Ho-Gul
2014-01-01
Superimposition has been used as a method to evaluate the changes of orthodontic or orthopedic treatment in the dental field. With the introduction of cone beam CT (CBCT), evaluating 3 dimensional changes after treatment became possible by superimposition. 4 point plane orientation is one of the simplest ways to achieve superimposition of 3 dimensional images. To find factors influencing superimposition error of cephalometric landmarks by 4 point plane orientation method and to evaluate the reproducibility of cephalometric landmarks for analyzing superimposition error, 20 patients were analyzed who had normal skeletal and occlusal relationship and took CBCT for diagnosis of temporomandibular disorder. The nasion, sella turcica, basion and midpoint between the left and the right most posterior point of the lesser wing of sphenoidal bone were used to define a three-dimensional (3D) anatomical reference co-ordinate system. Another 15 reference cephalometric points were also determined three times in the same image. Reorientation error of each landmark could be explained substantially (23%) by linear regression model, which consists of 3 factors describing position of each landmark towards reference axes and locating error. 4 point plane orientation system may produce an amount of reorientation error that may vary according to the perpendicular distance between the landmark and the x-axis; the reorientation error also increases as the locating error and shift of reference axes viewed from each landmark increases. Therefore, in order to reduce the reorientation error, accuracy of all landmarks including the reference points is important. Construction of the regression model using reference points of greater precision is required for the clinical application of this model.
Huybrechts, Inge; Lioret, Sandrine; Mouratidou, Theodora; Gunter, Marc J; Manios, Yannis; Kersting, Mathilde; Gottrand, Frederic; Kafatos, Anthony; De Henauw, Stefaan; Cuenca-García, Magdalena; Widhalm, Kurt; Gonzales-Gross, Marcela; Molnar, Denes; Moreno, Luis A; McNaughton, Sarah A
2017-01-01
This study aims to examine repeatability of reduced rank regression (RRR) methods in calculating dietary patterns (DP) and cross-sectional associations with overweight (OW)/obesity across European and Australian samples of adolescents. Data from two cross-sectional surveys in Europe (2006/2007 Healthy Lifestyle in Europe by Nutrition in Adolescence study, including 1954 adolescents, 12-17 years) and Australia (2007 National Children's Nutrition and Physical Activity Survey, including 1498 adolescents, 12-16 years) were used. Dietary intake was measured using two non-consecutive, 24-h recalls. RRR was used to identify DP using dietary energy density, fibre density and percentage of energy intake from fat as the intermediate variables. Associations between DP scores and body mass/fat were examined using multivariable linear and logistic regression as appropriate, stratified by sex. The first DP extracted (labelled 'energy dense, high fat, low fibre') explained 47 and 31 % of the response variation in Australian and European adolescents, respectively. It was similar for European and Australian adolescents and characterised by higher consumption of biscuits/cakes, chocolate/confectionery, crisps/savoury snacks, sugar-sweetened beverages, and lower consumption of yogurt, high-fibre bread, vegetables and fresh fruit. DP scores were inversely associated with BMI z-scores in Australian adolescent boys and borderline inverse in European adolescent boys (so as with %BF). Similarly, a lower likelihood for OW in boys was observed with higher DP scores in both surveys. No such relationships were observed in adolescent girls. In conclusion, the DP identified in this cross-country study was comparable for European and Australian adolescents, demonstrating robustness of the RRR method in calculating DP among populations. However, longitudinal designs are more relevant when studying diet-obesity associations, to prevent reverse causality.
Seyedmahmoud, Rasoul
2014-04-07
This two-articles series presents an in-depth discussion of electrospun poly-l-lactide scaffolds for tissue engineering by means of statistical methodologies that can be used, in general, to gain a quantitative and systematic insight about effects and interactions between a handful of key scaffold properties (Ys) and a set of process parameters (Xs) in electrospinning. While Part-1 dealt with the DOE methods to unveil the interactions between Xs in determining the morphomechanical properties (ref. Y1-4), this Part-2 article continues and refocuses the discussion on the interdependence of scaffold properties investigated by standard regression methods. The discussion first explores the connection between mechanical properties (Y4) and morphological descriptors of the scaffolds (Y1-3) in 32 types of scaffolds, finding that the mean fiber diameter (Y1) plays a predominant role which is nonetheless and crucially modulated by the molecular weight (MW) of PLLA. The second part examines the biological performance (Y5) (i.e. the cell proliferation of seeded bone marrow-derived mesenchymal stromal cells) on a random subset of eight scaffolds vs. the mechanomorphological properties (Y1-4). In this case, the featured regression analysis on such an incomplete set was not conclusive, though, indirectly suggesting in quantitative terms that cell proliferation could not fully be explained as a function of considered mechanomorphological properties (Y1-4), but in the early stage seeding, and that a randomization effects occurs over time such that the differences in initial cell proliferation performance (at day 1) is smeared over time. The findings may be the cornerstone of a novel route to accrue sufficient understanding and establish design rules for scaffold biofunctional vs. architecture, mechanical properties, and process parameters.
Ali, M Sanni; Groenwold, Rolf H H; Belitser, Svetlana V; Souverein, Patrick C; Martín, Elisa; Gatto, Nicolle M; Huerta, Consuelo; Gardarsdottir, Helga; Roes, Kit C B; Hoes, Arno W; de Boer, Antonius; Klungel, Olaf H
2016-03-01
Observational studies including time-varying treatments are prone to confounding. We compared time-varying Cox regression analysis, propensity score (PS) methods, and marginal structural models (MSMs) in a study of antidepressant [selective serotonin reuptake inhibitors (SSRIs)] use and the risk of hip fracture. A cohort of patients with a first prescription for antidepressants (SSRI or tricyclic antidepressants) was extracted from the Dutch Mondriaan and Spanish Base de datos para la Investigación Farmacoepidemiológica en Atención Primaria (BIFAP) general practice databases for the period 2001-2009. The net (total) effect of SSRI versus no SSRI on the risk of hip fracture was estimated using time-varying Cox regression, stratification and covariate adjustment using the PS, and MSM. In MSM, censoring was accounted for by inverse probability of censoring weights. The crude hazard ratio (HR) of SSRI use versus no SSRI use on hip fracture was 1.75 (95%CI: 1.12, 2.72) in Mondriaan and 2.09 (1.89, 2.32) in BIFAP. After confounding adjustment using time-varying Cox regression, stratification, and covariate adjustment using the PS, HRs increased in Mondriaan [2.59 (1.63, 4.12), 2.64 (1.63, 4.25), and 2.82 (1.63, 4.25), respectively] and decreased in BIFAP [1.56 (1.40, 1.73), 1.54 (1.39, 1.71), and 1.61 (1.45, 1.78), respectively]. MSMs with stabilized weights yielded HR 2.15 (1.30, 3.55) in Mondriaan and 1.63 (1.28, 2.07) in BIFAP when accounting for censoring and 2.13 (1.32, 3.45) in Mondriaan and 1.66 (1.30, 2.12) in BIFAP without accounting for censoring. In this empirical study, differences between the different methods to control for time-dependent confounding were small. The observed differences in treatment effect estimates between the databases are likely attributable to different confounding information in the datasets, illustrating that adequate information on (time-varying) confounding is crucial to prevent bias. Copyright © 2016 John Wiley & Sons, Ltd.
PRIMAL: Fast and accurate pedigree-based imputation from sequence data in a founder population.
Directory of Open Access Journals (Sweden)
Oren E Livne
2015-03-01
Full Text Available Founder populations and large pedigrees offer many well-known advantages for genetic mapping studies, including cost-efficient study designs. Here, we describe PRIMAL (PedigRee IMputation ALgorithm, a fast and accurate pedigree-based phasing and imputation algorithm for founder populations. PRIMAL incorporates both existing and original ideas, such as a novel indexing strategy of Identity-By-Descent (IBD segments based on clique graphs. We were able to impute the genomes of 1,317 South Dakota Hutterites, who had genome-wide genotypes for ~300,000 common single nucleotide variants (SNVs, from 98 whole genome sequences. Using a combination of pedigree-based and LD-based imputation, we were able to assign 87% of genotypes with >99% accuracy over the full range of allele frequencies. Using the IBD cliques we were also able to infer the parental origin of 83% of alleles, and genotypes of deceased recent ancestors for whom no genotype information was available. This imputed data set will enable us to better study the relative contribution of rare and common variants on human phenotypes, as well as parental origin effect of disease risk alleles in >1,000 individuals at minimal cost.
Advanced statistics: linear regression, part I: simple linear regression.
Marill, Keith A
2004-01-01
Simple linear regression is a mathematical technique used to model the relationship between a single independent predictor variable and a single dependent outcome variable. In this, the first of a two-part series exploring concepts in linear regression analysis, the four fundamental assumptions and the mechanics of simple linear regression are reviewed. The most common technique used to derive the regression line, the method of least squares, is described. The reader will be acquainted with other important concepts in simple linear regression, including: variable transformations, dummy variables, relationship to inference testing, and leverage. Simplified clinical examples with small datasets and graphic models are used to illustrate the points. This will provide a foundation for the second article in this series: a discussion of multiple linear regression, in which there are multiple predictor variables.
International Nuclear Information System (INIS)
Yuan, Haibo; Liu, Xiaowei; Xiang, Maosheng; Huang, Yang; Zhang, Huihua; Chen, Bingqiu
2015-01-01
In this paper we propose a spectroscopy-based stellar color regression (SCR) method to perform accurate color calibration for modern imaging surveys, taking advantage of millions of stellar spectra now available. The method is straightforward, insensitive to systematic errors in the spectroscopically determined stellar atmospheric parameters, applicable to regions that are effectively covered by spectroscopic surveys, and capable of delivering an accuracy of a few millimagnitudes for color calibration. As an illustration, we have applied the method to the Sloan Digital Sky Survey (SDSS) Stripe 82 data. With a total number of 23,759 spectroscopically targeted stars, we have mapped out the small but strongly correlated color zero-point errors present in the photometric catalog of Stripe 82, and we improve the color calibration by a factor of two to three. Our study also reveals some small but significant magnitude dependence errors in the z band for some charge-coupled devices (CCDs). Such errors are likely to be present in all the SDSS photometric data. Our results are compared with those from a completely independent test based on the intrinsic colors of red galaxies presented by Ivezić et al. The comparison, as well as other tests, shows that the SCR method has achieved a color calibration internally consistent at a level of about 5 mmag in u – g, 3 mmag in g – r, and 2 mmag in r – i and i – z. Given the power of the SCR method, we discuss briefly the potential benefits by applying the method to existing, ongoing, and upcoming imaging surveys
Directory of Open Access Journals (Sweden)
JEMMAH A I
2018-01-01
Full Text Available Taounate region is known by a high density of mass movements which cause several human and economic losses. The goal of this paper is to assess the landslide susceptibility of Taounate using the Weight of Evidence method (WofE and the Logistic Regression method (LR. Seven conditioning factors were used in this study: lithology, fault, drainage, slope, elevation, exposure and land use. Over the years, this site and its surroundings have experienced repeated landslides. For this reason, landslide susceptibility mapping is mandatory for risk prevention and land-use management. In this study, we have focused on recent large-scale mass movements. Finally, the ROC curves were established to evaluate the degree of fit of the model and to choose the best landslide susceptibility zonation. A total mass movements location were detected; 50% were randomly selected as input data for the entire process using the Spatial Data Model (SDM and the remaining locations were used for validation purposes. The obtained WofE’s landslide susceptibility map shows that high to very high susceptibility zones contain 62% of the total of inventoried landslides, while the same zones contain only 47% of landslides in the map obtained by the LR method. This landslide susceptibility map obtained is a major contribution to various urban and regional development plans under the Taounate Region National Development Program.
Fungible weights in logistic regression.
Jones, Jeff A; Waller, Niels G
2016-06-01
In this article we develop methods for assessing parameter sensitivity in logistic regression models. To set the stage for this work, we first review Waller's (2008) equations for computing fungible weights in linear regression. Next, we describe 2 methods for computing fungible weights in logistic regression. To demonstrate the utility of these methods, we compute fungible logistic regression weights using data from the Centers for Disease Control and Prevention's (2010) Youth Risk Behavior Surveillance Survey, and we illustrate how these alternate weights can be used to evaluate parameter sensitivity. To make our work accessible to the research community, we provide R code (R Core Team, 2015) that will generate both kinds of fungible logistic regression weights. (PsycINFO Database Record (c) 2016 APA, all rights reserved).
Imputation-based analysis of association studies: candidate regions and quantitative traits.
Directory of Open Access Journals (Sweden)
Bertrand Servin
2007-07-01
Full Text Available We introduce a new framework for the analysis of association studies, designed to allow untyped variants to be more effectively and directly tested for association with a phenotype. The idea is to combine knowledge on patterns of correlation among SNPs (e.g., from the International HapMap project or resequencing data in a candidate region of interest with genotype data at tag SNPs collected on a phenotyped study sample, to estimate ("impute" unmeasured genotypes, and then assess association between the phenotype and these estimated genotypes. Compared with standard single-SNP tests, this approach results in increased power to detect association, even in cases in which the causal variant is typed, with the greatest gain occurring when multiple causal variants are present. It also provides more interpretable explanations for observed associations, including assessing, for each SNP, the strength of the evidence that it (rather than another correlated SNP is causal. Although we focus on association studies with quantitative phenotype and a relatively restricted region (e.g., a candidate gene, the framework is applicable and computationally practical for whole genome association studies. Methods described here are implemented in a software package, Bim-Bam, available from the Stephens Lab website http://stephenslab.uchicago.edu/software.html.
Testing discontinuities in nonparametric regression
Dai, Wenlin
2017-01-19
In nonparametric regression, it is often needed to detect whether there are jump discontinuities in the mean function. In this paper, we revisit the difference-based method in [13 H.-G. Müller and U. Stadtmüller, Discontinuous versus smooth regression, Ann. Stat. 27 (1999), pp. 299–337. doi: 10.1214/aos/1018031100
Testing discontinuities in nonparametric regression
Dai, Wenlin; Zhou, Yuejin; Tong, Tiejun
2017-01-01
In nonparametric regression, it is often needed to detect whether there are jump discontinuities in the mean function. In this paper, we revisit the difference-based method in [13 H.-G. Müller and U. Stadtmüller, Discontinuous versus smooth regression, Ann. Stat. 27 (1999), pp. 299–337. doi: 10.1214/aos/1018031100
Quantile regression theory and applications
Davino, Cristina; Vistocco, Domenico
2013-01-01
A guide to the implementation and interpretation of Quantile Regression models This book explores the theory and numerous applications of quantile regression, offering empirical data analysis as well as the software tools to implement the methods. The main focus of this book is to provide the reader with a comprehensivedescription of the main issues concerning quantile regression; these include basic modeling, geometrical interpretation, estimation and inference for quantile regression, as well as issues on validity of the model, diagnostic tools. Each methodological aspect is explored and
Johnson, Eric O; Hancock, Dana B; Levy, Joshua L; Gaddis, Nathan C; Saccone, Nancy L; Bierut, Laura J; Page, Grier P
2013-05-01
A great promise of publicly sharing genome-wide association data is the potential to create composite sets of controls. However, studies often use different genotyping arrays, and imputation to a common set of SNPs has shown substantial bias: a problem which has no broadly applicable solution. Based on the idea that using differing genotyped SNP sets as inputs creates differential imputation errors and thus bias in the composite set of controls, we examined the degree to which each of the following occurs: (1) imputation based on the union of genotyped SNPs (i.e., SNPs available on one or more arrays) results in bias, as evidenced by spurious associations (type 1 error) between imputed genotypes and arbitrarily assigned case/control status; (2) imputation based on the intersection of genotyped SNPs (i.e., SNPs available on all arrays) does not evidence such bias; and (3) imputation quality varies by the size of the intersection of genotyped SNP sets. Imputations were conducted in European Americans and African Americans with reference to HapMap phase II and III data. Imputation based on the union of genotyped SNPs across the Illumina 1M and 550v3 arrays showed spurious associations for 0.2 % of SNPs: ~2,000 false positives per million SNPs imputed. Biases remained problematic for very similar arrays (550v1 vs. 550v3) and were substantial for dissimilar arrays (Illumina 1M vs. Affymetrix 6.0). In all instances, imputing based on the intersection of genotyped SNPs (as few as 30 % of the total SNPs genotyped) eliminated such bias while still achieving good imputation quality.
Hilbe, Joseph M
2009-01-01
This book really does cover everything you ever wanted to know about logistic regression … with updates available on the author's website. Hilbe, a former national athletics champion, philosopher, and expert in astronomy, is a master at explaining statistical concepts and methods. Readers familiar with his other expository work will know what to expect-great clarity.The book provides considerable detail about all facets of logistic regression. No step of an argument is omitted so that the book will meet the needs of the reader who likes to see everything spelt out, while a person familiar with some of the topics has the option to skip "obvious" sections. The material has been thoroughly road-tested through classroom and web-based teaching. … The focus is on helping the reader to learn and understand logistic regression. The audience is not just students meeting the topic for the first time, but also experienced users. I believe the book really does meet the author's goal … .-Annette J. Dobson, Biometric...
Linear regression in astronomy. II
Feigelson, Eric D.; Babu, Gutti J.
1992-01-01
A wide variety of least-squares linear regression procedures used in observational astronomy, particularly investigations of the cosmic distance scale, are presented and discussed. The classes of linear models considered are (1) unweighted regression lines, with bootstrap and jackknife resampling; (2) regression solutions when measurement error, in one or both variables, dominates the scatter; (3) methods to apply a calibration line to new data; (4) truncated regression models, which apply to flux-limited data sets; and (5) censored regression models, which apply when nondetections are present. For the calibration problem we develop two new procedures: a formula for the intercept offset between two parallel data sets, which propagates slope errors from one regression to the other; and a generalization of the Working-Hotelling confidence bands to nonstandard least-squares lines. They can provide improved error analysis for Faber-Jackson, Tully-Fisher, and similar cosmic distance scale relations.
Time-adaptive quantile regression
DEFF Research Database (Denmark)
Møller, Jan Kloppenborg; Nielsen, Henrik Aalborg; Madsen, Henrik
2008-01-01
and an updating procedure are combined into a new algorithm for time-adaptive quantile regression, which generates new solutions on the basis of the old solution, leading to savings in computation time. The suggested algorithm is tested against a static quantile regression model on a data set with wind power......An algorithm for time-adaptive quantile regression is presented. The algorithm is based on the simplex algorithm, and the linear optimization formulation of the quantile regression problem is given. The observations have been split to allow a direct use of the simplex algorithm. The simplex method...... production, where the models combine splines and quantile regression. The comparison indicates superior performance for the time-adaptive quantile regression in all the performance parameters considered....
On multivariate imputation and forecasting of decadal wind speed missing data.
Wesonga, Ronald
2015-01-01
This paper demonstrates the application of multiple imputations by chained equations and time series forecasting of wind speed data. The study was motivated by the high prevalence of missing wind speed historic data. Findings based on the fully conditional specification under multiple imputations by chained equations, provided reliable wind speed missing data imputations. Further, the forecasting model shows, the smoothing parameter, alpha (0.014) close to zero, confirming that recent past observations are more suitable for use to forecast wind speeds. The maximum decadal wind speed for Entebbe International Airport was estimated to be 17.6 metres per second at a 0.05 level of significance with a bound on the error of estimation of 10.8 metres per second. The large bound on the error of estimations confirms the dynamic tendencies of wind speed at the airport under study.
SEPARATION PHENOMENA LOGISTIC REGRESSION
Directory of Open Access Journals (Sweden)
Ikaro Daniel de Carvalho Barreto
2014-03-01
Full Text Available This paper proposes an application of concepts about the maximum likelihood estimation of the binomial logistic regression model to the separation phenomena. It generates bias in the estimation and provides different interpretations of the estimates on the different statistical tests (Wald, Likelihood Ratio and Score and provides different estimates on the different iterative methods (Newton-Raphson and Fisher Score. It also presents an example that demonstrates the direct implications for the validation of the model and validation of variables, the implications for estimates of odds ratios and confidence intervals, generated from the Wald statistics. Furthermore, we present, briefly, the Firth correction to circumvent the phenomena of separation.
Directory of Open Access Journals (Sweden)
Yi Liang
2016-11-01
Full Text Available The power industry is the main battlefield of CO2 emission reduction, which plays an important role in the implementation and development of the low carbon economy. The forecasting of electricity demand can provide a scientific basis for the country to formulate a power industry development strategy and further promote the sustained, healthy and rapid development of the national economy. Under the goal of low-carbon economy, medium and long term electricity demand forecasting will have very important practical significance. In this paper, a new hybrid electricity demand model framework is characterized as follows: firstly, integration of grey relation degree (GRD with induced ordered weighted harmonic averaging operator (IOWHA to propose a new weight determination method of hybrid forecasting model on basis of forecasting accuracy as induced variables is presented; secondly, utilization of the proposed weight determination method to construct the optimal hybrid forecasting model based on extreme learning machine (ELM forecasting model and multiple regression (MR model; thirdly, three scenarios in line with the level of realization of various carbon emission targets and dynamic simulation of effect of low-carbon economy on future electricity demand are discussed. The resulting findings show that, the proposed model outperformed and concentrated some monomial forecasting models, especially in boosting the overall instability dramatically. In addition, the development of a low-carbon economy will increase the demand for electricity, and have an impact on the adjustment of the electricity demand structure.
International Nuclear Information System (INIS)
Leng Ling; Zhang Tianyi; Kleinman, Lawrence; Zhu Wei
2007-01-01
Regression analysis, especially the ordinary least squares method which assumes that errors are confined to the dependent variable, has seen a fair share of its applications in aerosol science. The ordinary least squares approach, however, could be problematic due to the fact that atmospheric data often does not lend itself to calling one variable independent and the other dependent. Errors often exist for both measurements. In this work, we examine two regression approaches available to accommodate this situation. They are orthogonal regression and geometric mean regression. Comparisons are made theoretically as well as numerically through an aerosol study examining whether the ratio of organic aerosol to CO would change with age
DEFF Research Database (Denmark)
Johansen, Søren
2008-01-01
The reduced rank regression model is a multivariate regression model with a coefficient matrix with reduced rank. The reduced rank regression algorithm is an estimation procedure, which estimates the reduced rank regression model. It is related to canonical correlations and involves calculating...
Puzo, Quirino; Qin, Ping; Mehlum, Lars
2016-03-11
Suicide mortality and the rates by specific methods in a population may change over time in response to concurrent changes in relevant factors in society. This study aimed to identify significant changing points in method-specific suicide mortality from 1969 to 2012 in Norway. Data on suicide mortality by specific methods and by sex and age were retrieved from the Norwegian Cause-of-Death Register. Long-term trends in age-standardized rates of suicide mortality were analyzed by using joinpoint regression analysis. The most frequently used suicide method in the total population was hanging, followed by poisoning and firearms. Men chose suicide by firearms more often than women, whereas poisoning and drowning were more frequently used by women. The joinpoint analysis revealed that the overall trend of suicide mortality significantly changed twice along the period of 1969 to 2012 for both sexes. The male age-standardized suicide rate increased by 3.1% per year until 1989, and decreased by 1.2% per year between 1994 and 2012. Among females the long-term suicide rate increased by 4.0% per year until 1988, decreased by 5.5% through 1995, and then stabilized. Both sexes experienced an upward trend for suicide by hanging during the 44-year observation period, with a particularly significant increase in 15-24 year old males. The most distinct change among men was seen for firearms after 1988 with a significant decrease through 2012 of around 5% per year. For women, significant reductions since 1985-88 were observed for suicide by drowning and poisoning. The present study demonstrates different time trends for different suicide methods with significant reductions in suicide by firearms, drowning and poisoning after the peak in the suicide rate in the late 1980s. Suicide by means of hanging continuously increased, but did not fully compensate for the reduced use of other methods. This lends some support for the effectiveness of method-specific suicide preventive measures
Analyzing the changing gender wage gap based on multiply imputed right censored wages
Gartner, Hermann; Rässler, Susanne
2005-01-01
"In order to analyze the gender wage gap with the German IAB-employment register we have to solve the problem of censored wages at the upper limit of the social security system. We treat this problem as a missing data problem. We regard the missingness mechanism as not missing at random (NMAR, according to Little and Rubin, 1987, 2002) as well as missing by design. The censored wages are multiply imputed by draws of a random variable from a truncated distribution. The multiple imputation is b...
Quantile Regression With Measurement Error
Wei, Ying; Carroll, Raymond J.
2009-01-01
. The finite sample performance of the proposed method is investigated in a simulation study, and compared to the standard regression calibration approach. Finally, we apply our methodology to part of the National Collaborative Perinatal Project growth data, a
Ochoa Gutierrez, L. H.; Vargas Jimenez, C. A.; Niño Vasquez, L. F.
2011-12-01
The "Sabana de Bogota" (Bogota Savannah) is the most important social and economical center of Colombia. Almost the third of population is concentrated in this region and generates about the 40% of Colombia's Internal Brute Product (IBP). According to this, the zone presents an elevated vulnerability in case that a high destructive seismic event occurs. Historical evidences show that high magnitude events took place in the past with a huge damage caused to the city and indicate that is probable that such events can occur in the next years. This is the reason why we are working in an early warning generation system, using the first few seconds of a seismic signal registered by three components and wide band seismometers. Such system can be implemented using Computational Intelligence tools, designed and calibrated to the particular Geological, Structural and environmental conditions present in the region. The methods developed are expected to work on real time, thus suitable software and electronic tools need to be developed. We used Support Vector Machines Regression (SVMR) methods trained and tested with historic seismic events registered by "EL ROSAL" Station, located near Bogotá, calculating descriptors or attributes as the input of the model, from the first 6 seconds of signal. With this algorithm, we obtained less than 10% of mean absolute error and correlation coefficients greater than 85% in hypocentral distance and Magnitude estimation. With this results we consider that we can improve the method trying to have better accuracy with less signal time and that this can be a very useful model to be implemented directly in the seismological stations to generate a fast characterization of the event, broadcasting not only raw signal but pre-processed information that can be very useful for accurate Early Warning Generation.
Wang, Chaolong; Zhan, Xiaowei; Liang, Liming; Abecasis, Gonçalo R.; Lin, Xihong
2015-01-01
Accurate estimation of individual ancestry is important in genetic association studies, especially when a large number of samples are collected from multiple sources. However, existing approaches developed for genome-wide SNP data do not work well with modest amounts of genetic data, such as in targeted sequencing or exome chip genotyping experiments. We propose a statistical framework to estimate individual ancestry in a principal component ancestry map generated by a reference set of individuals. This framework extends and improves upon our previous method for estimating ancestry using low-coverage sequence reads (LASER 1.0) to analyze either genotyping or sequencing data. In particular, we introduce a projection Procrustes analysis approach that uses high-dimensional principal components to estimate ancestry in a low-dimensional reference space. Using extensive simulations and empirical data examples, we show that our new method (LASER 2.0), combined with genotype imputation on the reference individuals, can substantially outperform LASER 1.0 in estimating fine-scale genetic ancestry. Specifically, LASER 2.0 can accurately estimate fine-scale ancestry within Europe using either exome chip genotypes or targeted sequencing data with off-target coverage as low as 0.05×. Under the framework of LASER 2.0, we can estimate individual ancestry in a shared reference space for samples assayed at different loci or by different techniques. Therefore, our ancestry estimation method will accelerate discovery in disease association studies not only by helping model ancestry within individual studies but also by facilitating combined analysis of genetic data from multiple sources. PMID:26027497
Aszyk, Justyna; Kot, Jacek; Tkachenko, Yurii; Woźniak, Michał; Bogucka-Kocka, Anna; Kot-Wasik, Agata
2017-04-15
A simple, fast, sensitive and accurate methodology based on a LLE followed by liquid chromatography-tandem mass spectrometry for simultaneous determination of four regioisomers (8-iso prostaglandin F 2α , 8-iso-15(R)-prostaglandin F 2α , 11β-prostaglandin F 2α , 15(R)-prostaglandin F 2α ) in routine analysis of human plasma samples was developed. Isoprostanes are stable products of arachidonic acid peroxidation and are regarded as the most reliable markers of oxidative stress in vivo. Validation of method was performed by evaluation of the key analytical parameters such as: matrix effect, analytical curve, trueness, precision, limits of detection and limits of quantification. As a homoscedasticity was not met for analytical data, weighted linear regression was applied in order to improve the accuracy at the lower end points of calibration curve. The detection limits (LODs) ranged from 1.0 to 2.1pg/mL. For plasma samples spiked with the isoprostanes at the level of 50pg/mL, intra-and interday repeatability ranged from 2.1 to 3.5% and 0.1 to 5.1%, respectively. The applicability of the proposed approach has been verified by monitoring of isoprostane isomers level in plasma samples collected from young patients (n=8) subjected to hyperbaric hyperoxia (100% oxygen at 280kPa(a) for 30min) in a multiplace hyperbaric chamber. Copyright © 2017 Elsevier B.V. All rights reserved.
Regression analysis with categorized regression calibrated exposure: some interesting findings
Directory of Open Access Journals (Sweden)
Hjartåker Anette
2006-07-01
Full Text Available Abstract Background Regression calibration as a method for handling measurement error is becoming increasingly well-known and used in epidemiologic research. However, the standard version of the method is not appropriate for exposure analyzed on a categorical (e.g. quintile scale, an approach commonly used in epidemiologic studies. A tempting solution could then be to use the predicted continuous exposure obtained through the regression calibration method and treat it as an approximation to the true exposure, that is, include the categorized calibrated exposure in the main regression analysis. Methods We use semi-analytical calculations and simulations to evaluate the performance of the proposed approach compared to the naive approach of not correcting for measurement error, in situations where analyses are performed on quintile scale and when incorporating the original scale into the categorical variables, respectively. We also present analyses of real data, containing measures of folate intake and depression, from the Norwegian Women and Cancer study (NOWAC. Results In cases where extra information is available through replicated measurements and not validation data, regression calibration does not maintain important qualities of the true exposure distribution, thus estimates of variance and percentiles can be severely biased. We show that the outlined approach maintains much, in some cases all, of the misclassification found in the observed exposure. For that reason, regression analysis with the corrected variable included on a categorical scale is still biased. In some cases the corrected estimates are analytically equal to those obtained by the naive approach. Regression calibration is however vastly superior to the naive method when applying the medians of each category in the analysis. Conclusion Regression calibration in its most well-known form is not appropriate for measurement error correction when the exposure is analyzed on a
Directory of Open Access Journals (Sweden)
Guanghao Sun
2016-11-01
Full Text Available Background and Objectives: Heart rate variability (HRV has been intensively studied as a promising biological marker of major depressive disorder (MDD. Our previous study confirmed that autonomic activity and reactivity in depression revealed by HRV during rest and mental task (MT conditions can be used as diagnostic measures and in clinical evaluation. In this study, logistic regression analysis (LRA was utilized for the classification and prediction of MDD based on HRV data obtained in an MT paradigm.Methods: Power spectral analysis of HRV on R-R intervals before, during, and after an MT (random number generation was performed in 44 drug-naïve patients with MDD and 47 healthy control subjects at Department of Psychiatry in Shizuoka Saiseikai General Hospital. Logit scores of LRA determined by HRV indices and heart rates discriminated patients with MDD from healthy subjects. The high frequency (HF component of HRV and the ratio of the low frequency (LF component to the HF component (LF/HF correspond to parasympathetic and sympathovagal balance, respectively.Results: The LRA achieved a sensitivity and specificity of 80.0% and 79.0%, respectively, at an optimum cutoff logit score (0.28. Misclassifications occurred only when the logit score was close to the cutoff score. Logit scores also correlated significantly with subjective self-rating depression scale scores (p < 0.05.Conclusion: HRV indices recorded during a mental task may be an objective tool for screening patients with MDD in psychiatric practice. The proposed method appears promising for not only objective and rapid MDD screening, but also evaluation of its severity.
Vastaranta, Mikko; Kankare, Ville; Holopainen, Markus; Yu, Xiaowei; Hyyppä, Juha; Hyyppä, Hannu
2012-01-01
The two main approaches to deriving forest variables from laser-scanning data are the statistical area-based approach (ABA) and individual tree detection (ITD). With ITD it is feasible to acquire single tree information, as in field measurements. Here, ITD was used for measuring training data for the ABA. In addition to automatic ITD (ITD auto), we tested a combination of ITD auto and visual interpretation (ITD visual). ITD visual had two stages: in the first, ITD auto was carried out and in the second, the results of the ITD auto were visually corrected by interpreting three-dimensional laser point clouds. The field data comprised 509 circular plots ( r = 10 m) that were divided equally for testing and training. ITD-derived forest variables were used for training the ABA and the accuracies of the k-most similar neighbor ( k-MSN) imputations were evaluated and compared with the ABA trained with traditional measurements. The root-mean-squared error (RMSE) in the mean volume was 24.8%, 25.9%, and 27.2% with the ABA trained with field measurements, ITD auto, and ITD visual, respectively. When ITD methods were applied in acquiring training data, the mean volume, basal area, and basal area-weighted mean diameter were underestimated in the ABA by 2.7-9.2%. This project constituted a pilot study for using ITD measurements as training data for the ABA. Further studies are needed to reduce the bias and to determine the accuracy obtained in imputation of species-specific variables. The method could be applied in areas with sparse road networks or when the costs of fieldwork must be minimized.
Directory of Open Access Journals (Sweden)
Jörgen Wallerman
2013-04-01
Full Text Available Individual tree crowns may be delineated from airborne laser scanning (ALS data by segmentation of surface models or by 3D analysis. Segmentation of surface models benefits from using a priori knowledge about the proportions of tree crowns, which has not yet been utilized for 3D analysis to any great extent. In this study, an existing surface segmentation method was used as a basis for a new tree model 3D clustering method applied to ALS returns in 104 circular field plots with 12 m radius in pine-dominated boreal forest (64°14'N, 19°50'E. For each cluster below the tallest canopy layer, a parabolic surface was fitted to model a tree crown. The tree model clustering identified more trees than segmentation of the surface model, especially smaller trees below the tallest canopy layer. Stem attributes were estimated with k-Most Similar Neighbours (k-MSN imputation of the clusters based on field-measured trees. The accuracy at plot level from the k-MSN imputation (stem density root mean square error or RMSE 32.7%; stem volume RMSE 28.3% was similar to the corresponding results from the surface model (stem density RMSE 33.6%; stem volume RMSE 26.1% with leave-one-out cross-validation for one field plot at a time. Three-dimensional analysis of ALS data should also be evaluated in multi-layered forests since it identified a larger number of small trees below the tallest canopy layer.
Korany, Mohamed A; Gazy, Azza A; Khamis, Essam F; Ragab, Marwa A A; Kamal, Miranda F
2018-03-26
This study outlines two robust regression approaches, namely least median of squares (LMS) and iteratively re-weighted least squares (IRLS) to investigate their application in instrument analysis of nutraceuticals (that is, fluorescence quenching of merbromin reagent upon lipoic acid addition). These robust regression methods were used to calculate calibration data from the fluorescence quenching reaction (∆F and F-ratio) under ideal or non-ideal linearity conditions. For each condition, data were treated using three regression fittings: Ordinary Least Squares (OLS), LMS and IRLS. Assessment of linearity, limits of detection (LOD) and quantitation (LOQ), accuracy and precision were carefully studied for each condition. LMS and IRLS regression line fittings showed significant improvement in correlation coefficients and all regression parameters for both methods and both conditions. In the ideal linearity condition, the intercept and slope changed insignificantly, but a dramatic change was observed for the non-ideal condition and linearity intercept. Under both linearity conditions, LOD and LOQ values after the robust regression line fitting of data were lower than those obtained before data treatment. The results obtained after statistical treatment indicated that the linearity ranges for drug determination could be expanded to lower limits of quantitation by enhancing the regression equation parameters after data treatment. Analysis results for lipoic acid in capsules, using both fluorimetric methods, treated by parametric OLS and after treatment by robust LMS and IRLS were compared for both linearity conditions. Copyright © 2018 John Wiley & Sons, Ltd.
Applying an efficient K-nearest neighbor search to forest attribute imputation
Andrew O. Finley; Ronald E. McRoberts; Alan R. Ek
2006-01-01
This paper explores the utility of an efficient nearest neighbor (NN) search algorithm for applications in multi-source kNN forest attribute imputation. The search algorithm reduces the number of distance calculations between a given target vector and each reference vector, thereby, decreasing the time needed to discover the NN subset. Results of five trials show gains...
Limitations in Using Multiple Imputation to Harmonize Individual Participant Data for Meta-Analysis.
Siddique, Juned; de Chavez, Peter J; Howe, George; Cruden, Gracelyn; Brown, C Hendricks
2018-02-01
Individual participant data (IPD) meta-analysis is a meta-analysis in which the individual-level data for each study are obtained and used for synthesis. A common challenge in IPD meta-analysis is when variables of interest are measured differently in different studies. The term harmonization has been coined to describe the procedure of placing variables on the same scale in order to permit pooling of data from a large number of studies. Using data from an IPD meta-analysis of 19 adolescent depression trials, we describe a multiple imputation approach for harmonizing 10 depression measures across the 19 trials by treating those depression measures that were not used in a study as missing data. We then apply diagnostics to address the fit of our imputation model. Even after reducing the scale of our application, we were still unable to produce accurate imputations of the missing values. We describe those features of the data that made it difficult to harmonize the depression measures and provide some guidelines for using multiple imputation for harmonization in IPD meta-analysis.
Mapping change of older forest with nearest-neighbor imputation and Landsat time-series
Janet L. Ohmann; Matthew J. Gregory; Heather M. Roberts; Warren B. Cohen; Robert E. Kennedy; Zhiqiang. Yang
2012-01-01
The Northwest Forest Plan (NWFP), which aims to conserve late-successional and old-growth forests (older forests) and associated species, established new policies on federal lands in the Pacific Northwest USA. As part of monitoring for the NWFP, we tested nearest-neighbor imputation for mapping change in older forest, defined by threshold values for forest attributes...
DEFF Research Database (Denmark)
Meseck, Kristin; Jankowska, Marta M; Schipperijn, Jasper
2016-01-01
The main purpose of the present study was to assess the impact of global positioning system (GPS) signal lapse on physical activity analyses, discover any existing associations between missing GPS data and environmental and demographics attributes, and to determine whether imputation is an accurate...
Directory of Open Access Journals (Sweden)
Kwangwoo Kim
Full Text Available The genetic association of HLA-DRB1 with rheumatoid arthritis (RA and systemic lupus erythematosus (SLE is well documented, but association with other HLA-DR beta genes (HLA-DRB3, HLA-DRB4 and HLA-DRB5 has not been thoroughly studied, despite their similar functions and chromosomal positions. We examined variants in all functional HLA-DR beta genes in RA and SLE patients and controls, down to the amino-acid level, to better understand disease association with the HLA-DR locus. To this end, we improved an existing HLA reference panel to impute variants in all protein-coding HLA-DR beta genes. Using the reference panel, HLA variants were inferred from high-density SNP data of 9,271 RA-control subjects and 5,342 SLE-control subjects. Disease association tests were performed by logistic regression and log-likelihood ratio tests. After imputation using the newly constructed HLA reference panel and statistical analysis, we observed that HLA-DRB1 variants better accounted for the association between MHC and susceptibility to RA and SLE than did the other three HLA-DRB variants. Moreover, there were no secondary effects in HLA-DRB3, HLA-DRB4, or HLA-DRB5 in RA or SLE. Of all the HLA-DR beta chain paralogs, those encoded by HLA-DRB1 solely or dominantly influence susceptibility to RA and SLE.
Directory of Open Access Journals (Sweden)
Busch Michael P
2007-12-01
Full Text Available Abstract Background Chronic hepatitis C virus infection is prevalent and often causes hepatic fibrosis, which can progress to cirrhosis and cause liver cancer or liver failure. Study of fibrosis progression often relies on imputing the time of infection, often as the reported age of first injection drug use. We sought to examine the accuracy of such imputation and implications for modeling factors that influence progression rates. Methods We analyzed cross-sectional data on hepatitis C antibody status and reported risk factor histories from two large studies, the Women's Interagency HIV Study and the Urban Health Study, using modern survival analysis methods for current status data to model past infection risk year by year. We compared fitted distributions of past infection risk to reported age of first injection drug use. Results Although injection drug use appeared to be a very strong risk factor, models for both studies showed that many subjects had considerable probability of having been infected substantially before or after their reported age of first injection drug use. Persons reporting younger age of first injection drug use were more likely to have been infected after, and persons reporting older age of first injection drug use were more likely to have been infected before. Conclusion In cross-sectional studies of fibrosis progression where date of HCV infection is estimated from risk factor histories, modern methods such as multiple imputation should be used to account for the substantial uncertainty about when infection occurred. The models presented here can provide the inputs needed by such methods. Using reported age of first injection drug use as the time of infection in studies of fibrosis progression is likely to produce a spuriously strong association of younger age of infection with slower rate of progression.
Harrell , Jr , Frank E
2015-01-01
This highly anticipated second edition features new chapters and sections, 225 new references, and comprehensive R software. In keeping with the previous edition, this book is about the art and science of data analysis and predictive modeling, which entails choosing and using multiple tools. Instead of presenting isolated techniques, this text emphasizes problem solving strategies that address the many issues arising when developing multivariable models using real data and not standard textbook examples. It includes imputation methods for dealing with missing data effectively, methods for fitting nonlinear relationships and for making the estimation of transformations a formal part of the modeling process, methods for dealing with "too many variables to analyze and not enough observations," and powerful model validation techniques based on the bootstrap. The reader will gain a keen understanding of predictive accuracy, and the harm of categorizing continuous predictors or outcomes. This text realistically...
Directory of Open Access Journals (Sweden)
Stanley Xu
2014-05-01
Full Text Available In studies that use electronic health record data, imputation of important data elements such as Glycated hemoglobin (A1c has become common. However, few studies have systematically examined the validity of various imputation strategies for missing A1c values. We derived a complete dataset using an incident diabetes population that has no missing values in A1c, fasting and random plasma glucose (FPG and RPG, age, and gender. We then created missing A1c values under two assumptions: missing completely at random (MCAR and missing at random (MAR. We then imputed A1c values, compared the imputed values to the true A1c values, and used these data to assess the impact of A1c on initiation of antihyperglycemic therapy. Under MCAR, imputation of A1c based on FPG 1 estimated a continuous A1c within ± 1.88% of the true A1c 68.3% of the time; 2 estimated a categorical A1c within ± one category from the true A1c about 50% of the time. Including RPG in imputation slightly improved the precision but did not improve the accuracy. Under MAR, including gender and age in addition to FPG improved the accuracy of imputed continuous A1c but not categorical A1c. Moreover, imputation of up to 33% of missing A1c values did not change the accuracy and precision and did not alter the impact of A1c on initiation of antihyperglycemic therapy. When using A1c values as a predictor variable, a simple imputation algorithm based only on age, sex, and fasting plasma glucose gave acceptable results.
DEFF Research Database (Denmark)
Christensen, Steen; Moore, C.; Doherty, J.
2006-01-01
accurate and required a few hundred model calls to be computed. (b) The linearized regression-based interval (Cooley, 2004) required just over a hundred model calls and also appeared to be nearly correct. (c) The calibration-constrained Monte-Carlo interval (Doherty, 2003) was found to be narrower than......For a synthetic case we computed three types of individual prediction intervals for the location of the aquifer entry point of a particle that moves through a heterogeneous aquifer and ends up in a pumping well. (a) The nonlinear regression-based interval (Cooley, 2004) was found to be nearly...... the regression-based intervals but required about half a million model calls. It is unclear whether or not this type of prediction interval is accurate....
Gaussian process regression analysis for functional data
Shi, Jian Qing
2011-01-01
Gaussian Process Regression Analysis for Functional Data presents nonparametric statistical methods for functional regression analysis, specifically the methods based on a Gaussian process prior in a functional space. The authors focus on problems involving functional response variables and mixed covariates of functional and scalar variables.Covering the basics of Gaussian process regression, the first several chapters discuss functional data analysis, theoretical aspects based on the asymptotic properties of Gaussian process regression models, and new methodological developments for high dime
Hoogerheide, L.F.; Kaashoek, J.F.; van Dijk, H.K.
2007-01-01
Likelihoods and posteriors of instrumental variable (IV) regression models with strong endogeneity and/or weak instruments may exhibit rather non-elliptical contours in the parameter space. This may seriously affect inference based on Bayesian credible sets. When approximating posterior
L.F. Hoogerheide (Lennart); J.F. Kaashoek (Johan); H.K. van Dijk (Herman)
2005-01-01
textabstractLikelihoods and posteriors of instrumental variable regression models with strong endogeneity and/or weak instruments may exhibit rather non-elliptical contours in the parameter space. This may seriously affect inference based on Bayesian credible sets. When approximating such contours
Understanding logistic regression analysis
Sperandei, Sandro
2014-01-01
Logistic regression is used to obtain odds ratio in the presence of more than one explanatory variable. The procedure is quite similar to multiple linear regression, with the exception that the response variable is binomial. The result is the impact of each variable on the odds ratio of the observed event of interest. The main advantage is to avoid confounding effects by analyzing the association of all variables together. In this article, we explain the logistic regression procedure using ex...
Introduction to regression graphics
Cook, R Dennis
2009-01-01
Covers the use of dynamic and interactive computer graphics in linear regression analysis, focusing on analytical graphics. Features new techniques like plot rotation. The authors have composed their own regression code, using Xlisp-Stat language called R-code, which is a nearly complete system for linear regression analysis and can be utilized as the main computer program in a linear regression course. The accompanying disks, for both Macintosh and Windows computers, contain the R-code and Xlisp-Stat. An Instructor's Manual presenting detailed solutions to all the problems in the book is ava
Imputing forest carbon stock estimates from inventory plots to a nationally continuous coverage
Directory of Open Access Journals (Sweden)
Wilson Barry Tyler
2013-01-01
Full Text Available Abstract The U.S. has been providing national-scale estimates of forest carbon (C stocks and stock change to meet United Nations Framework Convention on Climate Change (UNFCCC reporting requirements for years. Although these currently are provided as national estimates by pool and year to meet greenhouse gas monitoring requirements, there is growing need to disaggregate these estimates to finer scales to enable strategic forest management and monitoring activities focused on various ecosystem services such as C storage enhancement. Through application of a nearest-neighbor imputation approach, spatially extant estimates of forest C density were developed for the conterminous U.S. using the U.S.’s annual forest inventory. Results suggest that an existing forest inventory plot imputation approach can be readily modified to provide raster maps of C density across a range of pools (e.g., live tree to soil organic carbon and spatial scales (e.g., sub-county to biome. Comparisons among imputed maps indicate strong regional differences across C pools. The C density of pools closely related to detrital input (e.g., dead wood is often highest in forests suffering from recent mortality events such as those in the northern Rocky Mountains (e.g., beetle infestations. In contrast, live tree carbon density is often highest on the highest quality forest sites such as those found in the Pacific Northwest. Validation results suggest strong agreement between the estimates produced from the forest inventory plots and those from the imputed maps, particularly when the C pool is closely associated with the imputation model (e.g., aboveground live biomass and live tree basal area, with weaker agreement for detrital pools (e.g., standing dead trees. Forest inventory imputed plot maps provide an efficient and flexible approach to monitoring diverse C pools at national (e.g., UNFCCC and regional scales (e.g., Reducing Emissions from Deforestation and Forest
Wood, Andrew R; Perry, John R B; Tanaka, Toshiko; Hernandez, Dena G; Zheng, Hou-Feng; Melzer, David; Gibbs, J Raphael; Nalls, Michael A; Weedon, Michael N; Spector, Tim D; Richards, J Brent; Bandinelli, Stefania; Ferrucci, Luigi; Singleton, Andrew B; Frayling, Timothy M
2013-01-01
Genome-wide association (GWA) studies have been limited by the reliance on common variants present on microarrays or imputable from the HapMap Project data. More recently, the completion of the 1000 Genomes Project has provided variant and haplotype information for several million variants derived from sequencing over 1,000 individuals. To help understand the extent to which more variants (including low frequency (1% ≤ MAF 1000 Genomes imputation, respectively, and 9 and 11 that reached a stricter, likely conservative, threshold of P1000 Genomes genotype data modestly improved the strength of known associations. Of 20 associations detected at P1000 Genomes imputed data and one was nominally more strongly associated in HapMap imputed data. We also detected an association between a low frequency variant and phenotype that was previously missed by HapMap based imputation approaches. An association between rs112635299 and alpha-1 globulin near the SERPINA gene represented the known association between rs28929474 (MAF = 0.007) and alpha1-antitrypsin that predisposes to emphysema (P = 2.5×10(-12)). Our data provide important proof of principle that 1000 Genomes imputation will detect novel, low frequency-large effect associations.
DEFF Research Database (Denmark)
Dassonneville, R; Brøndum, Rasmus Froberg; Druet, T
2011-01-01
The purpose of this study was to investigate the imputation error and loss of reliability of direct genomic values (DGV) or genomically enhanced breeding values (GEBV) when using genotypes imputed from a 3,000-marker single nucleotide polymorphism (SNP) panel to a 50,000-marker SNP panel. Data...... of missing markers and prediction of breeding values were performed using 2 different reference populations in each country: either a national reference population or a combined EuroGenomics reference population. Validation for accuracy of imputation and genomic prediction was done based on national test...... with a national reference data set gave an absolute loss of 0.05 in mean reliability of GEBV in the French study, whereas a loss of 0.03 was obtained for reliability of DGV in the Nordic study. When genotypes were imputed using the EuroGenomics reference, a loss of 0.02 in mean reliability of GEBV was detected...
Moiseeva, A.; Jessurun, A.J.; Timmermans, H.J.P.; Stopher, P.
2016-01-01
Anastasia Moiseeva, Joran Jessurun and Harry Timmermans (2010), ‘Semiautomatic Imputation of Activity Travel Diaries: Use of Global Positioning System Traces, Prompted Recall, and Context-Sensitive Learning Algorithms’, Transportation Research Record: Journal of the Transportation Research Board,
Directory of Open Access Journals (Sweden)
Danai Jattawa
2016-04-01
Full Text Available The objective of this study was to investigate the accuracy of imputation from low density (LDC to moderate density SNP chips (MDC in a Thai Holstein-Other multibreed dairy cattle population. Dairy cattle with complete pedigree information (n = 1,244 from 145 dairy farms were genotyped with GeneSeek GGP20K (n = 570, GGP26K (n = 540 and GGP80K (n = 134 chips. After checking for single nucleotide polymorphism (SNP quality, 17,779 SNP markers in common between the GGP20K, GGP26K, and GGP80K were used to represent MDC. Animals were divided into two groups, a reference group (n = 912 and a test group (n = 332. The SNP markers chosen for the test group were those located in positions corresponding to GeneSeek GGP9K (n = 7,652. The LDC to MDC genotype imputation was carried out using three different software packages, namely Beagle 3.3 (population-based algorithm, FImpute 2.2 (combined family- and population-based algorithms and Findhap 4 (combined family- and population-based algorithms. Imputation accuracies within and across chromosomes were calculated as ratios of correctly imputed SNP markers to overall imputed SNP markers. Imputation accuracy for the three software packages ranged from 76.79% to 93.94%. FImpute had higher imputation accuracy (93.94% than Findhap (84.64% and Beagle (76.79%. Imputation accuracies were similar and consistent across chromosomes for FImpute, but not for Findhap and Beagle. Most chromosomes that showed either high (73% or low (80% imputation accuracies were the same chromosomes that had above and below average linkage disequilibrium (LD; defined here as the correlation between pairs of adjacent SNP within chromosomes less than or equal to 1 Mb apart. Results indicated that FImpute was more suitable than Findhap and Beagle for genotype imputation in this Thai multibreed population. Perhaps additional increments in imputation accuracy could be achieved by increasing the completeness of pedigree information.
Peng, Ying; Li, Su-Ning; Pei, Xuexue; Hao, Kun
2018-03-01
Amultivariate regression statisticstrategy was developed to clarify multi-components content-effect correlation ofpanaxginseng saponins extract and predict the pharmacological effect by components content. In example 1, firstly, we compared pharmacological effects between panax ginseng saponins extract and individual saponin combinations. Secondly, we examined the anti-platelet aggregation effect in seven different saponin combinations of ginsenoside Rb1, Rg1, Rh, Rd, Ra3 and notoginsenoside R1. Finally, the correlation between anti-platelet aggregation and the content of multiple components was analyzed by a partial least squares algorithm. In example 2, firstly, 18 common peaks were identified in ten different batches of panax ginseng saponins extracts from different origins. Then, we investigated the anti-myocardial ischemia reperfusion injury effects of the ten different panax ginseng saponins extracts. Finally, the correlation between the fingerprints and the cardioprotective effects was analyzed by a partial least squares algorithm. Both in example 1 and 2, the relationship between the components content and pharmacological effect was modeled well by the partial least squares regression equations. Importantly, the predicted effect curve was close to the observed data of dot marked on the partial least squares regression model. This study has given evidences that themulti-component content is a promising information for predicting the pharmacological effects of traditional Chinese medicine.
Directory of Open Access Journals (Sweden)
Ying Peng
2018-03-01
Full Text Available Amultivariate regression statisticstrategy was developed to clarify multi-components content-effect correlation ofpanaxginseng saponins extract and predict the pharmacological effect by components content. In example 1, firstly, we compared pharmacological effects between panax ginseng saponins extract and individual saponin combinations. Secondly, we examined the anti-platelet aggregation effect in seven different saponin combinations of ginsenoside Rb1, Rg1, Rh, Rd, Ra3 and notoginsenoside R1. Finally, the correlation between anti-platelet aggregation and the content of multiple components was analyzed by a partial least squares algorithm. In example 2, firstly, 18 common peaks were identified in ten different batches of panax ginseng saponins extracts from different origins. Then, we investigated the anti-myocardial ischemia reperfusion injury effects of the ten different panax ginseng saponins extracts. Finally, the correlation between the fingerprints and the cardioprotective effects was analyzed by a partial least squares algorithm. Both in example 1 and 2, the relationship between the components content and pharmacological effect was modeled well by the partial least squares regression equations. Importantly, the predicted effect curve was close to the observed data of dot marked on the partial least squares regression model. This study has given evidences that themulti-component content is a promising information for predicting the pharmacological effects of traditional Chinese medicine.
DEFF Research Database (Denmark)
Andersen, Andreas; Rieckmann, Andreas
2016-01-01
In this article, we illustrate how to use mi impute chained with intreg to fit an analysis of covariance analysis of censored and nondetectable immunological concentrations measured in a randomized pretest–posttest design.......In this article, we illustrate how to use mi impute chained with intreg to fit an analysis of covariance analysis of censored and nondetectable immunological concentrations measured in a randomized pretest–posttest design....
Wang, Tianyu; Nabavi, Sheida
2018-04-24
Differential gene expression analysis is one of the significant efforts in single cell RNA sequencing (scRNAseq) analysis to discover the specific changes in expression levels of individual cell types. Since scRNAseq exhibits multimodality, large amounts of zero counts, and sparsity, it is different from the traditional bulk RNA sequencing (RNAseq) data. The new challenges of scRNAseq data promote the development of new methods for identifying differentially expressed (DE) genes. In this study, we proposed a new method, SigEMD, that combines a data imputation approach, a logistic regression model and a nonparametric method based on the Earth Mover's Distance, to precisely and efficiently identify DE genes in scRNAseq data. The regression model and data imputation are used to reduce the impact of large amounts of zero counts, and the nonparametric method is used to improve the sensitivity of detecting DE genes from multimodal scRNAseq data. By additionally employing gene interaction network information to adjust the final states of DE genes, we further reduce the false positives of calling DE genes. We used simulated datasets and real datasets to evaluate the detection accuracy of the proposed method and to compare its performance with those of other differential expression analysis methods. Results indicate that the proposed method has an overall powerful performance in terms of precision in detection, sensitivity, and specificity. Copyright © 2018 Elsevier Inc. All rights reserved.
Principal component regression analysis with SPSS.
Liu, R X; Kuang, J; Gong, Q; Hou, X L
2003-06-01
The paper introduces all indices of multicollinearity diagnoses, the basic principle of principal component regression and determination of 'best' equation method. The paper uses an example to describe how to do principal component regression analysis with SPSS 10.0: including all calculating processes of the principal component regression and all operations of linear regression, factor analysis, descriptives, compute variable and bivariate correlations procedures in SPSS 10.0. The principal component regression analysis can be used to overcome disturbance of the multicollinearity. The simplified, speeded up and accurate statistical effect is reached through the principal component regression analysis with SPSS.
Imputing historical statistics, soils information, and other land-use data to crop area
Perry, C. R., Jr.; Willis, R. W.; Lautenschlager, L.
1982-01-01
In foreign crop condition monitoring, satellite acquired imagery is routinely used. To facilitate interpretation of this imagery, it is advantageous to have estimates of the crop types and their extent for small area units, i.e., grid cells on a map represent, at 60 deg latitude, an area nominally 25 by 25 nautical miles in size. The feasibility of imputing historical crop statistics, soils information, and other ancillary data to crop area for a province in Argentina is studied.
Kim, Kwangwoo; Bang, So-Young; Lee, Hye-Soon; Bae, Sang-Cheol
2014-01-01
Genetic variations of human leukocyte antigen (HLA) genes within the major histocompatibility complex (MHC) locus are strongly associated with disease susceptibility and prognosis for many diseases, including many autoimmune diseases. In this study, we developed a Korean HLA reference panel for imputing classical alleles and amino acid residues of several HLA genes. An HLA reference panel has potential for use in identifying and fine-mapping disease associations with the MHC locus in East Asian populations, including Koreans. A total of 413 unrelated Korean subjects were analyzed for single nucleotide polymorphisms (SNPs) at the MHC locus and six HLA genes, including HLA-A, -B, -C, -DRB1, -DPB1, and -DQB1. The HLA reference panel was constructed by phasing the 5,858 MHC SNPs, 233 classical HLA alleles, and 1,387 amino acid residue markers from 1,025 amino acid positions as binary variables. The imputation accuracy of the HLA reference panel was assessed by measuring concordance rates between imputed and genotyped alleles of the HLA genes from a subset of the study subjects and East Asian HapMap individuals. Average concordance rates were 95.6% and 91.1% at 2-digit and 4-digit allele resolutions, respectively. The imputation accuracy was minimally affected by SNP density of a test dataset for imputation. In conclusion, the Korean HLA reference panel we developed was highly suitable for imputing HLA alleles and amino acids from MHC SNPs in East Asians, including Koreans.
Directory of Open Access Journals (Sweden)
Kwangwoo Kim
Full Text Available Genetic variations of human leukocyte antigen (HLA genes within the major histocompatibility complex (MHC locus are strongly associated with disease susceptibility and prognosis for many diseases, including many autoimmune diseases. In this study, we developed a Korean HLA reference panel for imputing classical alleles and amino acid residues of several HLA genes. An HLA reference panel has potential for use in identifying and fine-mapping disease associations with the MHC locus in East Asian populations, including Koreans. A total of 413 unrelated Korean subjects were analyzed for single nucleotide polymorphisms (SNPs at the MHC locus and six HLA genes, including HLA-A, -B, -C, -DRB1, -DPB1, and -DQB1. The HLA reference panel was constructed by phasing the 5,858 MHC SNPs, 233 classical HLA alleles, and 1,387 amino acid residue markers from 1,025 amino acid positions as binary variables. The imputation accuracy of the HLA reference panel was assessed by measuring concordance rates between imputed and genotyped alleles of the HLA genes from a subset of the study subjects and East Asian HapMap individuals. Average concordance rates were 95.6% and 91.1% at 2-digit and 4-digit allele resolutions, respectively. The imputation accuracy was minimally affected by SNP density of a test dataset for imputation. In conclusion, the Korean HLA reference panel we developed was highly suitable for imputing HLA alleles and amino acids from MHC SNPs in East Asians, including Koreans.
Regression Analysis by Example. 5th Edition
Chatterjee, Samprit; Hadi, Ali S.
2012-01-01
Regression analysis is a conceptually simple method for investigating relationships among variables. Carrying out a successful application of regression analysis, however, requires a balance of theoretical results, empirical rules, and subjective judgment. "Regression Analysis by Example, Fifth Edition" has been expanded and thoroughly…
Design of a bovine low-density SNP array optimized for imputation.
Directory of Open Access Journals (Sweden)
Didier Boichard
Full Text Available The Illumina BovineLD BeadChip was designed to support imputation to higher density genotypes in dairy and beef breeds by including single-nucleotide polymorphisms (SNPs that had a high minor allele frequency as well as uniform spacing across the genome except at the ends of the chromosome where densities were increased. The chip also includes SNPs on the Y chromosome and mitochondrial DNA loci that are useful for determining subspecies classification and certain paternal and maternal breed lineages. The total number of SNPs was 6,909. Accuracy of imputation to Illumina BovineSNP50 genotypes using the BovineLD chip was over 97% for most dairy and beef populations. The BovineLD imputations were about 3 percentage points more accurate than those from the Illumina GoldenGate Bovine3K BeadChip across multiple populations. The improvement was greatest when neither parent was genotyped. The minor allele frequencies were similar across taurine beef and dairy breeds as was the proportion of SNPs that were polymorphic. The new BovineLD chip should facilitate low-cost genomic selection in taurine beef and dairy cattle.
Understanding logistic regression analysis.
Sperandei, Sandro
2014-01-01
Logistic regression is used to obtain odds ratio in the presence of more than one explanatory variable. The procedure is quite similar to multiple linear regression, with the exception that the response variable is binomial. The result is the impact of each variable on the odds ratio of the observed event of interest. The main advantage is to avoid confounding effects by analyzing the association of all variables together. In this article, we explain the logistic regression procedure using examples to make it as simple as possible. After definition of the technique, the basic interpretation of the results is highlighted and then some special issues are discussed.
Weisberg, Sanford
2013-01-01
Praise for the Third Edition ""...this is an excellent book which could easily be used as a course text...""-International Statistical Institute The Fourth Edition of Applied Linear Regression provides a thorough update of the basic theory and methodology of linear regression modeling. Demonstrating the practical applications of linear regression analysis techniques, the Fourth Edition uses interesting, real-world exercises and examples. Stressing central concepts such as model building, understanding parameters, assessing fit and reliability, and drawing conclusions, the new edition illus
Hosmer, David W; Sturdivant, Rodney X
2013-01-01
A new edition of the definitive guide to logistic regression modeling for health science and other applications This thoroughly expanded Third Edition provides an easily accessible introduction to the logistic regression (LR) model and highlights the power of this model by examining the relationship between a dichotomous outcome and a set of covariables. Applied Logistic Regression, Third Edition emphasizes applications in the health sciences and handpicks topics that best suit the use of modern statistical software. The book provides readers with state-of-
Survival analysis II: Cox regression
Stel, Vianda S.; Dekker, Friedo W.; Tripepi, Giovanni; Zoccali, Carmine; Jager, Kitty J.
2011-01-01
In contrast to the Kaplan-Meier method, Cox proportional hazards regression can provide an effect estimate by quantifying the difference in survival between patient groups and can adjust for confounding effects of other variables. The purpose of this article is to explain the basic concepts of the
Directory of Open Access Journals (Sweden)
Pesch, Beate
2013-03-01
Full Text Available [english] In some applications statisticians are confronted with values which are reported to be below a limit of detection or quantitation. These left-censored variables are a challenge in the statistical analysis. In a simulation study, we compare different methods to deal with this type of data in statistical applications. These include measures of location, dispersion, association, and statistical modeling. Our simulation study showed that the multiple imputation approach and the Tobit regression lead to unbiased estimates, whereas the naïve methods including simple substitution of non-detects lead to unreliable estimates. We illustrate the application of the multiple imputation approach and the Tobit regression with an example from occupational epidemiology. [german] In der statistischen Praxis treten immer wieder Variablen mit Werten unterhalb einer Bestimmungs- oder Nachweisgrenze auf. Diese sind linkszensiert und stellen daher eine Herausforderung für die statistische Analyse dar. Im Rahmen einer Simulationsstudie vergleichen wir Schätzmethoden zur Berechnung von Lage- und Streuungmaßen, Korrelationen und Regressionsparametern bei diesen Variablen. Unsere Ergebnisse zeigen, dass die multiple Imputationsmethode und die Tobit Regression zu unverzerrten Schätzungen führen. Naive Methoden, einschließlich der einfachen Substitution von zensierten Beobachtungen, ergeben hingegen unzuverlässige Schätzungen. Wir illustrieren die Anwendung der multiplen Imputationsmethode und der Tobit Regression anhand eines Beispiels aus der Epidemiologie der Arbeitswelt.
Schiewe, M C; Fitz, T A; Brown, J L; Stuart, L D; Wildt, D E
1991-09-01
Ewes were treated with exogenous follicle-stimulating hormone (FSH) and oestrus was synchronized using either a dual prostaglandin F-2 alpha (PGF-2 alpha) injection regimen or pessaries impregnated with medroxy progesterone acetate (MAP). Natural cycling ewes served as controls. After oestrus or AI (Day 0), corpora lutea (CL) were enucleated surgically from the left and right ovaries on Days 3 and 6, respectively. The incidence of premature luteolysis was related (P less than 0.05) to PGF-2 alpha treatment and occurred in 7 of 8 ewes compared with 0 of 4 controls and 1 of 8 MAP-exposed females. Sheep with regressing CL had lower circulating and intraluteal progesterone concentrations and fewer total and small dissociated luteal cells on Day 3 than gonadotrophin-treated counterparts with normal CL. Progesterone concentration in the serum and luteal tissue was higher (P less than 0.05) in gonadotrophin-treated ewes with normal CL than in the controls; but luteinizing hormone (LH) receptors/cell were not different on Days 3 and 6. There were no apparent differences in the temporal patterns of circulating oestradiol-17 beta, FSH and LH. High progesterone in gonadotrophin-treated ewes with normal CL coincided with an increase in total luteal mass and numbers of cells, which were primarily reflected in more small luteal cells than in control ewes. Gonadotrophin-treated ewes with regressing CL on Day 3 tended (P less than 0.10) to have fewer small luteal cells and fewer (P less than 0.05) low-affinity PGF-2 alpha binding sites than sheep with normal CL. By Day 6, luteal integrity and cell viability was absent in ewes with prematurely regressed CL. These data demonstrate that (i) the incidence of premature luteal regression is highly correlated with the use of PGF-2 alpha; (ii) this abnormal luteal tissue is functionally competent for 2-3 days after ovulation, but deteriorates rapidly thereafter and (iii) luteal-dysfunctioning ewes experience a reduction in numbers of
A web-based approach to data imputation
Li, Zhixu; Sharaf, Mohamed Abdel Fattah; Sitbon, Laurianne; Sadiq, Shazia Wasim; Indulska, Marta; Zhou, Xiaofang
2013-01-01
principle. Moreover, WebPut extends effective Information Extraction (IE) methods for the purpose of formulating web search queries that are capable of effectively retrieving missing values with high accuracy. WebPut employs a confidence-based scheme
Methods library of embedded R functions at Statistics Norway
Directory of Open Access Journals (Sweden)
Øyvind Langsrud
2017-11-01
Full Text Available Statistics Norway is modernising the production processes. An important element in this work is a library of functions for statistical computations. In principle, the functions in such a methods library can be programmed in several languages. A modernised production environment demand that these functions can be reused for different statistics products, and that they are embedded within a common IT system. The embedding should be done in such a way that the users of the methods do not need to know the underlying programming language. As a proof of concept, Statistics Norway soon has established a methods library offering a limited number of methods for macro-editing, imputation and confidentiality. This is done within an area of municipal statistics with R as the only programming language. This paper presents the details and experiences from this work. The problem of fitting real word applications to simple and strict standards is discussed and exemplified by the development of solutions to regression imputation and table suppression.
Correlation and simple linear regression.
Zou, Kelly H; Tuncali, Kemal; Silverman, Stuart G
2003-06-01
In this tutorial article, the concepts of correlation and regression are reviewed and demonstrated. The authors review and compare two correlation coefficients, the Pearson correlation coefficient and the Spearman rho, for measuring linear and nonlinear relationships between two continuous variables. In the case of measuring the linear relationship between a predictor and an outcome variable, simple linear regression analysis is conducted. These statistical concepts are illustrated by using a data set from published literature to assess a computed tomography-guided interventional technique. These statistical methods are important for exploring the relationships between variables and can be applied to many radiologic studies.
Regression filter for signal resolution
International Nuclear Information System (INIS)
Matthes, W.
1975-01-01
The problem considered is that of resolving a measured pulse height spectrum of a material mixture, e.g. gamma ray spectrum, Raman spectrum, into a weighed sum of the spectra of the individual constituents. The model on which the analytical formulation is based is described. The problem reduces to that of a multiple linear regression. A stepwise linear regression procedure was constructed. The efficiency of this method was then tested by transforming the procedure in a computer programme which was used to unfold test spectra obtained by mixing some spectra, from a library of arbitrary chosen spectra, and adding a noise component. (U.K.)
Logistic regression for dichotomized counts.
Preisser, John S; Das, Kalyan; Benecha, Habtamu; Stamm, John W
2016-12-01
Sometimes there is interest in a dichotomized outcome indicating whether a count variable is positive or zero. Under this scenario, the application of ordinary logistic regression may result in efficiency loss, which is quantifiable under an assumed model for the counts. In such situations, a shared-parameter hurdle model is investigated for more efficient estimation of regression parameters relating to overall effects of covariates on the dichotomous outcome, while handling count data with many zeroes. One model part provides a logistic regression containing marginal log odds ratio effects of primary int