WorldWideScience

Sample records for imputing missing values

  1. Missing value imputation for epistatic MAPs

    LENUS (Irish Health Repository)

    Ryan, Colm

    2010-04-20

    Abstract Background Epistatic miniarray profiling (E-MAPs) is a high-throughput approach capable of quantifying aggravating or alleviating genetic interactions between gene pairs. The datasets resulting from E-MAP experiments typically take the form of a symmetric pairwise matrix of interaction scores. These datasets have a significant number of missing values - up to 35% - that can reduce the effectiveness of some data analysis techniques and prevent the use of others. An effective method for imputing interactions would therefore increase the types of possible analysis, as well as increase the potential to identify novel functional interactions between gene pairs. Several methods have been developed to handle missing values in microarray data, but it is unclear how applicable these methods are to E-MAP data because of their pairwise nature and the significantly larger number of missing values. Here we evaluate four alternative imputation strategies, three local (Nearest neighbor-based) and one global (PCA-based), that have been modified to work with symmetric pairwise data. Results We identify different categories for the missing data based on their underlying cause, and show that values from the largest category can be imputed effectively. We compare local and global imputation approaches across a variety of distinct E-MAP datasets, showing that both are competitive and preferable to filling in with zeros. In addition we show that these methods are effective in an E-MAP from a different species, suggesting that pairwise imputation techniques will be increasingly useful as analogous epistasis mapping techniques are developed in different species. We show that strongly alleviating interactions are significantly more difficult to predict than strongly aggravating interactions. Finally we show that imputed interactions, generated using nearest neighbor methods, are enriched for annotations in the same manner as measured interactions. Therefore our method potentially

  2. Missing value imputation: with application to handwriting data

    Science.gov (United States)

    Xu, Zhen; Srihari, Sargur N.

    2015-01-01

    Missing values make pattern analysis difficult, particularly with limited available data. In longitudinal research, missing values accumulate, thereby aggravating the problem. Here we consider how to deal with temporal data with missing values in handwriting analysis. In the task of studying development of individuality of handwriting, we encountered the fact that feature values are missing for several individuals at several time instances. Six algorithms, i.e., random imputation, mean imputation, most likely independent value imputation, and three methods based on Bayesian network (static Bayesian network, parameter EM, and structural EM), are compared with children's handwriting data. We evaluate the accuracy and robustness of the algorithms under different ratios of missing data and missing values, and useful conclusions are given. Specifically, static Bayesian network is used for our data which contain around 5% missing data to provide adequate accuracy and low computational cost.

  3. Collateral missing value imputation: a new robust missing value estimation algorithm for microarray data.

    Science.gov (United States)

    Sehgal, Muhammad Shoaib B; Gondal, Iqbal; Dooley, Laurence S

    2005-05-15

    Microarray data are used in a range of application areas in biology, although often it contains considerable numbers of missing values. These missing values can significantly affect subsequent statistical analysis and machine learning algorithms so there is a strong motivation to estimate these values as accurately as possible before using these algorithms. While many imputation algorithms have been proposed, more robust techniques need to be developed so that further analysis of biological data can be accurately undertaken. In this paper, an innovative missing value imputation algorithm called collateral missing value estimation (CMVE) is presented which uses multiple covariance-based imputation matrices for the final prediction of missing values. The matrices are computed and optimized using least square regression and linear programming methods. The new CMVE algorithm has been compared with existing estimation techniques including Bayesian principal component analysis imputation (BPCA), least square impute (LSImpute) and K-nearest neighbour (KNN). All these methods were rigorously tested to estimate missing values in three separate non-time series (ovarian cancer based) and one time series (yeast sporulation) dataset. Each method was quantitatively analyzed using the normalized root mean square (NRMS) error measure, covering a wide range of randomly introduced missing value probabilities from 0.01 to 0.2. Experiments were also undertaken on the yeast dataset, which comprised 1.7% actual missing values, to test the hypothesis that CMVE performed better not only for randomly occurring but also for a real distribution of missing values. The results confirmed that CMVE consistently demonstrated superior and robust estimation capability of missing values compared with other methods for both series types of data, for the same order of computational complexity. A concise theoretical framework has also been formulated to validate the improved performance of the CMVE

  4. Clustering with Missing Values: No Imputation Required

    Science.gov (United States)

    Wagstaff, Kiri

    2004-01-01

    Clustering algorithms can identify groups in large data sets, such as star catalogs and hyperspectral images. In general, clustering methods cannot analyze items that have missing data values. Common solutions either fill in the missing values (imputation) or ignore the missing data (marginalization). Imputed values are treated as just as reliable as the truly observed data, but they are only as good as the assumptions used to create them. In contrast, we present a method for encoding partially observed features as a set of supplemental soft constraints and introduce the KSC algorithm, which incorporates constraints into the clustering process. In experiments on artificial data and data from the Sloan Digital Sky Survey, we show that soft constraints are an effective way to enable clustering with missing values.

  5. Outlier Removal in Model-Based Missing Value Imputation for Medical Datasets

    Directory of Open Access Journals (Sweden)

    Min-Wei Huang

    2018-01-01

    Full Text Available Many real-world medical datasets contain some proportion of missing (attribute values. In general, missing value imputation can be performed to solve this problem, which is to provide estimations for the missing values by a reasoning process based on the (complete observed data. However, if the observed data contain some noisy information or outliers, the estimations of the missing values may not be reliable or may even be quite different from the real values. The aim of this paper is to examine whether a combination of instance selection from the observed data and missing value imputation offers better performance than performing missing value imputation alone. In particular, three instance selection algorithms, DROP3, GA, and IB3, and three imputation algorithms, KNNI, MLP, and SVM, are used in order to find out the best combination. The experimental results show that that performing instance selection can have a positive impact on missing value imputation over the numerical data type of medical datasets, and specific combinations of instance selection and imputation methods can improve the imputation results over the mixed data type of medical datasets. However, instance selection does not have a definitely positive impact on the imputation result for categorical medical datasets.

  6. Missing Value Imputation Based on Gaussian Mixture Model for the Internet of Things

    Directory of Open Access Journals (Sweden)

    Xiaobo Yan

    2015-01-01

    Full Text Available This paper addresses missing value imputation for the Internet of Things (IoT. Nowadays, the IoT has been used widely and commonly by a variety of domains, such as transportation and logistics domain and healthcare domain. However, missing values are very common in the IoT for a variety of reasons, which results in the fact that the experimental data are incomplete. As a result of this, some work, which is related to the data of the IoT, can’t be carried out normally. And it leads to the reduction in the accuracy and reliability of the data analysis results. This paper, for the characteristics of the data itself and the features of missing data in IoT, divides the missing data into three types and defines three corresponding missing value imputation problems. Then, we propose three new models to solve the corresponding problems, and they are model of missing value imputation based on context and linear mean (MCL, model of missing value imputation based on binary search (MBS, and model of missing value imputation based on Gaussian mixture model (MGI. Experimental results showed that the three models can improve the accuracy, reliability, and stability of missing value imputation greatly and effectively.

  7. Missing data imputation: focusing on single imputation.

    Science.gov (United States)

    Zhang, Zhongheng

    2016-01-01

    Complete case analysis is widely used for handling missing data, and it is the default method in many statistical packages. However, this method may introduce bias and some useful information will be omitted from analysis. Therefore, many imputation methods are developed to make gap end. The present article focuses on single imputation. Imputations with mean, median and mode are simple but, like complete case analysis, can introduce bias on mean and deviation. Furthermore, they ignore relationship with other variables. Regression imputation can preserve relationship between missing values and other variables. There are many sophisticated methods exist to handle missing values in longitudinal data. This article focuses primarily on how to implement R code to perform single imputation, while avoiding complex mathematical calculations.

  8. Accounting for one-channel depletion improves missing value imputation in 2-dye microarray data.

    Science.gov (United States)

    Ritz, Cecilia; Edén, Patrik

    2008-01-19

    For 2-dye microarray platforms, some missing values may arise from an un-measurably low RNA expression in one channel only. Information of such "one-channel depletion" is so far not included in algorithms for imputation of missing values. Calculating the mean deviation between imputed values and duplicate controls in five datasets, we show that KNN-based imputation gives a systematic bias of the imputed expression values of one-channel depleted spots. Evaluating the correction of this bias by cross-validation showed that the mean square deviation between imputed values and duplicates were reduced up to 51%, depending on dataset. By including more information in the imputation step, we more accurately estimate missing expression values.

  9. Missing value imputation in DNA microarrays based on conjugate gradient method.

    Science.gov (United States)

    Dorri, Fatemeh; Azmi, Paeiz; Dorri, Faezeh

    2012-02-01

    Analysis of gene expression profiles needs a complete matrix of gene array values; consequently, imputation methods have been suggested. In this paper, an algorithm that is based on conjugate gradient (CG) method is proposed to estimate missing values. k-nearest neighbors of the missed entry are first selected based on absolute values of their Pearson correlation coefficient. Then a subset of genes among the k-nearest neighbors is labeled as the best similar ones. CG algorithm with this subset as its input is then used to estimate the missing values. Our proposed CG based algorithm (CGimpute) is evaluated on different data sets. The results are compared with sequential local least squares (SLLSimpute), Bayesian principle component analysis (BPCAimpute), local least squares imputation (LLSimpute), iterated local least squares imputation (ILLSimpute) and adaptive k-nearest neighbors imputation (KNNKimpute) methods. The average of normalized root mean squares error (NRMSE) and relative NRMSE in different data sets with various missing rates shows CGimpute outperforms other methods. Copyright © 2011 Elsevier Ltd. All rights reserved.

  10. Missing value imputation for microarray gene expression data using histone acetylation information

    Directory of Open Access Journals (Sweden)

    Feng Jihua

    2008-05-01

    Full Text Available Abstract Background It is an important pre-processing step to accurately estimate missing values in microarray data, because complete datasets are required in numerous expression profile analysis in bioinformatics. Although several methods have been suggested, their performances are not satisfactory for datasets with high missing percentages. Results The paper explores the feasibility of doing missing value imputation with the help of gene regulatory mechanism. An imputation framework called histone acetylation information aided imputation method (HAIimpute method is presented. It incorporates the histone acetylation information into the conventional KNN(k-nearest neighbor and LLS(local least square imputation algorithms for final prediction of the missing values. The experimental results indicated that the use of acetylation information can provide significant improvements in microarray imputation accuracy. The HAIimpute methods consistently improve the widely used methods such as KNN and LLS in terms of normalized root mean squared error (NRMSE. Meanwhile, the genes imputed by HAIimpute methods are more correlated with the original complete genes in terms of Pearson correlation coefficients. Furthermore, the proposed methods also outperform GOimpute, which is one of the existing related methods that use the functional similarity as the external information. Conclusion We demonstrated that the using of histone acetylation information could greatly improve the performance of the imputation especially at high missing percentages. This idea can be generalized to various imputation methods to facilitate the performance. Moreover, with more knowledge accumulated on gene regulatory mechanism in addition to histone acetylation, the performance of our approach can be further improved and verified.

  11. Two-pass imputation algorithm for missing value estimation in gene expression time series.

    Science.gov (United States)

    Tsiporkova, Elena; Boeva, Veselka

    2007-10-01

    Gene expression microarray experiments frequently generate datasets with multiple values missing. However, most of the analysis, mining, and classification methods for gene expression data require a complete matrix of gene array values. Therefore, the accurate estimation of missing values in such datasets has been recognized as an important issue, and several imputation algorithms have already been proposed to the biological community. Most of these approaches, however, are not particularly suitable for time series expression profiles. In view of this, we propose a novel imputation algorithm, which is specially suited for the estimation of missing values in gene expression time series data. The algorithm utilizes Dynamic Time Warping (DTW) distance in order to measure the similarity between time expression profiles, and subsequently selects for each gene expression profile with missing values a dedicated set of candidate profiles for estimation. Three different DTW-based imputation (DTWimpute) algorithms have been considered: position-wise, neighborhood-wise, and two-pass imputation. These have initially been prototyped in Perl, and their accuracy has been evaluated on yeast expression time series data using several different parameter settings. The experiments have shown that the two-pass algorithm consistently outperforms, in particular for datasets with a higher level of missing entries, the neighborhood-wise and the position-wise algorithms. The performance of the two-pass DTWimpute algorithm has further been benchmarked against the weighted K-Nearest Neighbors algorithm, which is widely used in the biological community; the former algorithm has appeared superior to the latter one. Motivated by these findings, indicating clearly the added value of the DTW techniques for missing value estimation in time series data, we have built an optimized C++ implementation of the two-pass DTWimpute algorithm. The software also provides for a choice between three different

  12. Flexible Imputation of Missing Data

    CERN Document Server

    van Buuren, Stef

    2012-01-01

    Missing data form a problem in every scientific discipline, yet the techniques required to handle them are complicated and often lacking. One of the great ideas in statistical science--multiple imputation--fills gaps in the data with plausible values, the uncertainty of which is coded in the data itself. It also solves other problems, many of which are missing data problems in disguise. Flexible Imputation of Missing Data is supported by many examples using real data taken from the author's vast experience of collaborative research, and presents a practical guide for handling missing data unde

  13. Comparison of missing value imputation methods in time series: the case of Turkish meteorological data

    Science.gov (United States)

    Yozgatligil, Ceylan; Aslan, Sipan; Iyigun, Cem; Batmaz, Inci

    2013-04-01

    This study aims to compare several imputation methods to complete the missing values of spatio-temporal meteorological time series. To this end, six imputation methods are assessed with respect to various criteria including accuracy, robustness, precision, and efficiency for artificially created missing data in monthly total precipitation and mean temperature series obtained from the Turkish State Meteorological Service. Of these methods, simple arithmetic average, normal ratio (NR), and NR weighted with correlations comprise the simple ones, whereas multilayer perceptron type neural network and multiple imputation strategy adopted by Monte Carlo Markov Chain based on expectation-maximization (EM-MCMC) are computationally intensive ones. In addition, we propose a modification on the EM-MCMC method. Besides using a conventional accuracy measure based on squared errors, we also suggest the correlation dimension (CD) technique of nonlinear dynamic time series analysis which takes spatio-temporal dependencies into account for evaluating imputation performances. Depending on the detailed graphical and quantitative analysis, it can be said that although computational methods, particularly EM-MCMC method, are computationally inefficient, they seem favorable for imputation of meteorological time series with respect to different missingness periods considering both measures and both series studied. To conclude, using the EM-MCMC algorithm for imputing missing values before conducting any statistical analyses of meteorological data will definitely decrease the amount of uncertainty and give more robust results. Moreover, the CD measure can be suggested for the performance evaluation of missing data imputation particularly with computational methods since it gives more precise results in meteorological time series.

  14. Treatments of Missing Values in Large National Data Affect Conclusions: The Impact of Multiple Imputation on Arthroplasty Research.

    Science.gov (United States)

    Ondeck, Nathaniel T; Fu, Michael C; Skrip, Laura A; McLynn, Ryan P; Su, Edwin P; Grauer, Jonathan N

    2018-03-01

    Despite the advantages of large, national datasets, one continuing concern is missing data values. Complete case analysis, where only cases with complete data are analyzed, is commonly used rather than more statistically rigorous approaches such as multiple imputation. This study characterizes the potential selection bias introduced using complete case analysis and compares the results of common regressions using both techniques following unicompartmental knee arthroplasty. Patients undergoing unicompartmental knee arthroplasty were extracted from the 2005 to 2015 National Surgical Quality Improvement Program. As examples, the demographics of patients with and without missing preoperative albumin and hematocrit values were compared. Missing data were then treated with both complete case analysis and multiple imputation (an approach that reproduces the variation and associations that would have been present in a full dataset) and the conclusions of common regressions for adverse outcomes were compared. A total of 6117 patients were included, of which 56.7% were missing at least one value. Younger, female, and healthier patients were more likely to have missing preoperative albumin and hematocrit values. The use of complete case analysis removed 3467 patients from the study in comparison with multiple imputation which included all 6117 patients. The 2 methods of handling missing values led to differing associations of low preoperative laboratory values with commonly studied adverse outcomes. The use of complete case analysis can introduce selection bias and may lead to different conclusions in comparison with the statistically rigorous multiple imputation approach. Joint surgeons should consider the methods of handling missing values when interpreting arthroplasty research. Copyright © 2017 Elsevier Inc. All rights reserved.

  15. A nonparametric multiple imputation approach for missing categorical data

    Directory of Open Access Journals (Sweden)

    Muhan Zhou

    2017-06-01

    Full Text Available Abstract Background Incomplete categorical variables with more than two categories are common in public health data. However, most of the existing missing-data methods do not use the information from nonresponse (missingness probabilities. Methods We propose a nearest-neighbour multiple imputation approach to impute a missing at random categorical outcome and to estimate the proportion of each category. The donor set for imputation is formed by measuring distances between each missing value with other non-missing values. The distance function is calculated based on a predictive score, which is derived from two working models: one fits a multinomial logistic regression for predicting the missing categorical outcome (the outcome model and the other fits a logistic regression for predicting missingness probabilities (the missingness model. A weighting scheme is used to accommodate contributions from two working models when generating the predictive score. A missing value is imputed by randomly selecting one of the non-missing values with the smallest distances. We conduct a simulation to evaluate the performance of the proposed method and compare it with several alternative methods. A real-data application is also presented. Results The simulation study suggests that the proposed method performs well when missingness probabilities are not extreme under some misspecifications of the working models. However, the calibration estimator, which is also based on two working models, can be highly unstable when missingness probabilities for some observations are extremely high. In this scenario, the proposed method produces more stable and better estimates. In addition, proper weights need to be chosen to balance the contributions from the two working models and achieve optimal results for the proposed method. Conclusions We conclude that the proposed multiple imputation method is a reasonable approach to dealing with missing categorical outcome data with

  16. Missing Value Imputation Based on Gaussian Mixture Model for the Internet of Things

    OpenAIRE

    Yan, Xiaobo; Xiong, Weiqing; Hu, Liang; Wang, Feng; Zhao, Kuo

    2015-01-01

    This paper addresses missing value imputation for the Internet of Things (IoT). Nowadays, the IoT has been used widely and commonly by a variety of domains, such as transportation and logistics domain and healthcare domain. However, missing values are very common in the IoT for a variety of reasons, which results in the fact that the experimental data are incomplete. As a result of this, some work, which is related to the data of the IoT, can’t be carried out normally. And it leads to the red...

  17. Accounting for the Multiple Natures of Missing Values in Label-Free Quantitative Proteomics Data Sets to Compare Imputation Strategies.

    Science.gov (United States)

    Lazar, Cosmin; Gatto, Laurent; Ferro, Myriam; Bruley, Christophe; Burger, Thomas

    2016-04-01

    Missing values are a genuine issue in label-free quantitative proteomics. Recent works have surveyed the different statistical methods to conduct imputation and have compared them on real or simulated data sets and recommended a list of missing value imputation methods for proteomics application. Although insightful, these comparisons do not account for two important facts: (i) depending on the proteomics data set, the missingness mechanism may be of different natures and (ii) each imputation method is devoted to a specific type of missingness mechanism. As a result, we believe that the question at stake is not to find the most accurate imputation method in general but instead the most appropriate one. We describe a series of comparisons that support our views: For instance, we show that a supposedly "under-performing" method (i.e., giving baseline average results), if applied at the "appropriate" time in the data-processing pipeline (before or after peptide aggregation) on a data set with the "appropriate" nature of missing values, can outperform a blindly applied, supposedly "better-performing" method (i.e., the reference method from the state-of-the-art). This leads us to formulate few practical guidelines regarding the choice and the application of an imputation method in a proteomics context.

  18. Imputation of missing data in time series for air pollutants

    Science.gov (United States)

    Junger, W. L.; Ponce de Leon, A.

    2015-02-01

    Missing data are major concerns in epidemiological studies of the health effects of environmental air pollutants. This article presents an imputation-based method that is suitable for multivariate time series data, which uses the EM algorithm under the assumption of normal distribution. Different approaches are considered for filtering the temporal component. A simulation study was performed to assess validity and performance of proposed method in comparison with some frequently used methods. Simulations showed that when the amount of missing data was as low as 5%, the complete data analysis yielded satisfactory results regardless of the generating mechanism of the missing data, whereas the validity began to degenerate when the proportion of missing values exceeded 10%. The proposed imputation method exhibited good accuracy and precision in different settings with respect to the patterns of missing observations. Most of the imputations obtained valid results, even under missing not at random. The methods proposed in this study are implemented as a package called mtsdi for the statistical software system R.

  19. Dealing with missing data in a multi-question depression scale: a comparison of imputation methods

    Directory of Open Access Journals (Sweden)

    Stuart Heather

    2006-12-01

    Full Text Available Abstract Background Missing data present a challenge to many research projects. The problem is often pronounced in studies utilizing self-report scales, and literature addressing different strategies for dealing with missing data in such circumstances is scarce. The objective of this study was to compare six different imputation techniques for dealing with missing data in the Zung Self-reported Depression scale (SDS. Methods 1580 participants from a surgical outcomes study completed the SDS. The SDS is a 20 question scale that respondents complete by circling a value of 1 to 4 for each question. The sum of the responses is calculated and respondents are classified as exhibiting depressive symptoms when their total score is over 40. Missing values were simulated by randomly selecting questions whose values were then deleted (a missing completely at random simulation. Additionally, a missing at random and missing not at random simulation were completed. Six imputation methods were then considered; 1 multiple imputation, 2 single regression, 3 individual mean, 4 overall mean, 5 participant's preceding response, and 6 random selection of a value from 1 to 4. For each method, the imputed mean SDS score and standard deviation were compared to the population statistics. The Spearman correlation coefficient, percent misclassified and the Kappa statistic were also calculated. Results When 10% of values are missing, all the imputation methods except random selection produce Kappa statistics greater than 0.80 indicating 'near perfect' agreement. MI produces the most valid imputed values with a high Kappa statistic (0.89, although both single regression and individual mean imputation also produced favorable results. As the percent of missing information increased to 30%, or when unbalanced missing data were introduced, MI maintained a high Kappa statistic. The individual mean and single regression method produced Kappas in the 'substantial agreement' range

  20. Multiple imputation by chained equations for systematically and sporadically missing multilevel data.

    Science.gov (United States)

    Resche-Rigon, Matthieu; White, Ian R

    2018-06-01

    In multilevel settings such as individual participant data meta-analysis, a variable is 'systematically missing' if it is wholly missing in some clusters and 'sporadically missing' if it is partly missing in some clusters. Previously proposed methods to impute incomplete multilevel data handle either systematically or sporadically missing data, but frequently both patterns are observed. We describe a new multiple imputation by chained equations (MICE) algorithm for multilevel data with arbitrary patterns of systematically and sporadically missing variables. The algorithm is described for multilevel normal data but can easily be extended for other variable types. We first propose two methods for imputing a single incomplete variable: an extension of an existing method and a new two-stage method which conveniently allows for heteroscedastic data. We then discuss the difficulties of imputing missing values in several variables in multilevel data using MICE, and show that even the simplest joint multilevel model implies conditional models which involve cluster means and heteroscedasticity. However, a simulation study finds that the proposed methods can be successfully combined in a multilevel MICE procedure, even when cluster means are not included in the imputation models.

  1. Time Series Forecasting with Missing Values

    OpenAIRE

    Shin-Fu Wu; Chia-Yung Chang; Shie-Jue Lee

    2015-01-01

    Time series prediction has become more popular in various kinds of applications such as weather prediction, control engineering, financial analysis, industrial monitoring, etc. To deal with real-world problems, we are often faced with missing values in the data due to sensor malfunctions or human errors. Traditionally, the missing values are simply omitted or replaced by means of imputation methods. However, omitting those missing values may cause temporal discontinuity. Imputation methods, o...

  2. Which missing value imputation method to use in expression profiles: a comparative study and two selection schemes

    Directory of Open Access Journals (Sweden)

    Lotz Meredith J

    2008-01-01

    Full Text Available Abstract Background Gene expression data frequently contain missing values, however, most down-stream analyses for microarray experiments require complete data. In the literature many methods have been proposed to estimate missing values via information of the correlation patterns within the gene expression matrix. Each method has its own advantages, but the specific conditions for which each method is preferred remains largely unclear. In this report we describe an extensive evaluation of eight current imputation methods on multiple types of microarray experiments, including time series, multiple exposures, and multiple exposures × time series data. We then introduce two complementary selection schemes for determining the most appropriate imputation method for any given data set. Results We found that the optimal imputation algorithms (LSA, LLS, and BPCA are all highly competitive with each other, and that no method is uniformly superior in all the data sets we examined. The success of each method can also depend on the underlying "complexity" of the expression data, where we take complexity to indicate the difficulty in mapping the gene expression matrix to a lower-dimensional subspace. We developed an entropy measure to quantify the complexity of expression matrixes and found that, by incorporating this information, the entropy-based selection (EBS scheme is useful for selecting an appropriate imputation algorithm. We further propose a simulation-based self-training selection (STS scheme. This technique has been used previously for microarray data imputation, but for different purposes. The scheme selects the optimal or near-optimal method with high accuracy but at an increased computational cost. Conclusion Our findings provide insight into the problem of which imputation method is optimal for a given data set. Three top-performing methods (LSA, LLS and BPCA are competitive with each other. Global-based imputation methods (PLS, SVD, BPCA

  3. Which missing value imputation method to use in expression profiles: a comparative study and two selection schemes.

    Science.gov (United States)

    Brock, Guy N; Shaffer, John R; Blakesley, Richard E; Lotz, Meredith J; Tseng, George C

    2008-01-10

    Gene expression data frequently contain missing values, however, most down-stream analyses for microarray experiments require complete data. In the literature many methods have been proposed to estimate missing values via information of the correlation patterns within the gene expression matrix. Each method has its own advantages, but the specific conditions for which each method is preferred remains largely unclear. In this report we describe an extensive evaluation of eight current imputation methods on multiple types of microarray experiments, including time series, multiple exposures, and multiple exposures x time series data. We then introduce two complementary selection schemes for determining the most appropriate imputation method for any given data set. We found that the optimal imputation algorithms (LSA, LLS, and BPCA) are all highly competitive with each other, and that no method is uniformly superior in all the data sets we examined. The success of each method can also depend on the underlying "complexity" of the expression data, where we take complexity to indicate the difficulty in mapping the gene expression matrix to a lower-dimensional subspace. We developed an entropy measure to quantify the complexity of expression matrixes and found that, by incorporating this information, the entropy-based selection (EBS) scheme is useful for selecting an appropriate imputation algorithm. We further propose a simulation-based self-training selection (STS) scheme. This technique has been used previously for microarray data imputation, but for different purposes. The scheme selects the optimal or near-optimal method with high accuracy but at an increased computational cost. Our findings provide insight into the problem of which imputation method is optimal for a given data set. Three top-performing methods (LSA, LLS and BPCA) are competitive with each other. Global-based imputation methods (PLS, SVD, BPCA) performed better on mcroarray data with lower complexity

  4. A Comparison of Joint Model and Fully Conditional Specification Imputation for Multilevel Missing Data

    Science.gov (United States)

    Mistler, Stephen A.; Enders, Craig K.

    2017-01-01

    Multiple imputation methods can generally be divided into two broad frameworks: joint model (JM) imputation and fully conditional specification (FCS) imputation. JM draws missing values simultaneously for all incomplete variables using a multivariate distribution, whereas FCS imputes variables one at a time from a series of univariate conditional…

  5. Missing data imputation using statistical and machine learning methods in a real breast cancer problem.

    Science.gov (United States)

    Jerez, José M; Molina, Ignacio; García-Laencina, Pedro J; Alba, Emilio; Ribelles, Nuria; Martín, Miguel; Franco, Leonardo

    2010-10-01

    Missing data imputation is an important task in cases where it is crucial to use all available data and not discard records with missing values. This work evaluates the performance of several statistical and machine learning imputation methods that were used to predict recurrence in patients in an extensive real breast cancer data set. Imputation methods based on statistical techniques, e.g., mean, hot-deck and multiple imputation, and machine learning techniques, e.g., multi-layer perceptron (MLP), self-organisation maps (SOM) and k-nearest neighbour (KNN), were applied to data collected through the "El Álamo-I" project, and the results were then compared to those obtained from the listwise deletion (LD) imputation method. The database includes demographic, therapeutic and recurrence-survival information from 3679 women with operable invasive breast cancer diagnosed in 32 different hospitals belonging to the Spanish Breast Cancer Research Group (GEICAM). The accuracies of predictions on early cancer relapse were measured using artificial neural networks (ANNs), in which different ANNs were estimated using the data sets with imputed missing values. The imputation methods based on machine learning algorithms outperformed imputation statistical methods in the prediction of patient outcome. Friedman's test revealed a significant difference (p=0.0091) in the observed area under the ROC curve (AUC) values, and the pairwise comparison test showed that the AUCs for MLP, KNN and SOM were significantly higher (p=0.0053, p=0.0048 and p=0.0071, respectively) than the AUC from the LD-based prognosis model. The methods based on machine learning techniques were the most suited for the imputation of missing values and led to a significant enhancement of prognosis accuracy compared to imputation methods based on statistical procedures. Copyright © 2010 Elsevier B.V. All rights reserved.

  6. Integrative missing value estimation for microarray data.

    Science.gov (United States)

    Hu, Jianjun; Li, Haifeng; Waterman, Michael S; Zhou, Xianghong Jasmine

    2006-10-12

    Missing value estimation is an important preprocessing step in microarray analysis. Although several methods have been developed to solve this problem, their performance is unsatisfactory for datasets with high rates of missing data, high measurement noise, or limited numbers of samples. In fact, more than 80% of the time-series datasets in Stanford Microarray Database contain less than eight samples. We present the integrative Missing Value Estimation method (iMISS) by incorporating information from multiple reference microarray datasets to improve missing value estimation. For each gene with missing data, we derive a consistent neighbor-gene list by taking reference data sets into consideration. To determine whether the given reference data sets are sufficiently informative for integration, we use a submatrix imputation approach. Our experiments showed that iMISS can significantly and consistently improve the accuracy of the state-of-the-art Local Least Square (LLS) imputation algorithm by up to 15% improvement in our benchmark tests. We demonstrated that the order-statistics-based integrative imputation algorithms can achieve significant improvements over the state-of-the-art missing value estimation approaches such as LLS and is especially good for imputing microarray datasets with a limited number of samples, high rates of missing data, or very noisy measurements. With the rapid accumulation of microarray datasets, the performance of our approach can be further improved by incorporating larger and more appropriate reference datasets.

  7. Time Series Forecasting with Missing Values

    Directory of Open Access Journals (Sweden)

    Shin-Fu Wu

    2015-11-01

    Full Text Available Time series prediction has become more popular in various kinds of applications such as weather prediction, control engineering, financial analysis, industrial monitoring, etc. To deal with real-world problems, we are often faced with missing values in the data due to sensor malfunctions or human errors. Traditionally, the missing values are simply omitted or replaced by means of imputation methods. However, omitting those missing values may cause temporal discontinuity. Imputation methods, on the other hand, may alter the original time series. In this study, we propose a novel forecasting method based on least squares support vector machine (LSSVM. We employ the input patterns with the temporal information which is defined as local time index (LTI. Time series data as well as local time indexes are fed to LSSVM for doing forecasting without imputation. We compare the forecasting performance of our method with other imputation methods. Experimental results show that the proposed method is promising and is worth further investigations.

  8. Imputation methods for filling missing data in urban air pollution data for Malaysia

    Directory of Open Access Journals (Sweden)

    Nur Afiqah Zakaria

    2018-06-01

    Full Text Available The air quality measurement data obtained from the continuous ambient air quality monitoring (CAAQM station usually contained missing data. The missing observations of the data usually occurred due to machine failure, routine maintenance and human error. In this study, the hourly monitoring data of CO, O3, PM10, SO2, NOx, NO2, ambient temperature and humidity were used to evaluate four imputation methods (Mean Top Bottom, Linear Regression, Multiple Imputation and Nearest Neighbour. The air pollutants observations were simulated into four percentages of simulated missing data i.e. 5%, 10%, 15% and 20%. Performance measures namely the Mean Absolute Error, Root Mean Squared Error, Coefficient of Determination and Index of Agreement were used to describe the goodness of fit of the imputation methods. From the results of the performance measures, Mean Top Bottom method was selected as the most appropriate imputation method for filling in the missing values in air pollutants data.

  9. Integrative missing value estimation for microarray data

    Directory of Open Access Journals (Sweden)

    Zhou Xianghong

    2006-10-01

    Full Text Available Abstract Background Missing value estimation is an important preprocessing step in microarray analysis. Although several methods have been developed to solve this problem, their performance is unsatisfactory for datasets with high rates of missing data, high measurement noise, or limited numbers of samples. In fact, more than 80% of the time-series datasets in Stanford Microarray Database contain less than eight samples. Results We present the integrative Missing Value Estimation method (iMISS by incorporating information from multiple reference microarray datasets to improve missing value estimation. For each gene with missing data, we derive a consistent neighbor-gene list by taking reference data sets into consideration. To determine whether the given reference data sets are sufficiently informative for integration, we use a submatrix imputation approach. Our experiments showed that iMISS can significantly and consistently improve the accuracy of the state-of-the-art Local Least Square (LLS imputation algorithm by up to 15% improvement in our benchmark tests. Conclusion We demonstrated that the order-statistics-based integrative imputation algorithms can achieve significant improvements over the state-of-the-art missing value estimation approaches such as LLS and is especially good for imputing microarray datasets with a limited number of samples, high rates of missing data, or very noisy measurements. With the rapid accumulation of microarray datasets, the performance of our approach can be further improved by incorporating larger and more appropriate reference datasets.

  10. A New Missing Data Imputation Algorithm Applied to Electrical Data Loggers

    Directory of Open Access Journals (Sweden)

    Concepción Crespo Turrado

    2015-12-01

    Full Text Available Nowadays, data collection is a key process in the study of electrical power networks when searching for harmonics and a lack of balance among phases. In this context, the lack of data of any of the main electrical variables (phase-to-neutral voltage, phase-to-phase voltage, and current in each phase and power factor adversely affects any time series study performed. When this occurs, a data imputation process must be accomplished in order to substitute the data that is missing for estimated values. This paper presents a novel missing data imputation method based on multivariate adaptive regression splines (MARS and compares it with the well-known technique called multivariate imputation by chained equations (MICE. The results obtained demonstrate how the proposed method outperforms the MICE algorithm.

  11. Multiple imputation of missing passenger boarding data in the national census of ferry operators

    Science.gov (United States)

    2008-08-01

    This report presents findings from the 2006 National Census of Ferry Operators (NCFO) augmented with imputed values for passengers and passenger miles. Due to the imputation procedures used to calculate missing data, totals in Table 1 may not corresp...

  12. Application of Multiple Imputation for Missing Values in Three-Way Three-Mode Multi-Environment Trial Data.

    Science.gov (United States)

    Tian, Ting; McLachlan, Geoffrey J; Dieters, Mark J; Basford, Kaye E

    2015-01-01

    It is a common occurrence in plant breeding programs to observe missing values in three-way three-mode multi-environment trial (MET) data. We proposed modifications of models for estimating missing observations for these data arrays, and developed a novel approach in terms of hierarchical clustering. Multiple imputation (MI) was used in four ways, multiple agglomerative hierarchical clustering, normal distribution model, normal regression model, and predictive mean match. The later three models used both Bayesian analysis and non-Bayesian analysis, while the first approach used a clustering procedure with randomly selected attributes and assigned real values from the nearest neighbour to the one with missing observations. Different proportions of data entries in six complete datasets were randomly selected to be missing and the MI methods were compared based on the efficiency and accuracy of estimating those values. The results indicated that the models using Bayesian analysis had slightly higher accuracy of estimation performance than those using non-Bayesian analysis but they were more time-consuming. However, the novel approach of multiple agglomerative hierarchical clustering demonstrated the overall best performances.

  13. Multiple imputation for multivariate data with missing and below-threshold measurements: time-series concentrations of pollutants in the Arctic.

    Science.gov (United States)

    Hopke, P K; Liu, C; Rubin, D B

    2001-03-01

    Many chemical and environmental data sets are complicated by the existence of fully missing values or censored values known to lie below detection thresholds. For example, week-long samples of airborne particulate matter were obtained at Alert, NWT, Canada, between 1980 and 1991, where some of the concentrations of 24 particulate constituents were coarsened in the sense of being either fully missing or below detection limits. To facilitate scientific analysis, it is appealing to create complete data by filling in missing values so that standard complete-data methods can be applied. We briefly review commonly used strategies for handling missing values and focus on the multiple-imputation approach, which generally leads to valid inferences when faced with missing data. Three statistical models are developed for multiply imputing the missing values of airborne particulate matter. We expect that these models are useful for creating multiple imputations in a variety of incomplete multivariate time series data sets.

  14. FCMPSO: An Imputation for Missing Data Features in Heart Disease Classification

    Science.gov (United States)

    Salleh, Mohd Najib Mohd; Ashikin Samat, Nurul

    2017-08-01

    The application of data mining and machine learning in directing clinical research into possible hidden knowledge is becoming greatly influential in medical areas. Heart Disease is a killer disease around the world, and early prevention through efficient methods can help to reduce the mortality number. Medical data may contain many uncertainties, as they are fuzzy and vague in nature. Nonetheless, imprecise features data such as no values and missing values can affect quality of classification results. Nevertheless, the other complete features are still capable to give information in certain features. Therefore, an imputation approach based on Fuzzy C-Means and Particle Swarm Optimization (FCMPSO) is developed in preprocessing stage to help fill in the missing values. Then, the complete dataset is trained in classification algorithm, Decision Tree. The experiment is trained with Heart Disease dataset and the performance is analysed using accuracy, precision, and ROC values. Results show that the performance of Decision Tree is increased after the application of FCMSPO for imputation.

  15. Missing Data Imputation of Solar Radiation Data under Different Atmospheric Conditions

    Science.gov (United States)

    Turrado, Concepción Crespo; López, María del Carmen Meizoso; Lasheras, Fernando Sánchez; Gómez, Benigno Antonio Rodríguez; Rollé, José Luis Calvo; de Cos Juez, Francisco Javier

    2014-01-01

    Global solar broadband irradiance on a planar surface is measured at weather stations by pyranometers. In the case of the present research, solar radiation values from nine meteorological stations of the MeteoGalicia real-time observational network, captured and stored every ten minutes, are considered. In this kind of record, the lack of data and/or the presence of wrong values adversely affects any time series study. Consequently, when this occurs, a data imputation process must be performed in order to replace missing data with estimated values. This paper aims to evaluate the multivariate imputation of ten-minute scale data by means of the chained equations method (MICE). This method allows the network itself to impute the missing or wrong data of a solar radiation sensor, by using either all or just a group of the measurements of the remaining sensors. Very good results have been obtained with the MICE method in comparison with other methods employed in this field such as Inverse Distance Weighting (IDW) and Multiple Linear Regression (MLR). The average RMSE value of the predictions for the MICE algorithm was 13.37% while that for the MLR it was 28.19%, and 31.68% for the IDW. PMID:25356644

  16. Missing Data Imputation of Solar Radiation Data under Different Atmospheric Conditions

    Directory of Open Access Journals (Sweden)

    Concepción Crespo Turrado

    2014-10-01

    Full Text Available Global solar broadband irradiance on a planar surface is measured at weather stations by pyranometers. In the case of the present research, solar radiation values from nine meteorological stations of the MeteoGalicia real-time observational network, captured and stored every ten minutes, are considered. In this kind of record, the lack of data and/or the presence of wrong values adversely affects any time series study. Consequently, when this occurs, a data imputation process must be performed in order to replace missing data with estimated values. This paper aims to evaluate the multivariate imputation of ten-minute scale data by means of the chained equations method (MICE. This method allows the network itself to impute the missing or wrong data of a solar radiation sensor, by using either all or just a group of the measurements of the remaining sensors. Very good results have been obtained with the MICE method in comparison with other methods employed in this field such as Inverse Distance Weighting (IDW and Multiple Linear Regression (MLR. The average RMSE value of the predictions for the MICE algorithm was 13.37% while that for the MLR it was 28.19%, and 31.68% for the IDW.

  17. Missing data imputation of solar radiation data under different atmospheric conditions.

    Science.gov (United States)

    Turrado, Concepción Crespo; López, María Del Carmen Meizoso; Lasheras, Fernando Sánchez; Gómez, Benigno Antonio Rodríguez; Rollé, José Luis Calvo; Juez, Francisco Javier de Cos

    2014-10-29

    Global solar broadband irradiance on a planar surface is measured at weather stations by pyranometers. In the case of the present research, solar radiation values from nine meteorological stations of the MeteoGalicia real-time observational network, captured and stored every ten minutes, are considered. In this kind of record, the lack of data and/or the presence of wrong values adversely affects any time series study. Consequently, when this occurs, a data imputation process must be performed in order to replace missing data with estimated values. This paper aims to evaluate the multivariate imputation of ten-minute scale data by means of the chained equations method (MICE). This method allows the network itself to impute the missing or wrong data of a solar radiation sensor, by using either all or just a group of the measurements of the remaining sensors. Very good results have been obtained with the MICE method in comparison with other methods employed in this field such as Inverse Distance Weighting (IDW) and Multiple Linear Regression (MLR). The average RMSE value of the predictions for the MICE algorithm was 13.37% while that for the MLR it was 28.19%, and 31.68% for the IDW.

  18. Simple nuclear norm based algorithms for imputing missing data and forecasting in time series

    OpenAIRE

    Butcher, Holly Louise; Gillard, Jonathan William

    2017-01-01

    There has been much recent progress on the use of the nuclear norm for the so-called matrix completion problem (the problem of imputing missing values of a matrix). In this paper we investigate the use of the nuclear norm for modelling time series, with particular attention to imputing missing data and forecasting. We introduce a simple alternating projections type algorithm based on the nuclear norm for these tasks, and consider a number of practical examples.

  19. Data driven estimation of imputation error-a strategy for imputation with a reject option

    DEFF Research Database (Denmark)

    Bak, Nikolaj; Hansen, Lars Kai

    2016-01-01

    Missing data is a common problem in many research fields and is a challenge that always needs careful considerations. One approach is to impute the missing values, i.e., replace missing values with estimates. When imputation is applied, it is typically applied to all records with missing values i...

  20. Randomly and Non-Randomly Missing Renal Function Data in the Strong Heart Study: A Comparison of Imputation Methods.

    Directory of Open Access Journals (Sweden)

    Nawar Shara

    Full Text Available Kidney and cardiovascular disease are widespread among populations with high prevalence of diabetes, such as American Indians participating in the Strong Heart Study (SHS. Studying these conditions simultaneously in longitudinal studies is challenging, because the morbidity and mortality associated with these diseases result in missing data, and these data are likely not missing at random. When such data are merely excluded, study findings may be compromised. In this article, a subset of 2264 participants with complete renal function data from Strong Heart Exams 1 (1989-1991, 2 (1993-1995, and 3 (1998-1999 was used to examine the performance of five methods used to impute missing data: listwise deletion, mean of serial measures, adjacent value, multiple imputation, and pattern-mixture. Three missing at random models and one non-missing at random model were used to compare the performance of the imputation techniques on randomly and non-randomly missing data. The pattern-mixture method was found to perform best for imputing renal function data that were not missing at random. Determining whether data are missing at random or not can help in choosing the imputation method that will provide the most accurate results.

  1. Missing data treatments matter: an analysis of multiple imputation for anterior cervical discectomy and fusion procedures.

    Science.gov (United States)

    Ondeck, Nathaniel T; Fu, Michael C; Skrip, Laura A; McLynn, Ryan P; Cui, Jonathan J; Basques, Bryce A; Albert, Todd J; Grauer, Jonathan N

    2018-04-09

    The presence of missing data is a limitation of large datasets, including the National Surgical Quality Improvement Program (NSQIP). In addressing this issue, most studies use complete case analysis, which excludes cases with missing data, thus potentially introducing selection bias. Multiple imputation, a statistically rigorous approach that approximates missing data and preserves sample size, may be an improvement over complete case analysis. The present study aims to evaluate the impact of using multiple imputation in comparison with complete case analysis for assessing the associations between preoperative laboratory values and adverse outcomes following anterior cervical discectomy and fusion (ACDF) procedures. This is a retrospective review of prospectively collected data. Patients undergoing one-level ACDF were identified in NSQIP 2012-2015. Perioperative adverse outcome variables assessed included the occurrence of any adverse event, severe adverse events, and hospital readmission. Missing preoperative albumin and hematocrit values were handled using complete case analysis and multiple imputation. These preoperative laboratory levels were then tested for associations with 30-day postoperative outcomes using logistic regression. A total of 11,999 patients were included. Of this cohort, 63.5% of patients had missing preoperative albumin and 9.9% had missing preoperative hematocrit. When using complete case analysis, only 4,311 patients were studied. The removed patients were significantly younger, healthier, of a common body mass index, and male. Logistic regression analysis failed to identify either preoperative hypoalbuminemia or preoperative anemia as significantly associated with adverse outcomes. When employing multiple imputation, all 11,999 patients were included. Preoperative hypoalbuminemia was significantly associated with the occurrence of any adverse event and severe adverse events. Preoperative anemia was significantly associated with the

  2. Different methods for analysing and imputation missing values in wind speed series; La problematica de la calidad de la informacion en series de velocidad del viento-metodologias de analisis y imputacion de datos faltantes

    Energy Technology Data Exchange (ETDEWEB)

    Ferreira, A. M.

    2004-07-01

    This study concerns about different methods for analysing and imputation missing values in wind speed series. The algorithm EM and a methodology derivated from the sequential hot deck have been utilized. Series with missing values imputed are compared with original and complete series, using several criteria, such the wind potential; and appears to exist a significant goodness of fit between the estimates and real values. (Author)

  3. VIGAN: Missing View Imputation with Generative Adversarial Networks.

    Science.gov (United States)

    Shang, Chao; Palmer, Aaron; Sun, Jiangwen; Chen, Ko-Shin; Lu, Jin; Bi, Jinbo

    2017-01-01

    In an era when big data are becoming the norm, there is less concern with the quantity but more with the quality and completeness of the data. In many disciplines, data are collected from heterogeneous sources, resulting in multi-view or multi-modal datasets. The missing data problem has been challenging to address in multi-view data analysis. Especially, when certain samples miss an entire view of data, it creates the missing view problem. Classic multiple imputations or matrix completion methods are hardly effective here when no information can be based on in the specific view to impute data for such samples. The commonly-used simple method of removing samples with a missing view can dramatically reduce sample size, thus diminishing the statistical power of a subsequent analysis. In this paper, we propose a novel approach for view imputation via generative adversarial networks (GANs), which we name by VIGAN. This approach first treats each view as a separate domain and identifies domain-to-domain mappings via a GAN using randomly-sampled data from each view, and then employs a multi-modal denoising autoencoder (DAE) to reconstruct the missing view from the GAN outputs based on paired data across the views. Then, by optimizing the GAN and DAE jointly, our model enables the knowledge integration for domain mappings and view correspondences to effectively recover the missing view. Empirical results on benchmark datasets validate the VIGAN approach by comparing against the state of the art. The evaluation of VIGAN in a genetic study of substance use disorders further proves the effectiveness and usability of this approach in life science.

  4. Imputing data that are missing at high rates using a boosting algorithm

    Energy Technology Data Exchange (ETDEWEB)

    Cauthen, Katherine Regina [Sandia National Lab. (SNL-NM), Albuquerque, NM (United States); Lambert, Gregory [Apple Inc., Cupertino, CA (United States); Ray, Jaideep [Sandia National Lab. (SNL-CA), Livermore, CA (United States); Lefantzi, Sophia [Sandia National Lab. (SNL-CA), Livermore, CA (United States)

    2016-09-01

    Traditional multiple imputation approaches may perform poorly for datasets with high rates of missingness unless many m imputations are used. This paper implements an alternative machine learning-based approach to imputing data that are missing at high rates. Here, we use boosting to create a strong learner from a weak learner fitted to a dataset missing many observations. This approach may be applied to a variety of types of learners (models). The approach is demonstrated by application to a spatiotemporal dataset for predicting dengue outbreaks in India from meteorological covariates. A Bayesian spatiotemporal CAR model is boosted to produce imputations, and the overall RMSE from a k-fold cross-validation is used to assess imputation accuracy.

  5. Gaussian mixture clustering and imputation of microarray data.

    Science.gov (United States)

    Ouyang, Ming; Welsh, William J; Georgopoulos, Panos

    2004-04-12

    In microarray experiments, missing entries arise from blemishes on the chips. In large-scale studies, virtually every chip contains some missing entries and more than 90% of the genes are affected. Many analysis methods require a full set of data. Either those genes with missing entries are excluded, or the missing entries are filled with estimates prior to the analyses. This study compares methods of missing value estimation. Two evaluation metrics of imputation accuracy are employed. First, the root mean squared error measures the difference between the true values and the imputed values. Second, the number of mis-clustered genes measures the difference between clustering with true values and that with imputed values; it examines the bias introduced by imputation to clustering. The Gaussian mixture clustering with model averaging imputation is superior to all other imputation methods, according to both evaluation metrics, on both time-series (correlated) and non-time series (uncorrelated) data sets.

  6. Use of Multiple Imputation Method to Improve Estimation of Missing Baseline Serum Creatinine in Acute Kidney Injury Research

    Science.gov (United States)

    Peterson, Josh F.; Eden, Svetlana K.; Moons, Karel G.; Ikizler, T. Alp; Matheny, Michael E.

    2013-01-01

    Summary Background and objectives Baseline creatinine (BCr) is frequently missing in AKI studies. Common surrogate estimates can misclassify AKI and adversely affect the study of related outcomes. This study examined whether multiple imputation improved accuracy of estimating missing BCr beyond current recommendations to apply assumed estimated GFR (eGFR) of 75 ml/min per 1.73 m2 (eGFR 75). Design, setting, participants, & measurements From 41,114 unique adult admissions (13,003 with and 28,111 without BCr data) at Vanderbilt University Hospital between 2006 and 2008, a propensity score model was developed to predict likelihood of missing BCr. Propensity scoring identified 6502 patients with highest likelihood of missing BCr among 13,003 patients with known BCr to simulate a “missing” data scenario while preserving actual reference BCr. Within this cohort (n=6502), the ability of various multiple-imputation approaches to estimate BCr and classify AKI were compared with that of eGFR 75. Results All multiple-imputation methods except the basic one more closely approximated actual BCr than did eGFR 75. Total AKI misclassification was lower with multiple imputation (full multiple imputation + serum creatinine) (9.0%) than with eGFR 75 (12.3%; Pcreatinine) (15.3%) versus eGFR 75 (40.5%; P<0.001). Multiple imputation improved specificity and positive predictive value for detecting AKI at the expense of modestly decreasing sensitivity relative to eGFR 75. Conclusions Multiple imputation can improve accuracy in estimating missing BCr and reduce misclassification of AKI beyond currently proposed methods. PMID:23037980

  7. Learning-Based Adaptive Imputation Methodwith kNN Algorithm for Missing Power Data

    Directory of Open Access Journals (Sweden)

    Minkyung Kim

    2017-10-01

    Full Text Available This paper proposes a learning-based adaptive imputation method (LAI for imputing missing power data in an energy system. This method estimates the missing power data by using the pattern that appears in the collected data. Here, in order to capture the patterns from past power data, we newly model a feature vector by using past data and its variations. The proposed LAI then learns the optimal length of the feature vector and the optimal historical length, which are significant hyper parameters of the proposed method, by utilizing intentional missing data. Based on a weighted distance between feature vectors representing a missing situation and past situation, missing power data are estimated by referring to the k most similar past situations in the optimal historical length. We further extend the proposed LAI to alleviate the effect of unexpected variation in power data and refer to this new approach as the extended LAI method (eLAI. The eLAI selects a method between linear interpolation (LI and the proposed LAI to improve accuracy under unexpected variations. Finally, from a simulation under various energy consumption profiles, we verify that the proposed eLAI achieves about a 74% reduction of the average imputation error in an energy system, compared to the existing imputation methods.

  8. An Improved Fuzzy Based Missing Value Estimation in DNA Microarray Validated by Gene Ranking

    Directory of Open Access Journals (Sweden)

    Sujay Saha

    2016-01-01

    Full Text Available Most of the gene expression data analysis algorithms require the entire gene expression matrix without any missing values. Hence, it is necessary to devise methods which would impute missing data values accurately. There exist a number of imputation algorithms to estimate those missing values. This work starts with a microarray dataset containing multiple missing values. We first apply the modified version of the fuzzy theory based existing method LRFDVImpute to impute multiple missing values of time series gene expression data and then validate the result of imputation by genetic algorithm (GA based gene ranking methodology along with some regular statistical validation techniques, like RMSE method. Gene ranking, as far as our knowledge, has not been used yet to validate the result of missing value estimation. Firstly, the proposed method has been tested on the very popular Spellman dataset and results show that error margins have been drastically reduced compared to some previous works, which indirectly validates the statistical significance of the proposed method. Then it has been applied on four other 2-class benchmark datasets, like Colorectal Cancer tumours dataset (GDS4382, Breast Cancer dataset (GSE349-350, Prostate Cancer dataset, and DLBCL-FL (Leukaemia for both missing value estimation and ranking the genes, and the results show that the proposed method can reach 100% classification accuracy with very few dominant genes, which indirectly validates the biological significance of the proposed method.

  9. The multiple imputation method: a case study involving secondary data analysis.

    Science.gov (United States)

    Walani, Salimah R; Cleland, Charles M

    2015-05-01

    To illustrate with the example of a secondary data analysis study the use of the multiple imputation method to replace missing data. Most large public datasets have missing data, which need to be handled by researchers conducting secondary data analysis studies. Multiple imputation is a technique widely used to replace missing values while preserving the sample size and sampling variability of the data. The 2004 National Sample Survey of Registered Nurses. The authors created a model to impute missing values using the chained equation method. They used imputation diagnostics procedures and conducted regression analysis of imputed data to determine the differences between the log hourly wages of internationally educated and US-educated registered nurses. The authors used multiple imputation procedures to replace missing values in a large dataset with 29,059 observations. Five multiple imputed datasets were created. Imputation diagnostics using time series and density plots showed that imputation was successful. The authors also present an example of the use of multiple imputed datasets to conduct regression analysis to answer a substantive research question. Multiple imputation is a powerful technique for imputing missing values in large datasets while preserving the sample size and variance of the data. Even though the chained equation method involves complex statistical computations, recent innovations in software and computation have made it possible for researchers to conduct this technique on large datasets. The authors recommend nurse researchers use multiple imputation methods for handling missing data to improve the statistical power and external validity of their studies.

  10. [Imputing missing data in public health: general concepts and application to dichotomous variables].

    Science.gov (United States)

    Hernández, Gilma; Moriña, David; Navarro, Albert

    The presence of missing data in collected variables is common in health surveys, but the subsequent imputation thereof at the time of analysis is not. Working with imputed data may have certain benefits regarding the precision of the estimators and the unbiased identification of associations between variables. The imputation process is probably still little understood by many non-statisticians, who view this process as highly complex and with an uncertain goal. To clarify these questions, this note aims to provide a straightforward, non-exhaustive overview of the imputation process to enable public health researchers ascertain its strengths. All this in the context of dichotomous variables which are commonplace in public health. To illustrate these concepts, an example in which missing data is handled by means of simple and multiple imputation is introduced. Copyright © 2017 SESPAS. Publicado por Elsevier España, S.L.U. All rights reserved.

  11. On multivariate imputation and forecasting of decadal wind speed missing data.

    Science.gov (United States)

    Wesonga, Ronald

    2015-01-01

    This paper demonstrates the application of multiple imputations by chained equations and time series forecasting of wind speed data. The study was motivated by the high prevalence of missing wind speed historic data. Findings based on the fully conditional specification under multiple imputations by chained equations, provided reliable wind speed missing data imputations. Further, the forecasting model shows, the smoothing parameter, alpha (0.014) close to zero, confirming that recent past observations are more suitable for use to forecast wind speeds. The maximum decadal wind speed for Entebbe International Airport was estimated to be 17.6 metres per second at a 0.05 level of significance with a bound on the error of estimation of 10.8 metres per second. The large bound on the error of estimations confirms the dynamic tendencies of wind speed at the airport under study.

  12. Defining, evaluating, and removing bias induced by linear imputation in longitudinal clinical trials with MNAR missing data.

    Science.gov (United States)

    Helms, Ronald W; Reece, Laura Helms; Helms, Russell W; Helms, Mary W

    2011-03-01

    Missing not at random (MNAR) post-dropout missing data from a longitudinal clinical trial result in the collection of "biased data," which leads to biased estimators and tests of corrupted hypotheses. In a full rank linear model analysis the model equation, E[Y] = Xβ, leads to the definition of the primary parameter β = (X'X)(-1)X'E[Y], and the definition of linear secondary parameters of the form θ = Lβ = L(X'X)(-1)X'E[Y], including, for example, a parameter representing a "treatment effect." These parameters depend explicitly on E[Y], which raises the questions: What is E[Y] when some elements of the incomplete random vector Y are not observed and MNAR, or when such a Y is "completed" via imputation? We develop a rigorous, readily interpretable definition of E[Y] in this context that leads directly to definitions of β, Bias(β) = E[β] - β, Bias(θ) = E[θ] - Lβ, and the extent of hypothesis corruption. These definitions provide a basis for evaluating, comparing, and removing biases induced by various linear imputation methods for MNAR incomplete data from longitudinal clinical trials. Linear imputation methods use earlier data from a subject to impute values for post-dropout missing values and include "Last Observation Carried Forward" (LOCF) and "Baseline Observation Carried Forward" (BOCF), among others. We illustrate the methods of evaluating, comparing, and removing biases and the effects of testing corresponding corrupted hypotheses via a hypothetical but very realistic longitudinal analgesic clinical trial.

  13. Comparison of results from different imputation techniques for missing data from an anti-obesity drug trial

    DEFF Research Database (Denmark)

    Jørgensen, Anders W.; Lundstrøm, Lars H; Wetterslev, Jørn

    2014-01-01

    BACKGROUND: In randomised trials of medical interventions, the most reliable analysis follows the intention-to-treat (ITT) principle. However, the ITT analysis requires that missing outcome data have to be imputed. Different imputation techniques may give different results and some may lead to bias...... of handling missing data in a 60-week placebo controlled anti-obesity drug trial on topiramate. METHODS: We compared an analysis of complete cases with datasets where missing body weight measurements had been replaced using three different imputation methods: LOCF, baseline carried forward (BOCF) and MI...

  14. Missing in space: an evaluation of imputation methods for missing data in spatial analysis of risk factors for type II diabetes.

    Science.gov (United States)

    Baker, Jannah; White, Nicole; Mengersen, Kerrie

    2014-11-20

    Spatial analysis is increasingly important for identifying modifiable geographic risk factors for disease. However, spatial health data from surveys are often incomplete, ranging from missing data for only a few variables, to missing data for many variables. For spatial analyses of health outcomes, selection of an appropriate imputation method is critical in order to produce the most accurate inferences. We present a cross-validation approach to select between three imputation methods for health survey data with correlated lifestyle covariates, using as a case study, type II diabetes mellitus (DM II) risk across 71 Queensland Local Government Areas (LGAs). We compare the accuracy of mean imputation to imputation using multivariate normal and conditional autoregressive prior distributions. Choice of imputation method depends upon the application and is not necessarily the most complex method. Mean imputation was selected as the most accurate method in this application. Selecting an appropriate imputation method for health survey data, after accounting for spatial correlation and correlation between covariates, allows more complete analysis of geographic risk factors for disease with more confidence in the results to inform public policy decision-making.

  15. Imputation by the mean score should be avoided when validating a Patient Reported Outcomes questionnaire by a Rasch model in presence of informative missing data

    LENUS (Irish Health Repository)

    Hardouin, Jean-Benoit

    2011-07-14

    Abstract Background Nowadays, more and more clinical scales consisting in responses given by the patients to some items (Patient Reported Outcomes - PRO), are validated with models based on Item Response Theory, and more specifically, with a Rasch model. In the validation sample, presence of missing data is frequent. The aim of this paper is to compare sixteen methods for handling the missing data (mainly based on simple imputation) in the context of psychometric validation of PRO by a Rasch model. The main indexes used for validation by a Rasch model are compared. Methods A simulation study was performed allowing to consider several cases, notably the possibility for the missing values to be informative or not and the rate of missing data. Results Several imputations methods produce bias on psychometrical indexes (generally, the imputation methods artificially improve the psychometric qualities of the scale). In particular, this is the case with the method based on the Personal Mean Score (PMS) which is the most commonly used imputation method in practice. Conclusions Several imputation methods should be avoided, in particular PMS imputation. From a general point of view, it is important to use an imputation method that considers both the ability of the patient (measured for example by his\\/her score), and the difficulty of the item (measured for example by its rate of favourable responses). Another recommendation is to always consider the addition of a random process in the imputation method, because such a process allows reducing the bias. Last, the analysis realized without imputation of the missing data (available case analyses) is an interesting alternative to the simple imputation in this context.

  16. Improving cluster-based missing value estimation of DNA microarray data.

    Science.gov (United States)

    Brás, Lígia P; Menezes, José C

    2007-06-01

    We present a modification of the weighted K-nearest neighbours imputation method (KNNimpute) for missing values (MVs) estimation in microarray data based on the reuse of estimated data. The method was called iterative KNN imputation (IKNNimpute) as the estimation is performed iteratively using the recently estimated values. The estimation efficiency of IKNNimpute was assessed under different conditions (data type, fraction and structure of missing data) by the normalized root mean squared error (NRMSE) and the correlation coefficients between estimated and true values, and compared with that of other cluster-based estimation methods (KNNimpute and sequential KNN). We further investigated the influence of imputation on the detection of differentially expressed genes using SAM by examining the differentially expressed genes that are lost after MV estimation. The performance measures give consistent results, indicating that the iterative procedure of IKNNimpute can enhance the prediction ability of cluster-based methods in the presence of high missing rates, in non-time series experiments and in data sets comprising both time series and non-time series data, because the information of the genes having MVs is used more efficiently and the iterative procedure allows refining the MV estimates. More importantly, IKNN has a smaller detrimental effect on the detection of differentially expressed genes.

  17. TRANSPOSABLE REGULARIZED COVARIANCE MODELS WITH AN APPLICATION TO MISSING DATA IMPUTATION.

    Science.gov (United States)

    Allen, Genevera I; Tibshirani, Robert

    2010-06-01

    Missing data estimation is an important challenge with high-dimensional data arranged in the form of a matrix. Typically this data matrix is transposable , meaning that either the rows, columns or both can be treated as features. To model transposable data, we present a modification of the matrix-variate normal, the mean-restricted matrix-variate normal , in which the rows and columns each have a separate mean vector and covariance matrix. By placing additive penalties on the inverse covariance matrices of the rows and columns, these so called transposable regularized covariance models allow for maximum likelihood estimation of the mean and non-singular covariance matrices. Using these models, we formulate EM-type algorithms for missing data imputation in both the multivariate and transposable frameworks. We present theoretical results exploiting the structure of our transposable models that allow these models and imputation methods to be applied to high-dimensional data. Simulations and results on microarray data and the Netflix data show that these imputation techniques often outperform existing methods and offer a greater degree of flexibility.

  18. TRIP: An interactive retrieving-inferring data imputation approach

    KAUST Repository

    Li, Zhixu

    2016-06-25

    Data imputation aims at filling in missing attribute values in databases. Existing imputation approaches to nonquantitive string data can be roughly put into two categories: (1) inferring-based approaches [2], and (2) retrieving-based approaches [1]. Specifically, the inferring-based approaches find substitutes or estimations for the missing ones from the complete part of the data set. However, they typically fall short in filling in unique missing attribute values which do not exist in the complete part of the data set [1]. The retrieving-based approaches resort to external resources for help by formulating proper web search queries to retrieve web pages containing the missing values from the Web, and then extracting the missing values from the retrieved web pages [1]. This webbased retrieving approach reaches a high imputation precision and recall, but on the other hand, issues a large number of web search queries, which brings a large overhead [1]. © 2016 IEEE.

  19. TRIP: An interactive retrieving-inferring data imputation approach

    KAUST Repository

    Li, Zhixu; Qin, Lu; Cheng, Hong; Zhang, Xiangliang; Zhou, Xiaofang

    2016-01-01

    Data imputation aims at filling in missing attribute values in databases. Existing imputation approaches to nonquantitive string data can be roughly put into two categories: (1) inferring-based approaches [2], and (2) retrieving-based approaches [1]. Specifically, the inferring-based approaches find substitutes or estimations for the missing ones from the complete part of the data set. However, they typically fall short in filling in unique missing attribute values which do not exist in the complete part of the data set [1]. The retrieving-based approaches resort to external resources for help by formulating proper web search queries to retrieve web pages containing the missing values from the Web, and then extracting the missing values from the retrieved web pages [1]. This webbased retrieving approach reaches a high imputation precision and recall, but on the other hand, issues a large number of web search queries, which brings a large overhead [1]. © 2016 IEEE.

  20. R package imputeTestbench to compare imputations methods for univariate time series

    OpenAIRE

    Bokde, Neeraj; Kulat, Kishore; Beck, Marcus W; Asencio-Cortés, Gualberto

    2016-01-01

    This paper describes the R package imputeTestbench that provides a testbench for comparing imputation methods for missing data in univariate time series. The imputeTestbench package can be used to simulate the amount and type of missing data in a complete dataset and compare filled data using different imputation methods. The user has the option to simulate missing data by removing observations completely at random or in blocks of different sizes. Several default imputation methods are includ...

  1. Comparison of methods for dealing with missing values in the EPV-R.

    Science.gov (United States)

    Paniagua, David; Amor, Pedro J; Echeburúa, Enrique; Abad, Francisco J

    2017-08-01

    The development of an effective instrument to assess the risk of partner violence is a topic of great social relevance. This study evaluates the scale of “Predicción del Riesgo de Violencia Grave Contra la Pareja” –Revisada– (EPV-R - Severe Intimate Partner Violence Risk Prediction Scale-Revised), a tool developed in Spain, which is facing the problem of how to treat the high rate of missing values, as is usual in this type of scale. First, responses to the EPV-R in a sample of 1215 male abusers who were reported to the police were used to analyze the patterns of occurrence of missing values, as well as the factor structure. Second, we analyzed the performance of various imputation methods using simulated data that emulates the missing data mechanism found in the empirical database. The imputation procedure originally proposed by the authors of the scale provides acceptable results, although the application of a method based on the Item Response Theory could provide greater accuracy and offers some additional advantages. Item Response Theory appears to be a useful tool for imputing missing data in this type of questionnaire.

  2. A web-based approach to data imputation

    KAUST Repository

    Li, Zhixu

    2013-10-24

    In this paper, we present WebPut, a prototype system that adopts a novel web-based approach to the data imputation problem. Towards this, Webput utilizes the available information in an incomplete database in conjunction with the data consistency principle. Moreover, WebPut extends effective Information Extraction (IE) methods for the purpose of formulating web search queries that are capable of effectively retrieving missing values with high accuracy. WebPut employs a confidence-based scheme that efficiently leverages our suite of data imputation queries to automatically select the most effective imputation query for each missing value. A greedy iterative algorithm is proposed to schedule the imputation order of the different missing values in a database, and in turn the issuing of their corresponding imputation queries, for improving the accuracy and efficiency of WebPut. Moreover, several optimization techniques are also proposed to reduce the cost of estimating the confidence of imputation queries at both the tuple-level and the database-level. Experiments based on several real-world data collections demonstrate not only the effectiveness of WebPut compared to existing approaches, but also the efficiency of our proposed algorithms and optimization techniques. © 2013 Springer Science+Business Media New York.

  3. iVAR: a program for imputing missing data in multivariate time series using vector autoregressive models.

    Science.gov (United States)

    Liu, Siwei; Molenaar, Peter C M

    2014-12-01

    This article introduces iVAR, an R program for imputing missing data in multivariate time series on the basis of vector autoregressive (VAR) models. We conducted a simulation study to compare iVAR with three methods for handling missing data: listwise deletion, imputation with sample means and variances, and multiple imputation ignoring time dependency. The results showed that iVAR produces better estimates for the cross-lagged coefficients than do the other three methods. We demonstrate the use of iVAR with an empirical example of time series electrodermal activity data and discuss the advantages and limitations of the program.

  4. An Overview and Evaluation of Recent Machine Learning Imputation Methods Using Cardiac Imaging Data.

    Science.gov (United States)

    Liu, Yuzhe; Gopalakrishnan, Vanathi

    2017-03-01

    Many clinical research datasets have a large percentage of missing values that directly impacts their usefulness in yielding high accuracy classifiers when used for training in supervised machine learning. While missing value imputation methods have been shown to work well with smaller percentages of missing values, their ability to impute sparse clinical research data can be problem specific. We previously attempted to learn quantitative guidelines for ordering cardiac magnetic resonance imaging during the evaluation for pediatric cardiomyopathy, but missing data significantly reduced our usable sample size. In this work, we sought to determine if increasing the usable sample size through imputation would allow us to learn better guidelines. We first review several machine learning methods for estimating missing data. Then, we apply four popular methods (mean imputation, decision tree, k-nearest neighbors, and self-organizing maps) to a clinical research dataset of pediatric patients undergoing evaluation for cardiomyopathy. Using Bayesian Rule Learning (BRL) to learn ruleset models, we compared the performance of imputation-augmented models versus unaugmented models. We found that all four imputation-augmented models performed similarly to unaugmented models. While imputation did not improve performance, it did provide evidence for the robustness of our learned models.

  5. Missing data in clinical trials: control-based mean imputation and sensitivity analysis.

    Science.gov (United States)

    Mehrotra, Devan V; Liu, Fang; Permutt, Thomas

    2017-09-01

    In some randomized (drug versus placebo) clinical trials, the estimand of interest is the between-treatment difference in population means of a clinical endpoint that is free from the confounding effects of "rescue" medication (e.g., HbA1c change from baseline at 24 weeks that would be observed without rescue medication regardless of whether or when the assigned treatment was discontinued). In such settings, a missing data problem arises if some patients prematurely discontinue from the trial or initiate rescue medication while in the trial, the latter necessitating the discarding of post-rescue data. We caution that the commonly used mixed-effects model repeated measures analysis with the embedded missing at random assumption can deliver an exaggerated estimate of the aforementioned estimand of interest. This happens, in part, due to implicit imputation of an overly optimistic mean for "dropouts" (i.e., patients with missing endpoint data of interest) in the drug arm. We propose an alternative approach in which the missing mean for the drug arm dropouts is explicitly replaced with either the estimated mean of the entire endpoint distribution under placebo (primary analysis) or a sequence of increasingly more conservative means within a tipping point framework (sensitivity analysis); patient-level imputation is not required. A supplemental "dropout = failure" analysis is considered in which a common poor outcome is imputed for all dropouts followed by a between-treatment comparison using quantile regression. All analyses address the same estimand and can adjust for baseline covariates. Three examples and simulation results are used to support our recommendations. Copyright © 2017 John Wiley & Sons, Ltd.

  6. Handling missing values in the MDS-UPDRS.

    Science.gov (United States)

    Goetz, Christopher G; Luo, Sheng; Wang, Lu; Tilley, Barbara C; LaPelle, Nancy R; Stebbins, Glenn T

    2015-10-01

    This study was undertaken to define the number of missing values permissible to render valid total scores for each Movement Disorder Society Unified Parkinson's Disease Rating Scale (MDS-UPDRS) part. To handle missing values, imputation strategies serve as guidelines to reject an incomplete rating or create a surrogate score. We tested a rigorous, scale-specific, data-based approach to handling missing values for the MDS-UPDRS. From two large MDS-UPDRS datasets, we sequentially deleted item scores, either consistently (same items) or randomly (different items) across all subjects. Lin's Concordance Correlation Coefficient (CCC) compared scores calculated without missing values with prorated scores based on sequentially increasing missing values. The maximal number of missing values retaining a CCC greater than 0.95 determined the threshold for rendering a valid prorated score. A second confirmatory sample was selected from the MDS-UPDRS international translation program. To provide valid part scores applicable across all Hoehn and Yahr (H&Y) stages when the same items are consistently missing, one missing item from Part I, one from Part II, three from Part III, but none from Part IV can be allowed. To provide valid part scores applicable across all H&Y stages when random item entries are missing, one missing item from Part I, two from Part II, seven from Part III, but none from Part IV can be allowed. All cutoff values were confirmed in the validation sample. These analyses are useful for constructing valid surrogate part scores for MDS-UPDRS when missing items fall within the identified threshold and give scientific justification for rejecting partially completed ratings that fall below the threshold. © 2015 International Parkinson and Movement Disorder Society.

  7. A suggested approach for imputation of missing dietary data for young children in daycare.

    Science.gov (United States)

    Stevens, June; Ou, Fang-Shu; Truesdale, Kimberly P; Zeng, Donglin; Vaughn, Amber E; Pratt, Charlotte; Ward, Dianne S

    2015-01-01

    Parent-reported 24-h diet recalls are an accepted method of estimating intake in young children. However, many children eat while at childcare making accurate proxy reports by parents difficult. The goal of this study was to demonstrate a method to impute missing weekday lunch and daytime snack nutrient data for daycare children and to explore the concurrent predictive and criterion validity of the method. Data were from children aged 2-5 years in the My Parenting SOS project (n=308; 870 24-h diet recalls). Mixed models were used to simultaneously predict breakfast, dinner, and evening snacks (B+D+ES); lunch; and daytime snacks for all children after adjusting for age, sex, and body mass index (BMI). From these models, we imputed the missing weekday daycare lunches by interpolation using the mean lunch to B+D+ES [L/(B+D+ES)] ratio among non-daycare children on weekdays and the L/(B+D+ES) ratio for all children on weekends. Daytime snack data were used to impute snacks. The reported mean (± standard deviation) weekday intake was lower for daycare children [725 (±324) kcal] compared to non-daycare children [1,048 (±463) kcal]. Weekend intake for all children was 1,173 (±427) kcal. After imputation, weekday caloric intake for daycare children was 1,230 (±409) kcal. Daily intakes that included imputed data were associated with age and sex but not with BMI. This work indicates that imputation is a promising method for improving the precision of daily nutrient data from young children.

  8. Effects of Different Missing Data Imputation Techniques on the Performance of Undiagnosed Diabetes Risk Prediction Models in a Mixed-Ancestry Population of South Africa.

    Directory of Open Access Journals (Sweden)

    Katya L Masconi

    Full Text Available Imputation techniques used to handle missing data are based on the principle of replacement. It is widely advocated that multiple imputation is superior to other imputation methods, however studies have suggested that simple methods for filling missing data can be just as accurate as complex methods. The objective of this study was to implement a number of simple and more complex imputation methods, and assess the effect of these techniques on the performance of undiagnosed diabetes risk prediction models during external validation.Data from the Cape Town Bellville-South cohort served as the basis for this study. Imputation methods and models were identified via recent systematic reviews. Models' discrimination was assessed and compared using C-statistic and non-parametric methods, before and after recalibration through simple intercept adjustment.The study sample consisted of 1256 individuals, of whom 173 were excluded due to previously diagnosed diabetes. Of the final 1083 individuals, 329 (30.4% had missing data. Family history had the highest proportion of missing data (25%. Imputation of the outcome, undiagnosed diabetes, was highest in stochastic regression imputation (163 individuals. Overall, deletion resulted in the lowest model performances while simple imputation yielded the highest C-statistic for the Cambridge Diabetes Risk model, Kuwaiti Risk model, Omani Diabetes Risk model and Rotterdam Predictive model. Multiple imputation only yielded the highest C-statistic for the Rotterdam Predictive model, which were matched by simpler imputation methods.Deletion was confirmed as a poor technique for handling missing data. However, despite the emphasized disadvantages of simpler imputation methods, this study showed that implementing these methods results in similar predictive utility for undiagnosed diabetes when compared to multiple imputation.

  9. Autoregressive-model-based missing value estimation for DNA microarray time series data.

    Science.gov (United States)

    Choong, Miew Keen; Charbit, Maurice; Yan, Hong

    2009-01-01

    Missing value estimation is important in DNA microarray data analysis. A number of algorithms have been developed to solve this problem, but they have several limitations. Most existing algorithms are not able to deal with the situation where a particular time point (column) of the data is missing entirely. In this paper, we present an autoregressive-model-based missing value estimation method (ARLSimpute) that takes into account the dynamic property of microarray temporal data and the local similarity structures in the data. ARLSimpute is especially effective for the situation where a particular time point contains many missing values or where the entire time point is missing. Experiment results suggest that our proposed algorithm is an accurate missing value estimator in comparison with other imputation methods on simulated as well as real microarray time series datasets.

  10. Missing Value Imputation Improves Mortality Risk Prediction Following Cardiac Surgery: An Investigation of an Australian Patient Cohort.

    Science.gov (United States)

    Karim, Md Nazmul; Reid, Christopher M; Tran, Lavinia; Cochrane, Andrew; Billah, Baki

    2017-03-01

    The aim of this study was to evaluate the impact of missing values on the prediction performance of the model predicting 30-day mortality following cardiac surgery as an example. Information from 83,309 eligible patients, who underwent cardiac surgery, recorded in the Australia and New Zealand Society of Cardiac and Thoracic Surgeons (ANZSCTS) database registry between 2001 and 2014, was used. An existing 30-day mortality risk prediction model developed from ANZSCTS database was re-estimated using the complete cases (CC) analysis and using multiple imputation (MI) analysis. Agreement between the risks generated by the CC and MI analysis approaches was assessed by the Bland-Altman method. Performances of the two models were compared. One or more missing predictor variables were present in 15.8% of the patients in the dataset. The Bland-Altman plot demonstrated significant disagreement between the risk scores (prisk of mortality. Compared to CC analysis, MI analysis resulted in an average of 8.5% decrease in standard error, a measure of uncertainty. The MI model provided better prediction of mortality risk (observed: 2.69%; MI: 2.63% versus CC: 2.37%, Pvalues improved the 30-day mortality risk prediction following cardiac surgery. Copyright © 2016 Australian and New Zealand Society of Cardiac and Thoracic Surgeons (ANZSCTS) and the Cardiac Society of Australia and New Zealand (CSANZ). Published by Elsevier B.V. All rights reserved.

  11. A suggested approach for imputation of missing dietary data for young children in daycare

    Directory of Open Access Journals (Sweden)

    June Stevens

    2015-12-01

    Full Text Available Background: Parent-reported 24-h diet recalls are an accepted method of estimating intake in young children. However, many children eat while at childcare making accurate proxy reports by parents difficult. Objective: The goal of this study was to demonstrate a method to impute missing weekday lunch and daytime snack nutrient data for daycare children and to explore the concurrent predictive and criterion validity of the method. Design: Data were from children aged 2-5 years in the My Parenting SOS project (n=308; 870 24-h diet recalls. Mixed models were used to simultaneously predict breakfast, dinner, and evening snacks (B+D+ES; lunch; and daytime snacks for all children after adjusting for age, sex, and body mass index (BMI. From these models, we imputed the missing weekday daycare lunches by interpolation using the mean lunch to B+D+ES [L/(B+D+ES] ratio among non-daycare children on weekdays and the L/(B+D+ES ratio for all children on weekends. Daytime snack data were used to impute snacks. Results: The reported mean (± standard deviation weekday intake was lower for daycare children [725 (±324 kcal] compared to non-daycare children [1,048 (±463 kcal]. Weekend intake for all children was 1,173 (±427 kcal. After imputation, weekday caloric intake for daycare children was 1,230 (±409 kcal. Daily intakes that included imputed data were associated with age and sex but not with BMI. Conclusion: This work indicates that imputation is a promising method for improving the precision of daily nutrient data from young children.

  12. Effect of imputing markers from a low-density chip on the reliability of genomic breeding values in Holstein populations

    DEFF Research Database (Denmark)

    Dassonneville, R; Brøndum, Rasmus Froberg; Druet, T

    2011-01-01

    The purpose of this study was to investigate the imputation error and loss of reliability of direct genomic values (DGV) or genomically enhanced breeding values (GEBV) when using genotypes imputed from a 3,000-marker single nucleotide polymorphism (SNP) panel to a 50,000-marker SNP panel. Data...... of missing markers and prediction of breeding values were performed using 2 different reference populations in each country: either a national reference population or a combined EuroGenomics reference population. Validation for accuracy of imputation and genomic prediction was done based on national test...... with a national reference data set gave an absolute loss of 0.05 in mean reliability of GEBV in the French study, whereas a loss of 0.03 was obtained for reliability of DGV in the Nordic study. When genotypes were imputed using the EuroGenomics reference, a loss of 0.02 in mean reliability of GEBV was detected...

  13. Multiple Improvements of Multiple Imputation Likelihood Ratio Tests

    OpenAIRE

    Chan, Kin Wai; Meng, Xiao-Li

    2017-01-01

    Multiple imputation (MI) inference handles missing data by first properly imputing the missing values $m$ times, and then combining the $m$ analysis results from applying a complete-data procedure to each of the completed datasets. However, the existing method for combining likelihood ratio tests has multiple defects: (i) the combined test statistic can be negative in practice when the reference null distribution is a standard $F$ distribution; (ii) it is not invariant to re-parametrization; ...

  14. Missing-Values Adjustment for Mixed-Type Data

    Directory of Open Access Journals (Sweden)

    Agostino Tarsitano

    2011-01-01

    data sets involving a mixture of numeric, ordinal, binary, and categorical variables. Our technique is a variation of the popular nearest neighbor hot deck imputation (NNHDI where “nearest” is defined in terms of a global distance obtained as a convex combination of the distance matrices computed for the various types of variables. We address the problem of proper weighting of the partial distance matrices in order to reflect their significance, reliability, and statistical adequacy. Performance of several weighting schemes is compared under a variety of settings in coordination with imputation of the least power mean of the Box-Cox transformation applied to the values of the donors. Through analysis of simulated and actual data sets, we will show that this approach is appropriate. Our main contribution has been to demonstrate that mixed data may optimally be combined to allow the accurate reconstruction of missing values in the target variable even when some data are absent from the other fields of the record.

  15. Accuracy of hemoglobin A1c imputation using fasting plasma glucose in diabetes research using electronic health records data

    Directory of Open Access Journals (Sweden)

    Stanley Xu

    2014-05-01

    Full Text Available In studies that use electronic health record data, imputation of important data elements such as Glycated hemoglobin (A1c has become common. However, few studies have systematically examined the validity of various imputation strategies for missing A1c values. We derived a complete dataset using an incident diabetes population that has no missing values in A1c, fasting and random plasma glucose (FPG and RPG, age, and gender. We then created missing A1c values under two assumptions: missing completely at random (MCAR and missing at random (MAR. We then imputed A1c values, compared the imputed values to the true A1c values, and used these data to assess the impact of A1c on initiation of antihyperglycemic therapy. Under MCAR, imputation of A1c based on FPG 1 estimated a continuous A1c within ± 1.88% of the true A1c 68.3% of the time; 2 estimated a categorical A1c within ± one category from the true A1c about 50% of the time. Including RPG in imputation slightly improved the precision but did not improve the accuracy. Under MAR, including gender and age in addition to FPG improved the accuracy of imputed continuous A1c but not categorical A1c. Moreover, imputation of up to 33% of missing A1c values did not change the accuracy and precision and did not alter the impact of A1c on initiation of antihyperglycemic therapy. When using A1c values as a predictor variable, a simple imputation algorithm based only on age, sex, and fasting plasma glucose gave acceptable results.

  16. Combining Fourier and lagged k-nearest neighbor imputation for biomedical time series data.

    Science.gov (United States)

    Rahman, Shah Atiqur; Huang, Yuxiao; Claassen, Jan; Heintzman, Nathaniel; Kleinberg, Samantha

    2015-12-01

    Most clinical and biomedical data contain missing values. A patient's record may be split across multiple institutions, devices may fail, and sensors may not be worn at all times. While these missing values are often ignored, this can lead to bias and error when the data are mined. Further, the data are not simply missing at random. Instead the measurement of a variable such as blood glucose may depend on its prior values as well as that of other variables. These dependencies exist across time as well, but current methods have yet to incorporate these temporal relationships as well as multiple types of missingness. To address this, we propose an imputation method (FLk-NN) that incorporates time lagged correlations both within and across variables by combining two imputation methods, based on an extension to k-NN and the Fourier transform. This enables imputation of missing values even when all data at a time point is missing and when there are different types of missingness both within and across variables. In comparison to other approaches on three biological datasets (simulated and actual Type 1 diabetes datasets, and multi-modality neurological ICU monitoring) the proposed method has the highest imputation accuracy. This was true for up to half the data being missing and when consecutive missing values are a significant fraction of the overall time series length. Copyright © 2015 Elsevier Inc. All rights reserved.

  17. Multiple imputation in the presence of non-normal data.

    Science.gov (United States)

    Lee, Katherine J; Carlin, John B

    2017-02-20

    Multiple imputation (MI) is becoming increasingly popular for handling missing data. Standard approaches for MI assume normality for continuous variables (conditionally on the other variables in the imputation model). However, it is unclear how to impute non-normally distributed continuous variables. Using simulation and a case study, we compared various transformations applied prior to imputation, including a novel non-parametric transformation, to imputation on the raw scale and using predictive mean matching (PMM) when imputing non-normal data. We generated data from a range of non-normal distributions, and set 50% to missing completely at random or missing at random. We then imputed missing values on the raw scale, following a zero-skewness log, Box-Cox or non-parametric transformation and using PMM with both type 1 and 2 matching. We compared inferences regarding the marginal mean of the incomplete variable and the association with a fully observed outcome. We also compared results from these approaches in the analysis of depression and anxiety symptoms in parents of very preterm compared with term-born infants. The results provide novel empirical evidence that the decision regarding how to impute a non-normal variable should be based on the nature of the relationship between the variables of interest. If the relationship is linear in the untransformed scale, transformation can introduce bias irrespective of the transformation used. However, if the relationship is non-linear, it may be important to transform the variable to accurately capture this relationship. A useful alternative is to impute the variable using PMM with type 1 matching. Copyright © 2016 John Wiley & Sons, Ltd. Copyright © 2016 John Wiley & Sons, Ltd.

  18. Recurrent Neural Networks for Multivariate Time Series with Missing Values.

    Science.gov (United States)

    Che, Zhengping; Purushotham, Sanjay; Cho, Kyunghyun; Sontag, David; Liu, Yan

    2018-04-17

    Multivariate time series data in practical applications, such as health care, geoscience, and biology, are characterized by a variety of missing values. In time series prediction and other related tasks, it has been noted that missing values and their missing patterns are often correlated with the target labels, a.k.a., informative missingness. There is very limited work on exploiting the missing patterns for effective imputation and improving prediction performance. In this paper, we develop novel deep learning models, namely GRU-D, as one of the early attempts. GRU-D is based on Gated Recurrent Unit (GRU), a state-of-the-art recurrent neural network. It takes two representations of missing patterns, i.e., masking and time interval, and effectively incorporates them into a deep model architecture so that it not only captures the long-term temporal dependencies in time series, but also utilizes the missing patterns to achieve better prediction results. Experiments of time series classification tasks on real-world clinical datasets (MIMIC-III, PhysioNet) and synthetic datasets demonstrate that our models achieve state-of-the-art performance and provide useful insights for better understanding and utilization of missing values in time series analysis.

  19. Handling missing data for the identification of charged particles in a multilayer detector: A comparison between different imputation methods

    Energy Technology Data Exchange (ETDEWEB)

    Riggi, S., E-mail: sriggi@oact.inaf.it [INAF - Osservatorio Astrofisico di Catania (Italy); Riggi, D. [Keras Strategy - Milano (Italy); Riggi, F. [Dipartimento di Fisica e Astronomia - Università di Catania (Italy); INFN, Sezione di Catania (Italy)

    2015-04-21

    Identification of charged particles in a multilayer detector by the energy loss technique may also be achieved by the use of a neural network. The performance of the network becomes worse when a large fraction of information is missing, for instance due to detector inefficiencies. Algorithms which provide a way to impute missing information have been developed over the past years. Among the various approaches, we focused on normal mixtures’ models in comparison with standard mean imputation and multiple imputation methods. Further, to account for the intrinsic asymmetry of the energy loss data, we considered skew-normal mixture models and provided a closed form implementation in the Expectation-Maximization (EM) algorithm framework to handle missing patterns. The method has been applied to a test case where the energy losses of pions, kaons and protons in a six-layers’ Silicon detector are considered as input neurons to a neural network. Results are given in terms of reconstruction efficiency and purity of the various species in different momentum bins.

  20. BRITS: Bidirectional Recurrent Imputation for Time Series

    OpenAIRE

    Cao, Wei; Wang, Dong; Li, Jian; Zhou, Hao; Li, Lei; Li, Yitan

    2018-01-01

    Time series are widely used as signals in many classification/regression tasks. It is ubiquitous that time series contains many missing values. Given multiple correlated time series data, how to fill in missing values and to predict their class labels? Existing imputation methods often impose strong assumptions of the underlying data generating process, such as linear dynamics in the state space. In this paper, we propose BRITS, a novel method based on recurrent neural networks for missing va...

  1. Time Series Imputation via L1 Norm-Based Singular Spectrum Analysis

    Science.gov (United States)

    Kalantari, Mahdi; Yarmohammadi, Masoud; Hassani, Hossein; Silva, Emmanuel Sirimal

    Missing values in time series data is a well-known and important problem which many researchers have studied extensively in various fields. In this paper, a new nonparametric approach for missing value imputation in time series is proposed. The main novelty of this research is applying the L1 norm-based version of Singular Spectrum Analysis (SSA), namely L1-SSA which is robust against outliers. The performance of the new imputation method has been compared with many other established methods. The comparison is done by applying them to various real and simulated time series. The obtained results confirm that the SSA-based methods, especially L1-SSA can provide better imputation in comparison to other methods.

  2. Comparison of different Methods for Univariate Time Series Imputation in R

    OpenAIRE

    Moritz, Steffen; Sardá, Alexis; Bartz-Beielstein, Thomas; Zaefferer, Martin; Stork, Jörg

    2015-01-01

    Missing values in datasets are a well-known problem and there are quite a lot of R packages offering imputation functions. But while imputation in general is well covered within R, it is hard to find functions for imputation of univariate time series. The problem is, most standard imputation techniques can not be applied directly. Most algorithms rely on inter-attribute correlations, while univariate time series imputation needs to employ time dependencies. This paper provides an overview of ...

  3. Handling missing data in cluster randomized trials: A demonstration of multiple imputation with PAN through SAS

    Directory of Open Access Journals (Sweden)

    Jiangxiu Zhou

    2014-09-01

    Full Text Available The purpose of this study is to demonstrate a way of dealing with missing data in clustered randomized trials by doing multiple imputation (MI with the PAN package in R through SAS. The procedure for doing MI with PAN through SAS is demonstrated in detail in order for researchers to be able to use this procedure with their own data. An illustration of the technique with empirical data was also included. In this illustration thePAN results were compared with pairwise deletion and three types of MI: (1 Normal Model (NM-MI ignoring the cluster structure; (2 NM-MI with dummy-coded cluster variables (fixed cluster structure; and (3 a hybrid NM-MI which imputes half the time ignoring the cluster structure, and the other half including the dummy-coded cluster variables. The empirical analysis showed that using PAN and the other strategies produced comparable parameter estimates. However, the dummy-coded MI overestimated the intraclass correlation, whereas MI ignoring the cluster structure and the hybrid MI underestimated the intraclass correlation. When compared with PAN, the p-value and standard error for the treatment effect were higher with dummy-coded MI, and lower with MI ignoring the clusterstructure, the hybrid MI approach, and pairwise deletion. Previous studies have shown that NM-MI is not appropriate for handling missing data in clustered randomized trials. This approach, in addition to the pairwise deletion approach, leads to a biased intraclass correlation and faultystatistical conclusions. Imputation in clustered randomized trials should be performed with PAN. We have demonstrated an easy way for using PAN through SAS.

  4. Data imputation analysis for Cosmic Rays time series

    Science.gov (United States)

    Fernandes, R. C.; Lucio, P. S.; Fernandez, J. H.

    2017-05-01

    The occurrence of missing data concerning Galactic Cosmic Rays time series (GCR) is inevitable since loss of data is due to mechanical and human failure or technical problems and different periods of operation of GCR stations. The aim of this study was to perform multiple dataset imputation in order to depict the observational dataset. The study has used the monthly time series of GCR Climax (CLMX) and Roma (ROME) from 1960 to 2004 to simulate scenarios of 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80% and 90% of missing data compared to observed ROME series, with 50 replicates. Then, the CLMX station as a proxy for allocation of these scenarios was used. Three different methods for monthly dataset imputation were selected: AMÉLIA II - runs the bootstrap Expectation Maximization algorithm, MICE - runs an algorithm via Multivariate Imputation by Chained Equations and MTSDI - an Expectation Maximization algorithm-based method for imputation of missing values in multivariate normal time series. The synthetic time series compared with the observed ROME series has also been evaluated using several skill measures as such as RMSE, NRMSE, Agreement Index, R, R2, F-test and t-test. The results showed that for CLMX and ROME, the R2 and R statistics were equal to 0.98 and 0.96, respectively. It was observed that increases in the number of gaps generate loss of quality of the time series. Data imputation was more efficient with MTSDI method, with negligible errors and best skill coefficients. The results suggest a limit of about 60% of missing data for imputation, for monthly averages, no more than this. It is noteworthy that CLMX, ROME and KIEL stations present no missing data in the target period. This methodology allowed reconstructing 43 time series.

  5. A comprehensive evaluation of popular proteomics software workflows for label-free proteome quantification and imputation.

    Science.gov (United States)

    Välikangas, Tommi; Suomi, Tomi; Elo, Laura L

    2017-05-31

    Label-free mass spectrometry (MS) has developed into an important tool applied in various fields of biological and life sciences. Several software exist to process the raw MS data into quantified protein abundances, including open source and commercial solutions. Each software includes a set of unique algorithms for different tasks of the MS data processing workflow. While many of these algorithms have been compared separately, a thorough and systematic evaluation of their overall performance is missing. Moreover, systematic information is lacking about the amount of missing values produced by the different proteomics software and the capabilities of different data imputation methods to account for them.In this study, we evaluated the performance of five popular quantitative label-free proteomics software workflows using four different spike-in data sets. Our extensive testing included the number of proteins quantified and the number of missing values produced by each workflow, the accuracy of detecting differential expression and logarithmic fold change and the effect of different imputation and filtering methods on the differential expression results. We found that the Progenesis software performed consistently well in the differential expression analysis and produced few missing values. The missing values produced by the other software decreased their performance, but this difference could be mitigated using proper data filtering or imputation methods. Among the imputation methods, we found that the local least squares (lls) regression imputation consistently increased the performance of the software in the differential expression analysis, and a combination of both data filtering and local least squares imputation increased performance the most in the tested data sets. © The Author 2017. Published by Oxford University Press.

  6. The use of the bootstrap in the analysis of case-control studies with missing data

    DEFF Research Database (Denmark)

    Siersma, Volkert Dirk; Johansen, Christoffer

    2004-01-01

    nonparametric bootstrap, bootstrap confidence intervals, missing values, multiple imputation, matched case-control study......nonparametric bootstrap, bootstrap confidence intervals, missing values, multiple imputation, matched case-control study...

  7. Lazy collaborative filtering for data sets with missing values.

    Science.gov (United States)

    Ren, Yongli; Li, Gang; Zhang, Jun; Zhou, Wanlei

    2013-12-01

    As one of the biggest challenges in research on recommender systems, the data sparsity issue is mainly caused by the fact that users tend to rate a small proportion of items from the huge number of available items. This issue becomes even more problematic for the neighborhood-based collaborative filtering (CF) methods, as there are even lower numbers of ratings available in the neighborhood of the query item. In this paper, we aim to address the data sparsity issue in the context of neighborhood-based CF. For a given query (user, item), a set of key ratings is first identified by taking the historical information of both the user and the item into account. Then, an auto-adaptive imputation (AutAI) method is proposed to impute the missing values in the set of key ratings. We present a theoretical analysis to show that the proposed imputation method effectively improves the performance of the conventional neighborhood-based CF methods. The experimental results show that our new method of CF with AutAI outperforms six existing recommendation methods in terms of accuracy.

  8. Dealing with gene expression missing data.

    Science.gov (United States)

    Brás, L P; Menezes, J C

    2006-05-01

    Compared evaluation of different methods is presented for estimating missing values in microarray data: weighted K-nearest neighbours imputation (KNNimpute), regression-based methods such as local least squares imputation (LLSimpute) and partial least squares imputation (PLSimpute) and Bayesian principal component analysis (BPCA). The influence in prediction accuracy of some factors, such as methods' parameters, type of data relationships used in the estimation process (i.e. row-wise, column-wise or both), missing rate and pattern and type of experiment [time series (TS), non-time series (NTS) or mixed (MIX) experiments] is elucidated. Improvements based on the iterative use of data (iterative LLS and PLS imputation--ILLSimpute and IPLSimpute), the need to perform initial imputations (modified PLS and Helland PLS imputation--MPLSimpute and HPLSimpute) and the type of relationships employed (KNNarray, LLSarray, HPLSarray and alternating PLS--APLSimpute) are proposed. Overall, it is shown that data set properties (type of experiment, missing rate and pattern) affect the data similarity structure, therefore influencing the methods' performance. LLSimpute and ILLSimpute are preferable in the presence of data with a stronger similarity structure (TS and MIX experiments), whereas PLS-based methods (MPLSimpute, IPLSimpute and APLSimpute) are preferable when estimating NTS missing data.

  9. Scalable Data Quality for Big Data: The Pythia Framework for Handling Missing Values.

    Science.gov (United States)

    Cahsai, Atoshum; Anagnostopoulos, Christos; Triantafillou, Peter

    2015-09-01

    Solving the missing-value (MV) problem with small estimation errors in large-scale data environments is a notoriously resource-demanding task. The most widely used MV imputation approaches are computationally expensive because they explicitly depend on the volume and the dimension of the data. Moreover, as datasets and their user community continuously grow, the problem can only be exacerbated. In an attempt to deal with such a problem, in our previous work, we introduced a novel framework coined Pythia, which employs a number of distributed data nodes (cohorts), each of which contains a partition of the original dataset. To perform MV imputation, the Pythia, based on specific machine and statistical learning structures (signatures), selects the most appropriate subset of cohorts to perform locally a missing value substitution algorithm (MVA). This selection relies on the principle that particular subset of cohorts maintains the most relevant partition of the dataset. In addition to this, as Pythia uses only part of the dataset for imputation and accesses different cohorts in parallel, it improves efficiency, scalability, and accuracy compared to a single machine (coined Godzilla), which uses the entire massive dataset to compute imputation requests. Although this article is an extension of our previous work, we particularly investigate the robustness of the Pythia framework and show that the Pythia is independent from any MVA and signature construction algorithms. In order to facilitate our research, we considered two well-known MVAs (namely K-nearest neighbor and expectation-maximization imputation algorithms), as well as two machine and neural computational learning signature construction algorithms based on adaptive vector quantization and competitive learning. We prove comprehensive experiments to assess the performance of the Pythia against Godzilla and showcase the benefits stemmed from this framework.

  10. Estimation of missing values in solar radiation data using piecewise interpolation methods: Case study at Penang city

    International Nuclear Information System (INIS)

    Zainudin, Mohd Lutfi; Saaban, Azizan; Bakar, Mohd Nazari Abu

    2015-01-01

    The solar radiation values have been composed by automatic weather station using the device that namely pyranometer. The device is functions to records all the radiation values that have been dispersed, and these data are very useful for it experimental works and solar device’s development. In addition, for modeling and designing on solar radiation system application is needed for complete data observation. Unfortunately, lack for obtained the complete solar radiation data frequently occur due to several technical problems, which mainly contributed by monitoring device. Into encountering this matter, estimation missing values in an effort to substitute absent values with imputed data. This paper aimed to evaluate several piecewise interpolation techniques likes linear, splines, cubic, and nearest neighbor into dealing missing values in hourly solar radiation data. Then, proposed an extendable work into investigating the potential used of cubic Bezier technique and cubic Said-ball method as estimator tools. As result, methods for cubic Bezier and Said-ball perform the best compare to another piecewise imputation technique

  11. Estimation of missing values in solar radiation data using piecewise interpolation methods: Case study at Penang city

    Science.gov (United States)

    Zainudin, Mohd Lutfi; Saaban, Azizan; Bakar, Mohd Nazari Abu

    2015-12-01

    The solar radiation values have been composed by automatic weather station using the device that namely pyranometer. The device is functions to records all the radiation values that have been dispersed, and these data are very useful for it experimental works and solar device's development. In addition, for modeling and designing on solar radiation system application is needed for complete data observation. Unfortunately, lack for obtained the complete solar radiation data frequently occur due to several technical problems, which mainly contributed by monitoring device. Into encountering this matter, estimation missing values in an effort to substitute absent values with imputed data. This paper aimed to evaluate several piecewise interpolation techniques likes linear, splines, cubic, and nearest neighbor into dealing missing values in hourly solar radiation data. Then, proposed an extendable work into investigating the potential used of cubic Bezier technique and cubic Said-ball method as estimator tools. As result, methods for cubic Bezier and Said-ball perform the best compare to another piecewise imputation technique.

  12. Estimation of missing values in solar radiation data using piecewise interpolation methods: Case study at Penang city

    Energy Technology Data Exchange (ETDEWEB)

    Zainudin, Mohd Lutfi, E-mail: mdlutfi07@gmail.com [School of Quantitative Sciences, UUMCAS, Universiti Utara Malaysia, 06010 Sintok, Kedah (Malaysia); Institut Matematik Kejuruteraan (IMK), Universiti Malaysia Perlis, 02600 Arau, Perlis (Malaysia); Saaban, Azizan, E-mail: azizan.s@uum.edu.my [School of Quantitative Sciences, UUMCAS, Universiti Utara Malaysia, 06010 Sintok, Kedah (Malaysia); Bakar, Mohd Nazari Abu, E-mail: mohdnazari@perlis.uitm.edu.my [Faculty of Applied Science, Universiti Teknologi Mara, 02600 Arau, Perlis (Malaysia)

    2015-12-11

    The solar radiation values have been composed by automatic weather station using the device that namely pyranometer. The device is functions to records all the radiation values that have been dispersed, and these data are very useful for it experimental works and solar device’s development. In addition, for modeling and designing on solar radiation system application is needed for complete data observation. Unfortunately, lack for obtained the complete solar radiation data frequently occur due to several technical problems, which mainly contributed by monitoring device. Into encountering this matter, estimation missing values in an effort to substitute absent values with imputed data. This paper aimed to evaluate several piecewise interpolation techniques likes linear, splines, cubic, and nearest neighbor into dealing missing values in hourly solar radiation data. Then, proposed an extendable work into investigating the potential used of cubic Bezier technique and cubic Said-ball method as estimator tools. As result, methods for cubic Bezier and Said-ball perform the best compare to another piecewise imputation technique.

  13. Which DTW Method Applied to Marine Univariate Time Series Imputation

    OpenAIRE

    Phan , Thi-Thu-Hong; Caillault , Émilie; Lefebvre , Alain; Bigand , André

    2017-01-01

    International audience; Missing data are ubiquitous in any domains of applied sciences. Processing datasets containing missing values can lead to a loss of efficiency and unreliable results, especially for large missing sub-sequence(s). Therefore, the aim of this paper is to build a framework for filling missing values in univariate time series and to perform a comparison of different similarity metrics used for the imputation task. This allows to suggest the most suitable methods for the imp...

  14. On Matrix Sampling and Imputation of Context Questionnaires with Implications for the Generation of Plausible Values in Large-Scale Assessments

    Science.gov (United States)

    Kaplan, David; Su, Dan

    2016-01-01

    This article presents findings on the consequences of matrix sampling of context questionnaires for the generation of plausible values in large-scale assessments. Three studies are conducted. Study 1 uses data from PISA 2012 to examine several different forms of missing data imputation within the chained equations framework: predictive mean…

  15. Is missing geographic positioning system data in accelerometry studies a problem, and is imputation the solution?

    DEFF Research Database (Denmark)

    Meseck, Kristin; Jankowska, Marta M; Schipperijn, Jasper

    2016-01-01

    The main purpose of the present study was to assess the impact of global positioning system (GPS) signal lapse on physical activity analyses, discover any existing associations between missing GPS data and environmental and demographics attributes, and to determine whether imputation is an accurate...

  16. Saturated linkage map construction in Rubus idaeus using genotyping by sequencing and genome-independent imputation

    Directory of Open Access Journals (Sweden)

    Ward Judson A

    2013-01-01

    Full Text Available Abstract Background Rapid development of highly saturated genetic maps aids molecular breeding, which can accelerate gain per breeding cycle in woody perennial plants such as Rubus idaeus (red raspberry. Recently, robust genotyping methods based on high-throughput sequencing were developed, which provide high marker density, but result in some genotype errors and a large number of missing genotype values. Imputation can reduce the number of missing values and can correct genotyping errors, but current methods of imputation require a reference genome and thus are not an option for most species. Results Genotyping by Sequencing (GBS was used to produce highly saturated maps for a R. idaeus pseudo-testcross progeny. While low coverage and high variance in sequencing resulted in a large number of missing values for some individuals, a novel method of imputation based on maximum likelihood marker ordering from initial marker segregation overcame the challenge of missing values, and made map construction computationally tractable. The two resulting parental maps contained 4521 and 2391 molecular markers spanning 462.7 and 376.6 cM respectively over seven linkage groups. Detection of precise genomic regions with segregation distortion was possible because of map saturation. Microsatellites (SSRs linked these results to published maps for cross-validation and map comparison. Conclusions GBS together with genome-independent imputation provides a rapid method for genetic map construction in any pseudo-testcross progeny. Our method of imputation estimates the correct genotype call of missing values and corrects genotyping errors that lead to inflated map size and reduced precision in marker placement. Comparison of SSRs to published R. idaeus maps showed that the linkage maps constructed with GBS and our method of imputation were robust, and marker positioning reliable. The high marker density allowed identification of genomic regions with segregation

  17. Go green! Reusing brain monitoring data containing missing values: a feasibility study with traumatic brain injury patients.

    Science.gov (United States)

    Feng, Mengling; Loy, Liang Yu; Zhang, Feng; Zhang, Zhuo; Vellaisamy, Kuralmani; Chin, Pei Loon; Guan, Cuntai; Shen, Liang; King, Nicolas K K; Lee, Kah Keow; Ang, Beng Ti

    2012-01-01

    Despite the wealth of information carried, periodic brain monitoring data are often incomplete with a significant amount of missing values. Incomplete monitoring data are usually discarded to ensure purity of data. However, this approach leads to the loss of statistical power, potentially biased study and a great waste of resources. Thus, we propose to reuse incomplete brain monitoring data by imputing the missing values - a green solution! To support our proposal, we have conducted a feasibility study to investigate the reusability of incomplete brain monitoring data based on the estimated imputation error. Seventy-seven patients, who underwent invasive monitoring of ICP, MAP, PbtO (2) and brain temperature (BTemp) for more than 24 consecutive hours and were connected to a bedside computerized system, were selected for the study. In the feasibility study, the imputation error is experimentally assessed with simulated missing values and 17 state-of-the-art predictive methods. A framework is developed for neuroclinicians and neurosurgeons to determine the best re-usage strategy and predictive methods based on our feasibility study. The monitoring data of MAP and BTemp are more reliable for reuse than ICP and PbtO (2); and, for ICP and PbtO (2) data, a more cautious re-usage strategy should be employed. We also observe that, for the scenarios tested, the lazy learning method, K-STAR, and the tree-based method, M5P, are consistently 2 of the best among the 17 predictive methods investigated in this study.

  18. Preserving Logical Relations while Estimating Missing Values

    Directory of Open Access Journals (Sweden)

    Ton de Waal

    2017-09-01

    Full Text Available Item-nonresponse is often treated by means of an imputation technique. In some cases, the data have to satisfy certain constraints, which are frequently referred to as edits. An example of an edit for numerical data is that the profit of an enterprise equals its turnover minus its costs. Edits place restrictions on the imputations that are allowed and hence complicate the imputation process. In this paper we explore an adjustment approach. This adjustment approach consists of three steps. In the first step, the imputation step, nearest neighbour hot deck imputation is used to find several pre-imputed values. In a second step, the adjustment step, these pre-imputed values are adjusted so the resulting records satisfy all edits. In a third step, the best donor record is selected. The adjusted record corresponding to that donor record is the final imputed record. In principle, a potential donor that is not the closest to the record to be imputed may still give the best results after adjustment. In this paper we therefore focus on the number of potential donor records that are considered in the imputation step.

  19. Multiple imputation to account for missing data in a survey: estimating the prevalence of osteoporosis.

    Science.gov (United States)

    Kmetic, Andrew; Joseph, Lawrence; Berger, Claudie; Tenenhouse, Alan

    2002-07-01

    Nonresponse bias is a concern in any epidemiologic survey in which a subset of selected individuals declines to participate. We reviewed multiple imputation, a widely applicable and easy to implement Bayesian methodology to adjust for nonresponse bias. To illustrate the method, we used data from the Canadian Multicentre Osteoporosis Study, a large cohort study of 9423 randomly selected Canadians, designed in part to estimate the prevalence of osteoporosis. Although subjects were randomly selected, only 42% of individuals who were contacted agreed to participate fully in the study. The study design included a brief questionnaire for those invitees who declined further participation in order to collect information on the major risk factors for osteoporosis. These risk factors (which included age, sex, previous fractures, family history of osteoporosis, and current smoking status) were then used to estimate the missing osteoporosis status for nonparticipants using multiple imputation. Both ignorable and nonignorable imputation models are considered. Our results suggest that selection bias in the study is of concern, but only slightly, in very elderly (age 80+ years), both women and men. Epidemiologists should consider using multiple imputation more often than is current practice.

  20. DTW-APPROACH FOR UNCORRELATED MULTIVARIATE TIME SERIES IMPUTATION

    OpenAIRE

    Phan , Thi-Thu-Hong; Poisson Caillault , Emilie; Bigand , André; Lefebvre , Alain

    2017-01-01

    International audience; Missing data are inevitable in almost domains of applied sciences. Data analysis with missing values can lead to a loss of efficiency and unreliable results, especially for large missing sub-sequence(s). Some well-known methods for multivariate time series imputation require high correlations between series or their features. In this paper , we propose an approach based on the shape-behaviour relation in low/un-correlated multivariate time series under an assumption of...

  1. Multilevel Weighted Support Vector Machine for Classification on Healthcare Data with Missing Values.

    Directory of Open Access Journals (Sweden)

    Talayeh Razzaghi

    Full Text Available This work is motivated by the needs of predictive analytics on healthcare data as represented by Electronic Medical Records. Such data is invariably problematic: noisy, with missing entries, with imbalance in classes of interests, leading to serious bias in predictive modeling. Since standard data mining methods often produce poor performance measures, we argue for development of specialized techniques of data-preprocessing and classification. In this paper, we propose a new method to simultaneously classify large datasets and reduce the effects of missing values. It is based on a multilevel framework of the cost-sensitive SVM and the expected maximization imputation method for missing values, which relies on iterated regression analyses. We compare classification results of multilevel SVM-based algorithms on public benchmark datasets with imbalanced classes and missing values as well as real data in health applications, and show that our multilevel SVM-based method produces fast, and more accurate and robust classification results.

  2. LinkImputeR: user-guided genotype calling and imputation for non-model organisms.

    Science.gov (United States)

    Money, Daniel; Migicovsky, Zoë; Gardner, Kyle; Myles, Sean

    2017-07-10

    Genomic studies such as genome-wide association and genomic selection require genome-wide genotype data. All existing technologies used to create these data result in missing genotypes, which are often then inferred using genotype imputation software. However, existing imputation methods most often make use only of genotypes that are successfully inferred after having passed a certain read depth threshold. Because of this, any read information for genotypes that did not pass the threshold, and were thus set to missing, is ignored. Most genomic studies also choose read depth thresholds and quality filters without investigating their effects on the size and quality of the resulting genotype data. Moreover, almost all genotype imputation methods require ordered markers and are therefore of limited utility in non-model organisms. Here we introduce LinkImputeR, a software program that exploits the read count information that is normally ignored, and makes use of all available DNA sequence information for the purposes of genotype calling and imputation. It is specifically designed for non-model organisms since it requires neither ordered markers nor a reference panel of genotypes. Using next-generation DNA sequence (NGS) data from apple, cannabis and grape, we quantify the effect of varying read count and missingness thresholds on the quantity and quality of genotypes generated from LinkImputeR. We demonstrate that LinkImputeR can increase the number of genotype calls by more than an order of magnitude, can improve genotyping accuracy by several percent and can thus improve the power of downstream analyses. Moreover, we show that the effects of quality and read depth filters can differ substantially between data sets and should therefore be investigated on a per-study basis. By exploiting DNA sequence data that is normally ignored during genotype calling and imputation, LinkImputeR can significantly improve both the quantity and quality of genotype data generated from

  3. Trend in BMI z-score among Private Schools’ Students in Delhi using Multiple Imputation for Growth Curve Model

    Directory of Open Access Journals (Sweden)

    Vinay K Gupta

    2016-06-01

    Full Text Available Objective: The aim of the study is to assess the trend in mean BMI z-score among private schools’ students from their anthropometric records when there were missing values in the outcome. Methodology: The anthropometric measurements of student from class 1 to 12 were taken from the records of two private schools in Delhi, India from 2005 to 2010. These records comprise of an unbalanced longitudinal data that is not all the students had measurements recorded at each year. The trend in mean BMI z-score was estimated through growth curve model. Prior to that, missing values of BMI z-score were imputed through multiple imputation using the same model. A complete case analysis was also performed after excluding missing values to compare the results with those obtained from analysis of multiply imputed data. Results: The mean BMI z-score among school student significantly decreased over time in imputed data (β= -0.2030, se=0.0889, p=0.0232 after adjusting age, gender, class and school. Complete case analysis also shows a decrease in mean BMI z-score though it was not statistically significant (β= -0.2861, se=0.0987, p=0.065. Conclusions: The estimates obtained from multiple imputation analysis were better than those of complete data after excluding missing values in terms of lower standard errors. We showed that anthropometric measurements from schools records can be used to monitor the weight status of children and adolescents and multiple imputation using growth curve model can be useful while analyzing such data

  4. Using beta coefficients to impute missing correlations in meta-analysis research: Reasons for caution.

    Science.gov (United States)

    Roth, Philip L; Le, Huy; Oh, In-Sue; Van Iddekinge, Chad H; Bobko, Philip

    2018-06-01

    Meta-analysis has become a well-accepted method for synthesizing empirical research about a given phenomenon. Many meta-analyses focus on synthesizing correlations across primary studies, but some primary studies do not report correlations. Peterson and Brown (2005) suggested that researchers could use standardized regression weights (i.e., beta coefficients) to impute missing correlations. Indeed, their beta estimation procedures (BEPs) have been used in meta-analyses in a wide variety of fields. In this study, the authors evaluated the accuracy of BEPs in meta-analysis. We first examined how use of BEPs might affect results from a published meta-analysis. We then developed a series of Monte Carlo simulations that systematically compared the use of existing correlations (that were not missing) to data sets that incorporated BEPs (that impute missing correlations from corresponding beta coefficients). These simulations estimated ρ̄ (mean population correlation) and SDρ (true standard deviation) across a variety of meta-analytic conditions. Results from both the existing meta-analysis and the Monte Carlo simulations revealed that BEPs were associated with potentially large biases when estimating ρ̄ and even larger biases when estimating SDρ. Using only existing correlations often substantially outperformed use of BEPs and virtually never performed worse than BEPs. Overall, the authors urge a return to the standard practice of using only existing correlations in meta-analysis. (PsycINFO Database Record (c) 2018 APA, all rights reserved).

  5. Data Editing and Imputation in Business Surveys Using “R”

    Directory of Open Access Journals (Sweden)

    Elena Romascanu

    2014-06-01

    Full Text Available Purpose – Missing data are a recurring problem that can cause bias or lead to inefficient analyses. The objective of this paper is a direct comparison between the two statistical software features R and SPSS, in order to take full advantage of the existing automated methods for data editing process and imputation in business surveys (with a proper design of consistency rules as a partial alternative to the manual editing of data. Approach – The comparison of different methods on editing surveys data, in R with the ‘editrules’ and ‘survey’ packages because inside those, exist commonly used transformations in official statistics, as visualization of missing values pattern using ‘Amelia’ and ‘VIM’ packages, imputation approaches for longitudinal data using ‘VIMGUI’ and a comparison of another statistical software performance on the same features, such as SPSS. Findings – Data on business statistics received by NIS’s (National Institute of Statistics are not ready to be used for direct analysis due to in-record inconsistencies, errors and missing values from the collected data sets. The appropriate automatic methods from R packages, offers the ability to set the erroneous fields in edit-violating records, to verify the results after the imputation of missing values providing for users a flexible, less time consuming approach and easy to perform automation in R than in SPSS Macros syntax situations, when macros are very handy.

  6. Multiple Imputation of Predictor Variables Using Generalized Additive Models

    NARCIS (Netherlands)

    de Jong, Roel; van Buuren, Stef; Spiess, Martin

    2016-01-01

    The sensitivity of multiple imputation methods to deviations from their distributional assumptions is investigated using simulations, where the parameters of scientific interest are the coefficients of a linear regression model, and values in predictor variables are missing at random. The

  7. Evaluating Imputation Algorithms for Low-Depth Genotyping-By-Sequencing (GBS Data.

    Directory of Open Access Journals (Sweden)

    Ariel W Chan

    Full Text Available Well-powered genomic studies require genome-wide marker coverage across many individuals. For non-model species with few genomic resources, high-throughput sequencing (HTS methods, such as Genotyping-By-Sequencing (GBS, offer an inexpensive alternative to array-based genotyping. Although affordable, datasets derived from HTS methods suffer from sequencing error, alignment errors, and missing data, all of which introduce noise and uncertainty to variant discovery and genotype calling. Under such circumstances, meaningful analysis of the data is difficult. Our primary interest lies in the issue of how one can accurately infer or impute missing genotypes in HTS-derived datasets. Many of the existing genotype imputation algorithms and software packages were primarily developed by and optimized for the human genetics community, a field where a complete and accurate reference genome has been constructed and SNP arrays have, in large part, been the common genotyping platform. We set out to answer two questions: 1 can we use existing imputation methods developed by the human genetics community to impute missing genotypes in datasets derived from non-human species and 2 are these methods, which were developed and optimized to impute ascertained variants, amenable for imputation of missing genotypes at HTS-derived variants? We selected Beagle v.4, a widely used algorithm within the human genetics community with reportedly high accuracy, to serve as our imputation contender. We performed a series of cross-validation experiments, using GBS data collected from the species Manihot esculenta by the Next Generation (NEXTGEN Cassava Breeding Project. NEXTGEN currently imputes missing genotypes in their datasets using a LASSO-penalized, linear regression method (denoted 'glmnet'. We selected glmnet to serve as a benchmark imputation method for this reason. We obtained estimates of imputation accuracy by masking a subset of observed genotypes, imputing, and

  8. Evaluating Imputation Algorithms for Low-Depth Genotyping-By-Sequencing (GBS) Data.

    Science.gov (United States)

    Chan, Ariel W; Hamblin, Martha T; Jannink, Jean-Luc

    2016-01-01

    Well-powered genomic studies require genome-wide marker coverage across many individuals. For non-model species with few genomic resources, high-throughput sequencing (HTS) methods, such as Genotyping-By-Sequencing (GBS), offer an inexpensive alternative to array-based genotyping. Although affordable, datasets derived from HTS methods suffer from sequencing error, alignment errors, and missing data, all of which introduce noise and uncertainty to variant discovery and genotype calling. Under such circumstances, meaningful analysis of the data is difficult. Our primary interest lies in the issue of how one can accurately infer or impute missing genotypes in HTS-derived datasets. Many of the existing genotype imputation algorithms and software packages were primarily developed by and optimized for the human genetics community, a field where a complete and accurate reference genome has been constructed and SNP arrays have, in large part, been the common genotyping platform. We set out to answer two questions: 1) can we use existing imputation methods developed by the human genetics community to impute missing genotypes in datasets derived from non-human species and 2) are these methods, which were developed and optimized to impute ascertained variants, amenable for imputation of missing genotypes at HTS-derived variants? We selected Beagle v.4, a widely used algorithm within the human genetics community with reportedly high accuracy, to serve as our imputation contender. We performed a series of cross-validation experiments, using GBS data collected from the species Manihot esculenta by the Next Generation (NEXTGEN) Cassava Breeding Project. NEXTGEN currently imputes missing genotypes in their datasets using a LASSO-penalized, linear regression method (denoted 'glmnet'). We selected glmnet to serve as a benchmark imputation method for this reason. We obtained estimates of imputation accuracy by masking a subset of observed genotypes, imputing, and calculating the

  9. Evaluation of missing value methods for predicting ambient BTEX concentrations in two neighbouring cities in Southwestern Ontario Canada

    Science.gov (United States)

    Miller, Lindsay; Xu, Xiaohong; Wheeler, Amanda; Zhang, Tianchu; Hamadani, Mariam; Ejaz, Unam

    2018-05-01

    High density air monitoring campaigns provide spatial patterns of pollutant concentrations which are integral in exposure assessment. Such analysis can assist with the determination of links between air quality and health outcomes, however, problems due to missing data can threaten to compromise these studies. This research evaluates four methods; mean value imputation, inverse distance weighting (IDW), inter-species ratios, and regression, to address missing spatial concentration data ranging from one missing data point up to 50% missing data. BTEX (benzene, toluene, ethylbenzene, and xylenes) concentrations were measured in Windsor and Sarnia, Ontario in the fall of 2005. Concentrations and inter-species ratios were generally similar between the two cities. Benzene (B) was observed to be higher in Sarnia, whereas toluene (T) and the T/B ratios were higher in Windsor. Using these urban, industrialized cities as case studies, this research demonstrates that using inter-species ratios or regression of the data for which there is complete information, along with one measured concentration (i.e. benzene) to predict for missing concentrations (i.e. TEX) results in good agreement between predicted and measured values. In both cities, the general trend remains that best agreement is observed for the leave-one-out scenario, followed by 10% and 25% missing, and the least agreement for the 50% missing cases. In the absence of any known concentrations IDW can provide reasonable agreement between observed and estimated concentrations for the BTEX species, and was superior over mean value imputation which was not able to preserve the spatial trend. The proposed methods can be used to fill in missing data, while preserving the general characteristics and rank order of the data which are sufficient for epidemiologic studies.

  10. A suggested approach for imputation of missing dietary data for young children in daycare

    OpenAIRE

    Stevens, June; Ou, Fang-Shu; Truesdale, Kimberly P.; Zeng, Donglin; Vaughn, Amber E.; Pratt, Charlotte; Ward, Dianne S.

    2015-01-01

    Background: Parent-reported 24-h diet recalls are an accepted method of estimating intake in young children. However, many children eat while at childcare making accurate proxy reports by parents difficult.Objective: The goal of this study was to demonstrate a method to impute missing weekday lunch and daytime snack nutrient data for daycare children and to explore the concurrent predictive and criterion validity of the method.Design: Data were from children aged 2-5 years in the My Parenting...

  11. 3D-MICE: integration of cross-sectional and longitudinal imputation for multi-analyte longitudinal clinical data.

    Science.gov (United States)

    Luo, Yuan; Szolovits, Peter; Dighe, Anand S; Baron, Jason M

    2018-06-01

    A key challenge in clinical data mining is that most clinical datasets contain missing data. Since many commonly used machine learning algorithms require complete datasets (no missing data), clinical analytic approaches often entail an imputation procedure to "fill in" missing data. However, although most clinical datasets contain a temporal component, most commonly used imputation methods do not adequately accommodate longitudinal time-based data. We sought to develop a new imputation algorithm, 3-dimensional multiple imputation with chained equations (3D-MICE), that can perform accurate imputation of missing clinical time series data. We extracted clinical laboratory test results for 13 commonly measured analytes (clinical laboratory tests). We imputed missing test results for the 13 analytes using 3 imputation methods: multiple imputation with chained equations (MICE), Gaussian process (GP), and 3D-MICE. 3D-MICE utilizes both MICE and GP imputation to integrate cross-sectional and longitudinal information. To evaluate imputation method performance, we randomly masked selected test results and imputed these masked results alongside results missing from our original data. We compared predicted results to measured results for masked data points. 3D-MICE performed significantly better than MICE and GP-based imputation in a composite of all 13 analytes, predicting missing results with a normalized root-mean-square error of 0.342, compared to 0.373 for MICE alone and 0.358 for GP alone. 3D-MICE offers a novel and practical approach to imputing clinical laboratory time series data. 3D-MICE may provide an additional tool for use as a foundation in clinical predictive analytics and intelligent clinical decision support.

  12. Imputation of missing genotypes within LD-blocks relying on the basic coalescent and beyond: consideration of population growth and structure.

    Science.gov (United States)

    Kabisch, Maria; Hamann, Ute; Lorenzo Bermejo, Justo

    2017-10-17

    Genotypes not directly measured in genetic studies are often imputed to improve statistical power and to increase mapping resolution. The accuracy of standard imputation techniques strongly depends on the similarity of linkage disequilibrium (LD) patterns in the study and reference populations. Here we develop a novel approach for genotype imputation in low-recombination regions that relies on the coalescent and permits to explicitly account for population demographic factors. To test the new method, study and reference haplotypes were simulated and gene trees were inferred under the basic coalescent and also considering population growth and structure. The reference haplotypes that first coalesced with study haplotypes were used as templates for genotype imputation. Computer simulations were complemented with the analysis of real data. Genotype concordance rates were used to compare the accuracies of coalescent-based and standard (IMPUTE2) imputation. Simulations revealed that, in LD-blocks, imputation accuracy relying on the basic coalescent was higher and less variable than with IMPUTE2. Explicit consideration of population growth and structure, even if present, did not practically improve accuracy. The advantage of coalescent-based over standard imputation increased with the minor allele frequency and it decreased with population stratification. Results based on real data indicated that, even in low-recombination regions, further research is needed to incorporate recombination in coalescence inference, in particular for studies with genetically diverse and admixed individuals. To exploit the full potential of coalescent-based methods for the imputation of missing genotypes in genetic studies, further methodological research is needed to reduce computer time, to take into account recombination, and to implement these methods in user-friendly computer programs. Here we provide reproducible code which takes advantage of publicly available software to facilitate

  13. The Ability of Different Imputation Methods to Preserve the Significant Genes and Pathways in Cancer

    Directory of Open Access Journals (Sweden)

    Rosa Aghdam

    2017-12-01

    Full Text Available Deciphering important genes and pathways from incomplete gene expression data could facilitate a better understanding of cancer. Different imputation methods can be applied to estimate the missing values. In our study, we evaluated various imputation methods for their performance in preserving significant genes and pathways. In the first step, 5% genes are considered in random for two types of ignorable and non-ignorable missingness mechanisms with various missing rates. Next, 10 well-known imputation methods were applied to the complete datasets. The significance analysis of microarrays (SAM method was applied to detect the significant genes in rectal and lung cancers to showcase the utility of imputation approaches in preserving significant genes. To determine the impact of different imputation methods on the identification of important genes, the chi-squared test was used to compare the proportions of overlaps between significant genes detected from original data and those detected from the imputed datasets. Additionally, the significant genes are tested for their enrichment in important pathways, using the ConsensusPathDB. Our results showed that almost all the significant genes and pathways of the original dataset can be detected in all imputed datasets, indicating that there is no significant difference in the performance of various imputation methods tested. The source code and selected datasets are available on http://profiles.bs.ipm.ir/softwares/imputation_methods/.

  14. Relying on Your Own Best Judgment: Imputing Values to Missing Information in Decision Making.

    Science.gov (United States)

    Johnson, Richard D.; And Others

    Processes involved in making estimates of the value of missing information that could help in a decision making process were studied. Hypothetical purchases of ground beef were selected for the study as such purchases have the desirable property of quantifying both the price and quality. A total of 150 students at the University of Iowa rated the…

  15. Limitations in Using Multiple Imputation to Harmonize Individual Participant Data for Meta-Analysis.

    Science.gov (United States)

    Siddique, Juned; de Chavez, Peter J; Howe, George; Cruden, Gracelyn; Brown, C Hendricks

    2018-02-01

    Individual participant data (IPD) meta-analysis is a meta-analysis in which the individual-level data for each study are obtained and used for synthesis. A common challenge in IPD meta-analysis is when variables of interest are measured differently in different studies. The term harmonization has been coined to describe the procedure of placing variables on the same scale in order to permit pooling of data from a large number of studies. Using data from an IPD meta-analysis of 19 adolescent depression trials, we describe a multiple imputation approach for harmonizing 10 depression measures across the 19 trials by treating those depression measures that were not used in a study as missing data. We then apply diagnostics to address the fit of our imputation model. Even after reducing the scale of our application, we were still unable to produce accurate imputations of the missing values. We describe those features of the data that made it difficult to harmonize the depression measures and provide some guidelines for using multiple imputation for harmonization in IPD meta-analysis.

  16. The Ability of Different Imputation Methods to Preserve the Significant Genes and Pathways in Cancer.

    Science.gov (United States)

    Aghdam, Rosa; Baghfalaki, Taban; Khosravi, Pegah; Saberi Ansari, Elnaz

    2017-12-01

    Deciphering important genes and pathways from incomplete gene expression data could facilitate a better understanding of cancer. Different imputation methods can be applied to estimate the missing values. In our study, we evaluated various imputation methods for their performance in preserving significant genes and pathways. In the first step, 5% genes are considered in random for two types of ignorable and non-ignorable missingness mechanisms with various missing rates. Next, 10 well-known imputation methods were applied to the complete datasets. The significance analysis of microarrays (SAM) method was applied to detect the significant genes in rectal and lung cancers to showcase the utility of imputation approaches in preserving significant genes. To determine the impact of different imputation methods on the identification of important genes, the chi-squared test was used to compare the proportions of overlaps between significant genes detected from original data and those detected from the imputed datasets. Additionally, the significant genes are tested for their enrichment in important pathways, using the ConsensusPathDB. Our results showed that almost all the significant genes and pathways of the original dataset can be detected in all imputed datasets, indicating that there is no significant difference in the performance of various imputation methods tested. The source code and selected datasets are available on http://profiles.bs.ipm.ir/softwares/imputation_methods/. Copyright © 2017. Production and hosting by Elsevier B.V.

  17. Sample-Based Extreme Learning Machine with Missing Data

    Directory of Open Access Journals (Sweden)

    Hang Gao

    2015-01-01

    Full Text Available Extreme learning machine (ELM has been extensively studied in machine learning community during the last few decades due to its high efficiency and the unification of classification, regression, and so forth. Though bearing such merits, existing ELM algorithms cannot efficiently handle the issue of missing data, which is relatively common in practical applications. The problem of missing data is commonly handled by imputation (i.e., replacing missing values with substituted values according to available information. However, imputation methods are not always effective. In this paper, we propose a sample-based learning framework to address this issue. Based on this framework, we develop two sample-based ELM algorithms for classification and regression, respectively. Comprehensive experiments have been conducted in synthetic data sets, UCI benchmark data sets, and a real world fingerprint image data set. As indicated, without introducing extra computational complexity, the proposed algorithms do more accurate and stable learning than other state-of-the-art ones, especially in the case of higher missing ratio.

  18. Incomplete Data in Smart Grid: Treatment of Values in Electric Vehicle Charging Data

    Energy Technology Data Exchange (ETDEWEB)

    Majipour, Mostafa; Chu, Peter; Gadh, Rajit; Pota, Hemanshu R.

    2014-11-03

    In this paper, five imputation methods namely Constant (zero), Mean, Median, Maximum Likelihood, and Multiple Imputation methods have been applied to compensate for missing values in Electric Vehicle (EV) charging data. The outcome of each of these methods have been used as the input to a prediction algorithm to forecast the EV load in the next 24 hours at each individual outlet. The data is real world data at the outlet level from the UCLA campus parking lots. Given the sparsity of the data, both Median and Constant (=zero) imputations improved the prediction results. Since in most missing value cases in our database, all values of that instance are missing, the multivariate imputation methods did not improve the results significantly compared to univariate approaches.

  19. Detection of Outliers and Imputing of Missing Values for Water Quality UV-VIS Absorbance Time Series

    Directory of Open Access Journals (Sweden)

    Leonardo Plazas-Nossa

    2017-01-01

    Full Text Available Context: The UV-Vis absorbance collection using online optical captors for water quality detection may yield outliers and/or missing values. Therefore, data pre-processing is a necessary pre-requisite to monitoring data processing. Thus, the aim of this study is to propose a method that detects and removes outliers as well as fills gaps in time series. Method: Outliers are detected using Winsorising procedure and the application of the Discrete Fourier Transform (DFT and the Inverse of Fast Fourier Transform (IFFT to complete the time series. Together, these tools were used to analyse a case study comprising three sites in Colombia ((i Bogotá D.C. Salitre-WWTP (Waste Water Treatment Plant, influent; (ii Bogotá D.C. Gibraltar Pumping Station (GPS; and, (iii Itagüí, San Fernando-WWTP, influent (Medellín metropolitan area analysed via UV-Vis (Ultraviolet and Visible spectra. Results: Outlier detection with the proposed method obtained promising results when window parameter values are small and self-similar, despite that the three time series exhibited different sizes and behaviours. The DFT allowed to process different length gaps having missing values. To assess the validity of the proposed method, continuous subsets (a section of the absorbance time series without outlier or missing values were removed from the original time series obtaining an average 12% error rate in the three testing time series. Conclusions: The application of the DFT and the IFFT, using the 10% most important harmonics of useful values, can be useful for its later use in different applications, specifically for time series of water quality and quantity in urban sewer systems. One potential application would be the analysis of dry weather interesting to rain events, a feat achieved by detecting values that correspond to unusual behaviour in a time series. Additionally, the result hints at the potential of the method in correcting other hydrologic time series.

  20. Treating pre-instrumental data as "missing" data: using a tree-ring-based paleoclimate record and imputations to reconstruct streamflow in the Missouri River Basin

    Science.gov (United States)

    Ho, M. W.; Lall, U.; Cook, E. R.

    2015-12-01

    Advances in paleoclimatology in the past few decades have provided opportunities to expand the temporal perspective of the hydrological and climatological variability across the world. The North American region is particularly fortunate in this respect where a relatively dense network of high resolution paleoclimate proxy records have been assembled. One such network is the annually-resolved Living Blended Drought Atlas (LBDA): a paleoclimate reconstruction of the Palmer Drought Severity Index (PDSI) that covers North America on a 0.5° × 0.5° grid based on tree-ring chronologies. However, the use of the LBDA to assess North American streamflow variability requires a model by which streamflow may be reconstructed. Paleoclimate reconstructions have typically used models that first seek to quantify the relationship between the paleoclimate variable and the environmental variable of interest before extrapolating the relationship back in time. In contrast, the pre-instrumental streamflow is here considered as "missing" data. A method of imputing the "missing" streamflow data, prior to the instrumental record, is applied through multiple imputation using chained equations for streamflow in the Missouri River Basin. In this method, the distribution of the instrumental streamflow and LBDA is used to estimate sets of plausible values for the "missing" streamflow data resulting in a ~600 year-long streamflow reconstruction. Past research into external climate forcings, oceanic-atmospheric variability and its teleconnections, and assessments of rare multi-centennial instrumental records demonstrate that large temporal oscillations in hydrological conditions are unlikely to be captured in most instrumental records. The reconstruction of multi-centennial records of streamflow will enable comprehensive assessments of current and future water resource infrastructure and operations under the existing scope of natural climate variability.

  1. Bootstrap inference when using multiple imputation.

    Science.gov (United States)

    Schomaker, Michael; Heumann, Christian

    2018-04-16

    Many modern estimators require bootstrapping to calculate confidence intervals because either no analytic standard error is available or the distribution of the parameter of interest is nonsymmetric. It remains however unclear how to obtain valid bootstrap inference when dealing with multiple imputation to address missing data. We present 4 methods that are intuitively appealing, easy to implement, and combine bootstrap estimation with multiple imputation. We show that 3 of the 4 approaches yield valid inference, but that the performance of the methods varies with respect to the number of imputed data sets and the extent of missingness. Simulation studies reveal the behavior of our approaches in finite samples. A topical analysis from HIV treatment research, which determines the optimal timing of antiretroviral treatment initiation in young children, demonstrates the practical implications of the 4 methods in a sophisticated and realistic setting. This analysis suffers from missing data and uses the g-formula for inference, a method for which no standard errors are available. Copyright © 2018 John Wiley & Sons, Ltd.

  2. Differential network analysis with multiply imputed lipidomic data.

    Directory of Open Access Journals (Sweden)

    Maiju Kujala

    Full Text Available The importance of lipids for cell function and health has been widely recognized, e.g., a disorder in the lipid composition of cells has been related to atherosclerosis caused cardiovascular disease (CVD. Lipidomics analyses are characterized by large yet not a huge number of mutually correlated variables measured and their associations to outcomes are potentially of a complex nature. Differential network analysis provides a formal statistical method capable of inferential analysis to examine differences in network structures of the lipids under two biological conditions. It also guides us to identify potential relationships requiring further biological investigation. We provide a recipe to conduct permutation test on association scores resulted from partial least square regression with multiple imputed lipidomic data from the LUdwigshafen RIsk and Cardiovascular Health (LURIC study, particularly paying attention to the left-censored missing values typical for a wide range of data sets in life sciences. Left-censored missing values are low-level concentrations that are known to exist somewhere between zero and a lower limit of quantification. To make full use of the LURIC data with the missing values, we utilize state of the art multiple imputation techniques and propose solutions to the challenges that incomplete data sets bring to differential network analysis. The customized network analysis helps us to understand the complexities of the underlying biological processes by identifying lipids and lipid classes that interact with each other, and by recognizing the most important differentially expressed lipids between two subgroups of coronary artery disease (CAD patients, the patients that had a fatal CVD event and the ones who remained stable during two year follow-up.

  3. Comparison of three boosting methods in parent-offspring trios for genotype imputation using simulation study

    Directory of Open Access Journals (Sweden)

    Abbas Mikhchi

    2016-01-01

    Full Text Available Abstract Background Genotype imputation is an important process of predicting unknown genotypes, which uses reference population with dense genotypes to predict missing genotypes for both human and animal genetic variations at a low cost. Machine learning methods specially boosting methods have been used in genetic studies to explore the underlying genetic profile of disease and build models capable of predicting missing values of a marker. Methods In this study strategies and factors affecting the imputation accuracy of parent-offspring trios compared from lower-density SNP panels (5 K to high density (10 K SNP panel using three different Boosting methods namely TotalBoost (TB, LogitBoost (LB and AdaBoost (AB. The methods employed using simulated data to impute the un-typed SNPs in parent-offspring trios. Four different datasets of G1 (100 trios with 5 k SNPs, G2 (100 trios with 10 k SNPs, G3 (500 trios with 5 k SNPs, and G4 (500 trio with 10 k SNPs were simulated. In four datasets all parents were genotyped completely, and offspring genotyped with a lower density panel. Results Comparison of the three methods for imputation showed that the LB outperformed AB and TB for imputation accuracy. The time of computation were different between methods. The AB was the fastest algorithm. The higher SNP densities resulted the increase of the accuracy of imputation. Larger trios (i.e. 500 was better for performance of LB and TB. Conclusions The conclusion is that the three methods do well in terms of imputation accuracy also the dense chip is recommended for imputation of parent-offspring trios.

  4. Analyzing the changing gender wage gap based on multiply imputed right censored wages

    OpenAIRE

    Gartner, Hermann; Rässler, Susanne

    2005-01-01

    "In order to analyze the gender wage gap with the German IAB-employment register we have to solve the problem of censored wages at the upper limit of the social security system. We treat this problem as a missing data problem. We regard the missingness mechanism as not missing at random (NMAR, according to Little and Rubin, 1987, 2002) as well as missing by design. The censored wages are multiply imputed by draws of a random variable from a truncated distribution. The multiple imputation is b...

  5. Factors associated with low birth weight in Nepal using multiple imputation

    Directory of Open Access Journals (Sweden)

    Usha Singh

    2017-02-01

    Full Text Available Abstract Background Survey data from low income countries on birth weight usually pose a persistent problem. The studies conducted on birth weight have acknowledged missing data on birth weight, but they are not included in the analysis. Furthermore, other missing data presented on determinants of birth weight are not addressed. Thus, this study tries to identify determinants that are associated with low birth weight (LBW using multiple imputation to handle missing data on birth weight and its determinants. Methods The child dataset from Nepal Demographic and Health Survey (NDHS, 2011 was utilized in this study. A total of 5,240 children were born between 2006 and 2011, out of which 87% had at least one measured variable missing and 21% had no recorded birth weight. All the analyses were carried out in R version 3.1.3. Transform-then impute method was applied to check for interaction between explanatory variables and imputed missing data. Survey package was applied to each imputed dataset to account for survey design and sampling method. Survey logistic regression was applied to identify the determinants associated with LBW. Results The prevalence of LBW was 15.4% after imputation. Women with the highest autonomy on their own health compared to those with health decisions involving husband or others (adjusted odds ratio (OR 1.87, 95% confidence interval (95% CI = 1.31, 2.67, and husband and women together (adjusted OR 1.57, 95% CI = 1.05, 2.35 were less likely to give birth to LBW infants. Mothers using highly polluting cooking fuels (adjusted OR 1.49, 95% CI = 1.03, 2.22 were more likely to give birth to LBW infants than mothers using non-polluting cooking fuels. Conclusion The findings of this study suggested that obtaining the prevalence of LBW from only the sample of measured birth weight and ignoring missing data results in underestimation.

  6. A regressive methodology for estimating missing data in rainfall daily time series

    Science.gov (United States)

    Barca, E.; Passarella, G.

    2009-04-01

    The "presence" of gaps in environmental data time series represents a very common, but extremely critical problem, since it can produce biased results (Rubin, 1976). Missing data plagues almost all surveys. The problem is how to deal with missing data once it has been deemed impossible to recover the actual missing values. Apart from the amount of missing data, another issue which plays an important role in the choice of any recovery approach is the evaluation of "missingness" mechanisms. When data missing is conditioned by some other variable observed in the data set (Schafer, 1997) the mechanism is called MAR (Missing at Random). Otherwise, when the missingness mechanism depends on the actual value of the missing data, it is called NCAR (Not Missing at Random). This last is the most difficult condition to model. In the last decade interest arose in the estimation of missing data by using regression (single imputation). More recently multiple imputation has become also available, which returns a distribution of estimated values (Scheffer, 2002). In this paper an automatic methodology for estimating missing data is presented. In practice, given a gauging station affected by missing data (target station), the methodology checks the randomness of the missing data and classifies the "similarity" between the target station and the other gauging stations spread over the study area. Among different methods useful for defining the similarity degree, whose effectiveness strongly depends on the data distribution, the Spearman correlation coefficient was chosen. Once defined the similarity matrix, a suitable, nonparametric, univariate, and regressive method was applied in order to estimate missing data in the target station: the Theil method (Theil, 1950). Even though the methodology revealed to be rather reliable an improvement of the missing data estimation can be achieved by a generalization. A first possible improvement consists in extending the univariate technique to

  7. Multiple imputation of rainfall missing data in the Iberian Mediterranean context

    Science.gov (United States)

    Miró, Juan Javier; Caselles, Vicente; Estrela, María José

    2017-11-01

    Given the increasing need for complete rainfall data networks, in recent years have been proposed diverse methods for filling gaps in observed precipitation series, progressively more advanced that traditional approaches to overcome the problem. The present study has consisted in validate 10 methods (6 linear, 2 non-linear and 2 hybrid) that allow multiple imputation, i.e., fill at the same time missing data of multiple incomplete series in a dense network of neighboring stations. These were applied for daily and monthly rainfall in two sectors in the Júcar River Basin Authority (east Iberian Peninsula), which is characterized by a high spatial irregularity and difficulty of rainfall estimation. A classification of precipitation according to their genetic origin was applied as pre-processing, and a quantile-mapping adjusting as post-processing technique. The results showed in general a better performance for the non-linear and hybrid methods, highlighting that the non-linear PCA (NLPCA) method outperforms considerably the Self Organizing Maps (SOM) method within non-linear approaches. On linear methods, the Regularized Expectation Maximization method (RegEM) was the best, but far from NLPCA. Applying EOF filtering as post-processing of NLPCA (hybrid approach) yielded the best results.

  8. Avoid Filling Swiss Cheese with Whipped Cream; Imputation Techniques and Evaluation Procedures for Cross-Country Time Series

    OpenAIRE

    Michael Weber; Michaela Denk

    2011-01-01

    International organizations collect data from national authorities to create multivariate cross-sectional time series for their analyses. As data from countries with not yet well-established statistical systems may be incomplete, the bridging of data gaps is a crucial challenge. This paper investigates data structures and missing data patterns in the cross-sectional time series framework, reviews missing value imputation techniques used for micro data in official statistics, and discusses the...

  9. Assessment of imputation methods using varying ecological information to fill the gaps in a tree functional trait database

    Science.gov (United States)

    Poyatos, Rafael; Sus, Oliver; Vilà-Cabrera, Albert; Vayreda, Jordi; Badiella, Llorenç; Mencuccini, Maurizio; Martínez-Vilalta, Jordi

    2016-04-01

    Plant functional traits are increasingly being used in ecosystem ecology thanks to the growing availability of large ecological databases. However, these databases usually contain a large fraction of missing data because measuring plant functional traits systematically is labour-intensive and because most databases are compilations of datasets with different sampling designs. As a result, within a given database, there is an inevitable variability in the number of traits available for each data entry and/or the species coverage in a given geographical area. The presence of missing data may severely bias trait-based analyses, such as the quantification of trait covariation or trait-environment relationships and may hamper efforts towards trait-based modelling of ecosystem biogeochemical cycles. Several data imputation (i.e. gap-filling) methods have been recently tested on compiled functional trait databases, but the performance of imputation methods applied to a functional trait database with a regular spatial sampling has not been thoroughly studied. Here, we assess the effects of data imputation on five tree functional traits (leaf biomass to sapwood area ratio, foliar nitrogen, maximum height, specific leaf area and wood density) in the Ecological and Forest Inventory of Catalonia, an extensive spatial database (covering 31900 km2). We tested the performance of species mean imputation, single imputation by the k-nearest neighbors algorithm (kNN) and a multiple imputation method, Multivariate Imputation with Chained Equations (MICE) at different levels of missing data (10%, 30%, 50%, and 80%). We also assessed the changes in imputation performance when additional predictors (species identity, climate, forest structure, spatial structure) were added in kNN and MICE imputations. We evaluated the imputed datasets using a battery of indexes describing departure from the complete dataset in trait distribution, in the mean prediction error, in the correlation matrix

  10. Cost reduction for web-based data imputation

    KAUST Repository

    Li, Zhixu

    2014-01-01

    Web-based Data Imputation enables the completion of incomplete data sets by retrieving absent field values from the Web. In particular, complete fields can be used as keywords in imputation queries for absent fields. However, due to the ambiguity of these keywords and the data complexity on the Web, different queries may retrieve different answers to the same absent field value. To decide the most probable right answer to each absent filed value, existing method issues quite a few available imputation queries for each absent value, and then vote on deciding the most probable right answer. As a result, we have to issue a large number of imputation queries for filling all absent values in an incomplete data set, which brings a large overhead. In this paper, we work on reducing the cost of Web-based Data Imputation in two aspects: First, we propose a query execution scheme which can secure the most probable right answer to an absent field value by issuing as few imputation queries as possible. Second, we recognize and prune queries that probably will fail to return any answers a priori. Our extensive experimental evaluation shows that our proposed techniques substantially reduce the cost of Web-based Imputation without hurting its high imputation accuracy. © 2014 Springer International Publishing Switzerland.

  11. Nearest neighbor imputation using spatial-temporal correlations in wireless sensor networks.

    Science.gov (United States)

    Li, YuanYuan; Parker, Lynne E

    2014-01-01

    Missing data is common in Wireless Sensor Networks (WSNs), especially with multi-hop communications. There are many reasons for this phenomenon, such as unstable wireless communications, synchronization issues, and unreliable sensors. Unfortunately, missing data creates a number of problems for WSNs. First, since most sensor nodes in the network are battery-powered, it is too expensive to have the nodes retransmit missing data across the network. Data re-transmission may also cause time delays when detecting abnormal changes in an environment. Furthermore, localized reasoning techniques on sensor nodes (such as machine learning algorithms to classify states of the environment) are generally not robust enough to handle missing data. Since sensor data collected by a WSN is generally correlated in time and space, we illustrate how replacing missing sensor values with spatially and temporally correlated sensor values can significantly improve the network's performance. However, our studies show that it is important to determine which nodes are spatially and temporally correlated with each other. Simple techniques based on Euclidean distance are not sufficient for complex environmental deployments. Thus, we have developed a novel Nearest Neighbor (NN) imputation method that estimates missing data in WSNs by learning spatial and temporal correlations between sensor nodes. To improve the search time, we utilize a k d-tree data structure, which is a non-parametric, data-driven binary search tree. Instead of using traditional mean and variance of each dimension for k d-tree construction, and Euclidean distance for k d-tree search, we use weighted variances and weighted Euclidean distances based on measured percentages of missing data. We have evaluated this approach through experiments on sensor data from a volcano dataset collected by a network of Crossbow motes, as well as experiments using sensor data from a highway traffic monitoring application. Our experimental

  12. Quick, “Imputation-free” meta-analysis with proxy-SNPs

    Directory of Open Access Journals (Sweden)

    Meesters Christian

    2012-09-01

    Full Text Available Abstract Background Meta-analysis (MA is widely used to pool genome-wide association studies (GWASes in order to a increase the power to detect strong or weak genotype effects or b as a result verification method. As a consequence of differing SNP panels among genotyping chips, imputation is the method of choice within GWAS consortia to avoid losing too many SNPs in a MA. YAMAS (Yet Another Meta Analysis Software, however, enables cross-GWAS conclusions prior to finished and polished imputation runs, which eventually are time-consuming. Results Here we present a fast method to avoid forfeiting SNPs present in only a subset of studies, without relying on imputation. This is accomplished by using reference linkage disequilibrium data from 1,000 Genomes/HapMap projects to find proxy-SNPs together with in-phase alleles for SNPs missing in at least one study. MA is conducted by combining association effect estimates of a SNP and those of its proxy-SNPs. Our algorithm is implemented in the MA software YAMAS. Association results from GWAS analysis applications can be used as input files for MA, tremendously speeding up MA compared to the conventional imputation approach. We show that our proxy algorithm is well-powered and yields valuable ad hoc results, possibly providing an incentive for follow-up studies. We propose our method as a quick screening step prior to imputation-based MA, as well as an additional main approach for studies without available reference data matching the ethnicities of study participants. As a proof of principle, we analyzed six dbGaP Type II Diabetes GWAS and found that the proxy algorithm clearly outperforms naïve MA on the p-value level: for 17 out of 23 we observe an improvement on the p-value level by a factor of more than two, and a maximum improvement by a factor of 2127. Conclusions YAMAS is an efficient and fast meta-analysis program which offers various methods, including conventional MA as well as inserting proxy

  13. Assessing and comparison of different machine learning methods in parent-offspring trios for genotype imputation.

    Science.gov (United States)

    Mikhchi, Abbas; Honarvar, Mahmood; Kashan, Nasser Emam Jomeh; Aminafshar, Mehdi

    2016-06-21

    Genotype imputation is an important tool for prediction of unknown genotypes for both unrelated individuals and parent-offspring trios. Several imputation methods are available and can either employ universal machine learning methods, or deploy algorithms dedicated to infer missing genotypes. In this research the performance of eight machine learning methods: Support Vector Machine, K-Nearest Neighbors, Extreme Learning Machine, Radial Basis Function, Random Forest, AdaBoost, LogitBoost, and TotalBoost compared in terms of the imputation accuracy, computation time and the factors affecting imputation accuracy. The methods employed using real and simulated datasets to impute the un-typed SNPs in parent-offspring trios. The tested methods show that imputation of parent-offspring trios can be accurate. The Random Forest and Support Vector Machine were more accurate than the other machine learning methods. The TotalBoost performed slightly worse than the other methods.The running times were different between methods. The ELM was always most fast algorithm. In case of increasing the sample size, the RBF requires long imputation time.The tested methods in this research can be an alternative for imputation of un-typed SNPs in low missing rate of data. However, it is recommended that other machine learning methods to be used for imputation. Copyright © 2016 Elsevier Ltd. All rights reserved.

  14. Detection of Outliers and Imputing of Missing Values for Water Quality UV-VIS Absorbance Time Series

    OpenAIRE

    Plazas-Nossa, Leonardo; Ávila Angulo, Miguel Antonio; Torres, Andrés

    2017-01-01

    Context:The UV-Vis absorbance collection using online optical captors for water quality detection may yield outliers and/or missing values. Therefore, pre-processing to correct these anomalies is required to improve the analysis of monitoring data. The aim of this study is to propose a method to detect outliers as well as to fill-in the gaps in time series. Method:Outliers are detected using Winsorising procedure and the application of the Discrete Fourier Transform (DFT) and the Inverse of F...

  15. Gap-filling a spatially explicit plant trait database: comparing imputation methods and different levels of environmental information

    Science.gov (United States)

    Poyatos, Rafael; Sus, Oliver; Badiella, Llorenç; Mencuccini, Maurizio; Martínez-Vilalta, Jordi

    2018-05-01

    The ubiquity of missing data in plant trait databases may hinder trait-based analyses of ecological patterns and processes. Spatially explicit datasets with information on intraspecific trait variability are rare but offer great promise in improving our understanding of functional biogeography. At the same time, they offer specific challenges in terms of data imputation. Here we compare statistical imputation approaches, using varying levels of environmental information, for five plant traits (leaf biomass to sapwood area ratio, leaf nitrogen content, maximum tree height, leaf mass per area and wood density) in a spatially explicit plant trait dataset of temperate and Mediterranean tree species (Ecological and Forest Inventory of Catalonia, IEFC, dataset for Catalonia, north-east Iberian Peninsula, 31 900 km2). We simulated gaps at different missingness levels (10-80 %) in a complete trait matrix, and we used overall trait means, species means, k nearest neighbours (kNN), ordinary and regression kriging, and multivariate imputation using chained equations (MICE) to impute missing trait values. We assessed these methods in terms of their accuracy and of their ability to preserve trait distributions, multi-trait correlation structure and bivariate trait relationships. The relatively good performance of mean and species mean imputations in terms of accuracy masked a poor representation of trait distributions and multivariate trait structure. Species identity improved MICE imputations for all traits, whereas forest structure and topography improved imputations for some traits. No method performed best consistently for the five studied traits, but, considering all traits and performance metrics, MICE informed by relevant ecological variables gave the best results. However, at higher missingness (> 30 %), species mean imputations and regression kriging tended to outperform MICE for some traits. MICE informed by relevant ecological variables allowed us to fill the gaps in

  16. Estimating range of influence in case of missing spatial data

    DEFF Research Database (Denmark)

    Bihrmann, Kristine; Ersbøll, Annette Kjær

    2015-01-01

    BACKGROUND: The range of influence refers to the average distance between locations at which the observed outcome is no longer correlated. In many studies, missing data occur and a popular tool for handling missing data is multiple imputation. The objective of this study was to investigate how...... the estimated range of influence is affected when 1) the outcome is only observed at some of a given set of locations, and 2) multiple imputation is used to impute the outcome at the non-observed locations. METHODS: The study was based on the simulation of missing outcomes in a complete data set. The range...... of influence was estimated from a logistic regression model with a spatially structured random effect, modelled by a Gaussian field. Results were evaluated by comparing estimates obtained from complete, missing, and imputed data. RESULTS: In most simulation scenarios, the range estimates were consistent...

  17. Multiple imputation and its application

    CERN Document Server

    Carpenter, James

    2013-01-01

    A practical guide to analysing partially observed data. Collecting, analysing and drawing inferences from data is central to research in the medical and social sciences. Unfortunately, it is rarely possible to collect all the intended data. The literature on inference from the resulting incomplete  data is now huge, and continues to grow both as methods are developed for large and complex data structures, and as increasing computer power and suitable software enable researchers to apply these methods. This book focuses on a particular statistical method for analysing and drawing inferences from incomplete data, called Multiple Imputation (MI). MI is attractive because it is both practical and widely applicable. The authors aim is to clarify the issues raised by missing data, describing the rationale for MI, the relationship between the various imputation models and associated algorithms and its application to increasingly complex data structures. Multiple Imputation and its Application: Discusses the issues ...

  18. Missing data methods for dealing with missing items in quality of life questionnaires. A comparison by simulation of personal mean score, full information maximum likelihood, multiple imputation, and hot deck techniques applied to the SF-36 in the French 2003 decennial health survey.

    Science.gov (United States)

    Peyre, Hugo; Leplège, Alain; Coste, Joël

    2011-03-01

    Missing items are common in quality of life (QoL) questionnaires and present a challenge for research in this field. It remains unclear which of the various methods proposed to deal with missing data performs best in this context. We compared personal mean score, full information maximum likelihood, multiple imputation, and hot deck techniques using various realistic simulation scenarios of item missingness in QoL questionnaires constructed within the framework of classical test theory. Samples of 300 and 1,000 subjects were randomly drawn from the 2003 INSEE Decennial Health Survey (of 23,018 subjects representative of the French population and having completed the SF-36) and various patterns of missing data were generated according to three different item non-response rates (3, 6, and 9%) and three types of missing data (Little and Rubin's "missing completely at random," "missing at random," and "missing not at random"). The missing data methods were evaluated in terms of accuracy and precision for the analysis of one descriptive and one association parameter for three different scales of the SF-36. For all item non-response rates and types of missing data, multiple imputation and full information maximum likelihood appeared superior to the personal mean score and especially to hot deck in terms of accuracy and precision; however, the use of personal mean score was associated with insignificant bias (relative bias personal mean score appears nonetheless appropriate for dealing with items missing from completed SF-36 questionnaires in most situations of routine use. These results can reasonably be extended to other questionnaires constructed according to classical test theory.

  19. Synthetic Multiple-Imputation Procedure for Multistage Complex Samples

    Directory of Open Access Journals (Sweden)

    Zhou Hanzhi

    2016-03-01

    Full Text Available Multiple imputation (MI is commonly used when item-level missing data are present. However, MI requires that survey design information be built into the imputation models. For multistage stratified clustered designs, this requires dummy variables to represent strata as well as primary sampling units (PSUs nested within each stratum in the imputation model. Such a modeling strategy is not only operationally burdensome but also inferentially inefficient when there are many strata in the sample design. Complexity only increases when sampling weights need to be modeled. This article develops a generalpurpose analytic strategy for population inference from complex sample designs with item-level missingness. In a simulation study, the proposed procedures demonstrate efficient estimation and good coverage properties. We also consider an application to accommodate missing body mass index (BMI data in the analysis of BMI percentiles using National Health and Nutrition Examination Survey (NHANES III data. We argue that the proposed methods offer an easy-to-implement solution to problems that are not well-handled by current MI techniques. Note that, while the proposed method borrows from the MI framework to develop its inferential methods, it is not designed as an alternative strategy to release multiply imputed datasets for complex sample design data, but rather as an analytic strategy in and of itself.

  20. Breast Cancer and Modifiable Lifestyle Factors in Argentinean Women: Addressing Missing Data in a Case-Control Study

    Science.gov (United States)

    Coquet, Julia Becaria; Tumas, Natalia; Osella, Alberto Ruben; Tanzi, Matteo; Franco, Isabella; Diaz, Maria Del Pilar

    2016-01-01

    A number of studies have evidenced the effect of modifiable lifestyle factors such as diet, breastfeeding and nutritional status on breast cancer risk. However, none have addressed the missing data problem in nutritional epidemiologic research in South America. Missing data is a frequent problem in breast cancer studies and epidemiological settings in general. Estimates of effect obtained from these studies may be biased, if no appropriate method for handling missing data is applied. We performed Multiple Imputation for missing values on covariates in a breast cancer case-control study of Córdoba (Argentina) to optimize risk estimates. Data was obtained from a breast cancer case control study from 2008 to 2015 (318 cases, 526 controls). Complete case analysis and multiple imputation using chained equations were the methods applied to estimate the effects of a Traditional dietary pattern and other recognized factors associated with breast cancer. Physical activity and socioeconomic status were imputed. Logistic regression models were performed. When complete case analysis was performed only 31% of women were considered. Although a positive association of Traditional dietary pattern and breast cancer was observed from both approaches (complete case analysis OR=1.3, 95%CI=1.0-1.7; multiple imputation OR=1.4, 95%CI=1.2-1.7), effects of other covariates, like BMI and breastfeeding, were only identified when multiple imputation was considered. A Traditional dietary pattern, BMI and breastfeeding are associated with the occurrence of breast cancer in this Argentinean population when multiple imputation is appropriately performed. Multiple Imputation is suggested in Latin America’s epidemiologic studies to optimize effect estimates in the future. PMID:27892664

  1. Handling missing Mini-Mental State Examination (MMSE) values: Results from a cross-sectional long-term-care study.

    Science.gov (United States)

    Godin, Judith; Keefe, Janice; Andrew, Melissa K

    2017-04-01

    Missing values are commonly encountered on the Mini Mental State Examination (MMSE), particularly when administered to frail older people. This presents challenges for MMSE scoring in research settings. We sought to describe missingness in MMSEs administered in long-term-care facilities (LTCF) and to compare and contrast approaches to dealing with missing items. As part of the Care and Construction project in Nova Scotia, Canada, LTCF residents completed an MMSE. Different methods of dealing with missing values (e.g., use of raw scores, raw scores/number of items attempted, scale-level multiple imputation [MI], and blended approaches) are compared to item-level MI. The MMSE was administered to 320 residents living in 23 LTCF. The sample was predominately female (73%), and 38% of participants were aged >85 years. At least one item was missing from 122 (38.2%) of the MMSEs. Data were not Missing Completely at Random (MCAR), χ 2 (1110) = 1,351, p < 0.001. Using raw scores for those missing <6 items in combination with scale-level MI resulted in the regression coefficients and standard errors closest to item-level MI. Patterns of missing items often suggest systematic problems, such as trouble with manual dexterity, literacy, or visual impairment. While these observations may be relatively easy to take into account in clinical settings, non-random missingness presents challenges for research and must be considered in statistical analyses. We present suggestions for dealing with missing MMSE data based on the extent of missingness and the goal of analyses. Copyright © 2016 The Authors. Production and hosting by Elsevier B.V. All rights reserved.

  2. Handling Missing Data in Structural Equation Models in R: A Replication Study for Applied Researchers

    Science.gov (United States)

    Wolgast, Anett; Schwinger, Malte; Hahnel, Carolin; Stiensmeier-Pelster, Joachim

    2017-01-01

    Introduction: Multiple imputation (MI) is one of the most highly recommended methods for replacing missing values in research data. The scope of this paper is to demonstrate missing data handling in SEM by analyzing two modified data examples from educational psychology, and to give practical recommendations for applied researchers. Method: We…

  3. Bayesian nonparametric generative models for causal inference with missing at random covariates.

    Science.gov (United States)

    Roy, Jason; Lum, Kirsten J; Zeldow, Bret; Dworkin, Jordan D; Re, Vincent Lo; Daniels, Michael J

    2018-03-26

    We propose a general Bayesian nonparametric (BNP) approach to causal inference in the point treatment setting. The joint distribution of the observed data (outcome, treatment, and confounders) is modeled using an enriched Dirichlet process. The combination of the observed data model and causal assumptions allows us to identify any type of causal effect-differences, ratios, or quantile effects, either marginally or for subpopulations of interest. The proposed BNP model is well-suited for causal inference problems, as it does not require parametric assumptions about the distribution of confounders and naturally leads to a computationally efficient Gibbs sampling algorithm. By flexibly modeling the joint distribution, we are also able to impute (via data augmentation) values for missing covariates within the algorithm under an assumption of ignorable missingness, obviating the need to create separate imputed data sets. This approach for imputing the missing covariates has the additional advantage of guaranteeing congeniality between the imputation model and the analysis model, and because we use a BNP approach, parametric models are avoided for imputation. The performance of the method is assessed using simulation studies. The method is applied to data from a cohort study of human immunodeficiency virus/hepatitis C virus co-infected patients. © 2018, The International Biometric Society.

  4. Gap-filling a spatially explicit plant trait database: comparing imputation methods and different levels of environmental information

    Directory of Open Access Journals (Sweden)

    R. Poyatos

    2018-05-01

    Full Text Available The ubiquity of missing data in plant trait databases may hinder trait-based analyses of ecological patterns and processes. Spatially explicit datasets with information on intraspecific trait variability are rare but offer great promise in improving our understanding of functional biogeography. At the same time, they offer specific challenges in terms of data imputation. Here we compare statistical imputation approaches, using varying levels of environmental information, for five plant traits (leaf biomass to sapwood area ratio, leaf nitrogen content, maximum tree height, leaf mass per area and wood density in a spatially explicit plant trait dataset of temperate and Mediterranean tree species (Ecological and Forest Inventory of Catalonia, IEFC, dataset for Catalonia, north-east Iberian Peninsula, 31 900 km2. We simulated gaps at different missingness levels (10–80 % in a complete trait matrix, and we used overall trait means, species means, k nearest neighbours (kNN, ordinary and regression kriging, and multivariate imputation using chained equations (MICE to impute missing trait values. We assessed these methods in terms of their accuracy and of their ability to preserve trait distributions, multi-trait correlation structure and bivariate trait relationships. The relatively good performance of mean and species mean imputations in terms of accuracy masked a poor representation of trait distributions and multivariate trait structure. Species identity improved MICE imputations for all traits, whereas forest structure and topography improved imputations for some traits. No method performed best consistently for the five studied traits, but, considering all traits and performance metrics, MICE informed by relevant ecological variables gave the best results. However, at higher missingness (> 30 %, species mean imputations and regression kriging tended to outperform MICE for some traits. MICE informed by relevant ecological variables

  5. Relative efficiency of joint-model and full-conditional-specification multiple imputation when conditional models are compatible: The general location model.

    Science.gov (United States)

    Seaman, Shaun R; Hughes, Rachael A

    2018-06-01

    Estimating the parameters of a regression model of interest is complicated by missing data on the variables in that model. Multiple imputation is commonly used to handle these missing data. Joint model multiple imputation and full-conditional specification multiple imputation are known to yield imputed data with the same asymptotic distribution when the conditional models of full-conditional specification are compatible with that joint model. We show that this asymptotic equivalence of imputation distributions does not imply that joint model multiple imputation and full-conditional specification multiple imputation will also yield asymptotically equally efficient inference about the parameters of the model of interest, nor that they will be equally robust to misspecification of the joint model. When the conditional models used by full-conditional specification multiple imputation are linear, logistic and multinomial regressions, these are compatible with a restricted general location joint model. We show that multiple imputation using the restricted general location joint model can be substantially more asymptotically efficient than full-conditional specification multiple imputation, but this typically requires very strong associations between variables. When associations are weaker, the efficiency gain is small. Moreover, full-conditional specification multiple imputation is shown to be potentially much more robust than joint model multiple imputation using the restricted general location model to mispecification of that model when there is substantial missingness in the outcome variable.

  6. Methods for Mediation Analysis with Missing Data

    Science.gov (United States)

    Zhang, Zhiyong; Wang, Lijuan

    2013-01-01

    Despite wide applications of both mediation models and missing data techniques, formal discussion of mediation analysis with missing data is still rare. We introduce and compare four approaches to dealing with missing data in mediation analysis including list wise deletion, pairwise deletion, multiple imputation (MI), and a two-stage maximum…

  7. The use of multiple imputation for the accurate measurements of individual feed intake by electronic feeders.

    Science.gov (United States)

    Jiao, S; Tiezzi, F; Huang, Y; Gray, K A; Maltecca, C

    2016-02-01

    Obtaining accurate individual feed intake records is the key first step in achieving genetic progress toward more efficient nutrient utilization in pigs. Feed intake records collected by electronic feeding systems contain errors (erroneous and abnormal values exceeding certain cutoff criteria), which are due to feeder malfunction or animal-feeder interaction. In this study, we examined the use of a novel data-editing strategy involving multiple imputation to minimize the impact of errors and missing values on the quality of feed intake data collected by an electronic feeding system. Accuracy of feed intake data adjustment obtained from the conventional linear mixed model (LMM) approach was compared with 2 alternative implementations of multiple imputation by chained equation, denoted as MI (multiple imputation) and MICE (multiple imputation by chained equation). The 3 methods were compared under 3 scenarios, where 5, 10, and 20% feed intake error rates were simulated. Each of the scenarios was replicated 5 times. Accuracy of the alternative error adjustment was measured as the correlation between the true daily feed intake (DFI; daily feed intake in the testing period) or true ADFI (the mean DFI across testing period) and the adjusted DFI or adjusted ADFI. In the editing process, error cutoff criteria are used to define if a feed intake visit contains errors. To investigate the possibility that the error cutoff criteria may affect any of the 3 methods, the simulation was repeated with 2 alternative error cutoff values. Multiple imputation methods outperformed the LMM approach in all scenarios with mean accuracies of 96.7, 93.5, and 90.2% obtained with MI and 96.8, 94.4, and 90.1% obtained with MICE compared with 91.0, 82.6, and 68.7% using LMM for DFI. Similar results were obtained for ADFI. Furthermore, multiple imputation methods consistently performed better than LMM regardless of the cutoff criteria applied to define errors. In conclusion, multiple imputation

  8. Imputation of variants from the 1000 Genomes Project modestly improves known associations and can identify low-frequency variant-phenotype associations undetected by HapMap based imputation.

    Science.gov (United States)

    Wood, Andrew R; Perry, John R B; Tanaka, Toshiko; Hernandez, Dena G; Zheng, Hou-Feng; Melzer, David; Gibbs, J Raphael; Nalls, Michael A; Weedon, Michael N; Spector, Tim D; Richards, J Brent; Bandinelli, Stefania; Ferrucci, Luigi; Singleton, Andrew B; Frayling, Timothy M

    2013-01-01

    Genome-wide association (GWA) studies have been limited by the reliance on common variants present on microarrays or imputable from the HapMap Project data. More recently, the completion of the 1000 Genomes Project has provided variant and haplotype information for several million variants derived from sequencing over 1,000 individuals. To help understand the extent to which more variants (including low frequency (1% ≤ MAF 1000 Genomes imputation, respectively, and 9 and 11 that reached a stricter, likely conservative, threshold of P1000 Genomes genotype data modestly improved the strength of known associations. Of 20 associations detected at P1000 Genomes imputed data and one was nominally more strongly associated in HapMap imputed data. We also detected an association between a low frequency variant and phenotype that was previously missed by HapMap based imputation approaches. An association between rs112635299 and alpha-1 globulin near the SERPINA gene represented the known association between rs28929474 (MAF = 0.007) and alpha1-antitrypsin that predisposes to emphysema (P = 2.5×10(-12)). Our data provide important proof of principle that 1000 Genomes imputation will detect novel, low frequency-large effect associations.

  9. UniFIeD Univariate Frequency-based Imputation for Time Series Data

    OpenAIRE

    Friese, Martina; Stork, Jörg; Ramos Guerra, Ricardo; Bartz-Beielstein, Thomas; Thaker, Soham; Flasch, Oliver; Zaefferer, Martin

    2013-01-01

    This paper introduces UniFIeD, a new data preprocessing method for time series. UniFIeD can cope with large intervals of missing data. A scalable test function generator, which allows the simulation of time series with different gap sizes, is presented additionally. An experimental study demonstrates that (i) UniFIeD shows a significant better performance than simple imputation methods and (ii) UniFIeD is able to handle situations, where advanced imputation methods fail. The results are indep...

  10. Bioinspired Computational Approach to Missing Value Estimation

    Directory of Open Access Journals (Sweden)

    Israel Edem Agbehadji

    2018-01-01

    Full Text Available Missing data occurs when values of variables in a dataset are not stored. Estimating these missing values is a significant step during the data cleansing phase of a big data management approach. The reason of missing data may be due to nonresponse or omitted entries. If these missing data are not handled properly, this may create inaccurate results during data analysis. Although a traditional method such as maximum likelihood method extrapolates missing values, this paper proposes a bioinspired method based on the behavior of birds, specifically the Kestrel bird. This paper describes the behavior and characteristics of the Kestrel bird, a bioinspired approach, in modeling an algorithm to estimate missing values. The proposed algorithm (KSA was compared with WSAMP, Firefly, and BAT algorithm. The results were evaluated using the mean of absolute error (MAE. A statistical test (Wilcoxon signed-rank test and Friedman test was conducted to test the performance of the algorithms. The results of Wilcoxon test indicate that time does not have a significant effect on the performance, and the quality of estimation between the paired algorithms was significant; the results of Friedman test ranked KSA as the best evolutionary algorithm.

  11. A note on the relationships between multiple imputation, maximum likelihood and fully Bayesian methods for missing responses in linear regression models.

    Science.gov (United States)

    Chen, Qingxia; Ibrahim, Joseph G

    2014-07-01

    Multiple Imputation, Maximum Likelihood and Fully Bayesian methods are the three most commonly used model-based approaches in missing data problems. Although it is easy to show that when the responses are missing at random (MAR), the complete case analysis is unbiased and efficient, the aforementioned methods are still commonly used in practice for this setting. To examine the performance of and relationships between these three methods in this setting, we derive and investigate small sample and asymptotic expressions of the estimates and standard errors, and fully examine how these estimates are related for the three approaches in the linear regression model when the responses are MAR. We show that when the responses are MAR in the linear model, the estimates of the regression coefficients using these three methods are asymptotically equivalent to the complete case estimates under general conditions. One simulation and a real data set from a liver cancer clinical trial are given to compare the properties of these methods when the responses are MAR.

  12. Methods for significance testing of categorical covariates in logistic regression models after multiple imputation: power and applicability analysis

    NARCIS (Netherlands)

    Eekhout, I.; Wiel, M.A. van de; Heymans, M.W.

    2017-01-01

    Background. Multiple imputation is a recommended method to handle missing data. For significance testing after multiple imputation, Rubin’s Rules (RR) are easily applied to pool parameter estimates. In a logistic regression model, to consider whether a categorical covariate with more than two levels

  13. Multiple imputation strategies for zero-inflated cost data in economic evaluations : which method works best?

    NARCIS (Netherlands)

    MacNeil Vroomen, Janet; Eekhout, Iris; Dijkgraaf, Marcel G; van Hout, Hein; de Rooij, Sophia E; Heymans, Martijn W; Bosmans, Judith E

    2016-01-01

    Cost and effect data often have missing data because economic evaluations are frequently added onto clinical studies where cost data are rarely the primary outcome. The objective of this article was to investigate which multiple imputation strategy is most appropriate to use for missing

  14. Cost reduction for web-based data imputation

    KAUST Repository

    Li, Zhixu; Shang, Shuo; Xie, Qing; Zhang, Xiangliang

    2014-01-01

    Web-based Data Imputation enables the completion of incomplete data sets by retrieving absent field values from the Web. In particular, complete fields can be used as keywords in imputation queries for absent fields. However, due to the ambiguity

  15. Estimation of response from longitudinal binary data with nonignorable missing values in migraine trials

    Directory of Open Access Journals (Sweden)

    Fang Fang

    2016-12-01

    Full Text Available In migraine trials pain relief responses from a headache at specific time points and sustained pain relief response over a period of time are important efficacy measures. When there are missing records of individual time point pain scores and/or headache recurrences during a migraine trial, the common approach used in practice to estimate the sustained response is statistically inconsistent even if the data are missing completely at random. Methods dealing with nonignorable longitudinal missing data usually assume certain models for the missing mechanism which can not be checked as they involve unobserved data. Taking advantage of the specific definition of the ‘sustained pain relief’ response, we propose two estimating methods based on intuitive imputation, which do not require model assumptions on the missing probability or specification of the correlation structure among the longitudinal observations. The consistency of the proposed methods is discussed in theory and their empirical performances are assessed through intensive simulation studies. The simulation results show that the proposed methods perform well in terms of reducing bias and mean square error except in several extreme cases which are unlikely to happen in real trials. The application of the proposed methods is illustrated in a real data analysis.

  16. Practical considerations for sensitivity analysis after multiple imputation applied to epidemiological studies with incomplete data

    Science.gov (United States)

    2012-01-01

    Background Multiple Imputation as usually implemented assumes that data are Missing At Random (MAR), meaning that the underlying missing data mechanism, given the observed data, is independent of the unobserved data. To explore the sensitivity of the inferences to departures from the MAR assumption, we applied the method proposed by Carpenter et al. (2007). This approach aims to approximate inferences under a Missing Not At random (MNAR) mechanism by reweighting estimates obtained after multiple imputation where the weights depend on the assumed degree of departure from the MAR assumption. Methods The method is illustrated with epidemiological data from a surveillance system of hepatitis C virus (HCV) infection in France during the 2001–2007 period. The subpopulation studied included 4343 HCV infected patients who reported drug use. Risk factors for severe liver disease were assessed. After performing complete-case and multiple imputation analyses, we applied the sensitivity analysis to 3 risk factors of severe liver disease: past excessive alcohol consumption, HIV co-infection and infection with HCV genotype 3. Results In these data, the association between severe liver disease and HIV was underestimated, if given the observed data the chance of observing HIV status is high when this is positive. Inference for two other risk factors were robust to plausible local departures from the MAR assumption. Conclusions We have demonstrated the practical utility of, and advocate, a pragmatic widely applicable approach to exploring plausible departures from the MAR assumption post multiple imputation. We have developed guidelines for applying this approach to epidemiological studies. PMID:22681630

  17. A Review of Methods for Missing Data.

    Science.gov (United States)

    Pigott, Therese D.

    2001-01-01

    Reviews methods for handling missing data in a research study. Model-based methods, such as maximum likelihood using the EM algorithm and multiple imputation, hold more promise than ad hoc methods. Although model-based methods require more specialized computer programs and assumptions about the nature of missing data, these methods are appropriate…

  18. Analysis of the progression of systolic blood pressure using imputation of missing phenotype values

    OpenAIRE

    Vaitsiakhovich, Tatsiana; Drichel, Dmitriy; Angisch, Marina; Becker, Tim; Herold, Christine; Lacour, André

    2014-01-01

    We present a genome-wide association study of a quantitative trait, "progression of systolic blood pressure in time," in which 142 unrelated individuals of the Genetic Analysis Workshop 18 real genotype data were analyzed. Information on systolic blood pressure and other phenotypic covariates was missing at certain time points for a considerable part of the sample. We observed that the dropout process causing missingness is not independent of the initial systolic blood pressure; that is, the ...

  19. Missing continuous outcomes under covariate dependent missingness in cluster randomised trials.

    Science.gov (United States)

    Hossain, Anower; Diaz-Ordaz, Karla; Bartlett, Jonathan W

    2017-06-01

    Attrition is a common occurrence in cluster randomised trials which leads to missing outcome data. Two approaches for analysing such trials are cluster-level analysis and individual-level analysis. This paper compares the performance of unadjusted cluster-level analysis, baseline covariate adjusted cluster-level analysis and linear mixed model analysis, under baseline covariate dependent missingness in continuous outcomes, in terms of bias, average estimated standard error and coverage probability. The methods of complete records analysis and multiple imputation are used to handle the missing outcome data. We considered four scenarios, with the missingness mechanism and baseline covariate effect on outcome either the same or different between intervention groups. We show that both unadjusted cluster-level analysis and baseline covariate adjusted cluster-level analysis give unbiased estimates of the intervention effect only if both intervention groups have the same missingness mechanisms and there is no interaction between baseline covariate and intervention group. Linear mixed model and multiple imputation give unbiased estimates under all four considered scenarios, provided that an interaction of intervention and baseline covariate is included in the model when appropriate. Cluster mean imputation has been proposed as a valid approach for handling missing outcomes in cluster randomised trials. We show that cluster mean imputation only gives unbiased estimates when missingness mechanism is the same between the intervention groups and there is no interaction between baseline covariate and intervention group. Multiple imputation shows overcoverage for small number of clusters in each intervention group.

  20. Principled Missing Data Treatments.

    Science.gov (United States)

    Lang, Kyle M; Little, Todd D

    2018-04-01

    We review a number of issues regarding missing data treatments for intervention and prevention researchers. Many of the common missing data practices in prevention research are still, unfortunately, ill-advised (e.g., use of listwise and pairwise deletion, insufficient use of auxiliary variables). Our goal is to promote better practice in the handling of missing data. We review the current state of missing data methodology and recent missing data reporting in prevention research. We describe antiquated, ad hoc missing data treatments and discuss their limitations. We discuss two modern, principled missing data treatments: multiple imputation and full information maximum likelihood, and we offer practical tips on how to best employ these methods in prevention research. The principled missing data treatments that we discuss are couched in terms of how they improve causal and statistical inference in the prevention sciences. Our recommendations are firmly grounded in missing data theory and well-validated statistical principles for handling the missing data issues that are ubiquitous in biosocial and prevention research. We augment our broad survey of missing data analysis with references to more exhaustive resources.

  1. Evaluating geographic imputation approaches for zip code level data: an application to a study of pediatric diabetes

    Directory of Open Access Journals (Sweden)

    Puett Robin C

    2009-10-01

    Full Text Available Abstract Background There is increasing interest in the study of place effects on health, facilitated in part by geographic information systems. Incomplete or missing address information reduces geocoding success. Several geographic imputation methods have been suggested to overcome this limitation. Accuracy evaluation of these methods can be focused at the level of individuals and at higher group-levels (e.g., spatial distribution. Methods We evaluated the accuracy of eight geo-imputation methods for address allocation from ZIP codes to census tracts at the individual and group level. The spatial apportioning approaches underlying the imputation methods included four fixed (deterministic and four random (stochastic allocation methods using land area, total population, population under age 20, and race/ethnicity as weighting factors. Data included more than 2,000 geocoded cases of diabetes mellitus among youth aged 0-19 in four U.S. regions. The imputed distribution of cases across tracts was compared to the true distribution using a chi-squared statistic. Results At the individual level, population-weighted (total or under age 20 fixed allocation showed the greatest level of accuracy, with correct census tract assignments averaging 30.01% across all regions, followed by the race/ethnicity-weighted random method (23.83%. The true distribution of cases across census tracts was that 58.2% of tracts exhibited no cases, 26.2% had one case, 9.5% had two cases, and less than 3% had three or more. This distribution was best captured by random allocation methods, with no significant differences (p-value > 0.90. However, significant differences in distributions based on fixed allocation methods were found (p-value Conclusion Fixed imputation methods seemed to yield greatest accuracy at the individual level, suggesting use for studies on area-level environmental exposures. Fixed methods result in artificial clusters in single census tracts. For studies

  2. Traffic speed data imputation method based on tensor completion.

    Science.gov (United States)

    Ran, Bin; Tan, Huachun; Feng, Jianshuai; Liu, Ying; Wang, Wuhong

    2015-01-01

    Traffic speed data plays a key role in Intelligent Transportation Systems (ITS); however, missing traffic data would affect the performance of ITS as well as Advanced Traveler Information Systems (ATIS). In this paper, we handle this issue by a novel tensor-based imputation approach. Specifically, tensor pattern is adopted for modeling traffic speed data and then High accurate Low Rank Tensor Completion (HaLRTC), an efficient tensor completion method, is employed to estimate the missing traffic speed data. This proposed method is able to recover missing entries from given entries, which may be noisy, considering severe fluctuation of traffic speed data compared with traffic volume. The proposed method is evaluated on Performance Measurement System (PeMS) database, and the experimental results show the superiority of the proposed approach over state-of-the-art baseline approaches.

  3. Using imputation to provide location information for nongeocoded addresses.

    Directory of Open Access Journals (Sweden)

    Frank C Curriero

    2010-02-01

    Full Text Available The importance of geography as a source of variation in health research continues to receive sustained attention in the literature. The inclusion of geographic information in such research often begins by adding data to a map which is predicated by some knowledge of location. A precise level of spatial information is conventionally achieved through geocoding, the geographic information system (GIS process of translating mailing address information to coordinates on a map. The geocoding process is not without its limitations, though, since there is always a percentage of addresses which cannot be converted successfully (nongeocodable. This raises concerns regarding bias since traditionally the practice has been to exclude nongeocoded data records from analysis.In this manuscript we develop and evaluate a set of imputation strategies for dealing with missing spatial information from nongeocoded addresses. The strategies are developed assuming a known zip code with increasing use of collateral information, namely the spatial distribution of the population at risk. Strategies are evaluated using prostate cancer data obtained from the Maryland Cancer Registry. We consider total case enumerations at the Census county, tract, and block group level as the outcome of interest when applying and evaluating the methods. Multiple imputation is used to provide estimated total case counts based on complete data (geocodes plus imputed nongeocodes with a measure of uncertainty. Results indicate that the imputation strategy based on using available population-based age, gender, and race information performed the best overall at the county, tract, and block group levels.The procedure allows for the potentially biased and likely under reported outcome, case enumerations based on only the geocoded records, to be presented with a statistically adjusted count (imputed count with a measure of uncertainty that are based on all the case data, the geocodes and imputed

  4. Random forest predictive modeling of mineral prospectivity with small number of prospects and data with missing values in Abra (Philippines)

    Science.gov (United States)

    Carranza, Emmanuel John M.; Laborte, Alice G.

    2015-01-01

    Machine learning methods that have been used in data-driven predictive modeling of mineral prospectivity (e.g., artificial neural networks) invariably require large number of training prospect/locations and are unable to handle missing values in certain evidential data. The Random Forests (RF) algorithm, which is a machine learning method, has recently been applied to data-driven predictive mapping of mineral prospectivity, and so it is instructive to further study its efficacy in this particular field. This case study, carried out using data from Abra (Philippines), examines (a) if RF modeling can be used for data-driven modeling of mineral prospectivity in areas with a few (i.e., individual layers of evidential data. Furthermore, RF modeling can handle missing values in evidential data through an RF-based imputation technique whereas in WofE modeling values are simply represented by zero weights. Therefore, the RF algorithm is potentially more useful than existing methods that are currently used for data-driven predictive mapping of mineral prospectivity. In particular, it is not a purely black-box method like artificial neural networks in the context of data-driven predictive modeling of mineral prospectivity. However, further testing of the method in other areas with a few mineral occurrences is needed to fully investigate its usefulness in data-driven predictive modeling of mineral prospectivity.

  5. [Study on correction of data bias caused by different missing mechanisms in survey of medical expenditure among students enrolling in Urban Resident Basic Medical Insurance].

    Science.gov (United States)

    Zhang, Haixia; Zhao, Junkang; Gu, Caijiao; Cui, Yan; Rong, Huiying; Meng, Fanlong; Wang, Tong

    2015-05-01

    The study of the medical expenditure and its influencing factors among the students enrolling in Urban Resident Basic Medical Insurance (URBMI) in Taiyuan indicated that non response bias and selection bias coexist in dependent variable of the survey data. Unlike previous studies only focused on one missing mechanism, a two-stage method to deal with two missing mechanisms simultaneously was suggested in this study, combining multiple imputation with sample selection model. A total of 1 190 questionnaires were returned by the students (or their parents) selected in child care settings, schools and universities in Taiyuan by stratified cluster random sampling in 2012. In the returned questionnaires, 2.52% existed not missing at random (NMAR) of dependent variable and 7.14% existed missing at random (MAR) of dependent variable. First, multiple imputation was conducted for MAR by using completed data, then sample selection model was used to correct NMAR in multiple imputation, and a multi influencing factor analysis model was established. Based on 1 000 times resampling, the best scheme of filling the random missing values is the predictive mean matching (PMM) method under the missing proportion. With this optimal scheme, a two stage survey was conducted. Finally, it was found that the influencing factors on annual medical expenditure among the students enrolling in URBMI in Taiyuan included population group, annual household gross income, affordability of medical insurance expenditure, chronic disease, seeking medical care in hospital, seeking medical care in community health center or private clinic, hospitalization, hospitalization canceled due to certain reason, self medication and acceptable proportion of self-paid medical expenditure. The two-stage method combining multiple imputation with sample selection model can deal with non response bias and selection bias effectively in dependent variable of the survey data.

  6. Multiply-Imputed Synthetic Data: Advice to the Imputer

    Directory of Open Access Journals (Sweden)

    Loong Bronwyn

    2017-12-01

    Full Text Available Several statistical agencies have started to use multiply-imputed synthetic microdata to create public-use data in major surveys. The purpose of doing this is to protect the confidentiality of respondents’ identities and sensitive attributes, while allowing standard complete-data analyses of microdata. A key challenge, faced by advocates of synthetic data, is demonstrating that valid statistical inferences can be obtained from such synthetic data for non-confidential questions. Large discrepancies between observed-data and synthetic-data analytic results for such questions may arise because of uncongeniality; that is, differences in the types of inputs available to the imputer, who has access to the actual data, and to the analyst, who has access only to the synthetic data. Here, we discuss a simple, but possibly canonical, example of uncongeniality when using multiple imputation to create synthetic data, which specifically addresses the choices made by the imputer. An initial, unanticipated but not surprising, conclusion is that non-confidential design information used to impute synthetic data should be released with the confidential synthetic data to allow users of synthetic data to avoid possible grossly conservative inferences.

  7. Comparison of Imputation Methods for Handling Missing Categorical Data with Univariate Pattern|| Una comparación de métodos de imputación de variables categóricas con patrón univariado

    Directory of Open Access Journals (Sweden)

    Torres Munguía, Juan Armando

    2014-06-01

    Full Text Available This paper examines the sample proportions estimates in the presence of univariate missing categorical data. A database about smoking habits (2011 National Addiction Survey of Mexico was used to create simulated yet realistic datasets at rates 5% and 15% of missingness, each for MCAR, MAR and MNAR mechanisms. Then the performance of six methods for addressing missingness is evaluated: listwise, mode imputation, random imputation, hot-deck, imputation by polytomous regression and random forests. Results showed that the most effective methods for dealing with missing categorical data in most of the scenarios assessed in this paper were hot-deck and polytomous regression approaches. || El presente estudio examina la estimación de proporciones muestrales en la presencia de valores faltantes en una variable categórica. Se utiliza una encuesta de consumo de tabaco (Encuesta Nacional de Adicciones de México 2011 para crear bases de datos simuladas pero reales con 5% y 15% de valores perdidos para cada mecanismo de no respuesta MCAR, MAR y MNAR. Se evalúa el desempeño de seis métodos para tratar la falta de respuesta: listwise, imputación de moda, imputación aleatoria, hot-deck, imputación por regresión politómica y árboles de clasificación. Los resultados de las simulaciones indican que los métodos más efectivos para el tratamiento de la no respuesta en variables categóricas, bajo los escenarios simulados, son hot-deck y la regresión politómica.

  8. Traffic Speed Data Imputation Method Based on Tensor Completion

    Directory of Open Access Journals (Sweden)

    Bin Ran

    2015-01-01

    Full Text Available Traffic speed data plays a key role in Intelligent Transportation Systems (ITS; however, missing traffic data would affect the performance of ITS as well as Advanced Traveler Information Systems (ATIS. In this paper, we handle this issue by a novel tensor-based imputation approach. Specifically, tensor pattern is adopted for modeling traffic speed data and then High accurate Low Rank Tensor Completion (HaLRTC, an efficient tensor completion method, is employed to estimate the missing traffic speed data. This proposed method is able to recover missing entries from given entries, which may be noisy, considering severe fluctuation of traffic speed data compared with traffic volume. The proposed method is evaluated on Performance Measurement System (PeMS database, and the experimental results show the superiority of the proposed approach over state-of-the-art baseline approaches.

  9. Principal Component Analysis of Process Datasets with Missing Values

    Directory of Open Access Journals (Sweden)

    Kristen A. Severson

    2017-07-01

    Full Text Available Datasets with missing values arising from causes such as sensor failure, inconsistent sampling rates, and merging data from different systems are common in the process industry. Methods for handling missing data typically operate during data pre-processing, but can also occur during model building. This article considers missing data within the context of principal component analysis (PCA, which is a method originally developed for complete data that has widespread industrial application in multivariate statistical process control. Due to the prevalence of missing data and the success of PCA for handling complete data, several PCA algorithms that can act on incomplete data have been proposed. Here, algorithms for applying PCA to datasets with missing values are reviewed. A case study is presented to demonstrate the performance of the algorithms and suggestions are made with respect to choosing which algorithm is most appropriate for particular settings. An alternating algorithm based on the singular value decomposition achieved the best results in the majority of test cases involving process datasets.

  10. 基于距离最大化和缺失数据聚类的填充算法%Missing value imputation algorithm based on distance maximization and missing data clustering

    Institute of Scientific and Technical Information of China (English)

    赵星; 王逊; 黄树成

    2018-01-01

    Through the improvement of the missing value filling algorithm based on K-means clustering,a filling algorithm based on distance maximization and missing data clustering is proposed in this paper.First of all,the original filling algorithm needs to enter the number of clusters in advance.To solve this problem,an improved K-means clustering algorithm is designed.It determines the cluster centers by the maximum distance between the data.It will automatically generate the number of clusters,and improve the efficiency of clustering.Secondly,the clustering distance function is improved.The improved algorithm can be used to cluster the missing value records,thus simplifying the steps of the original filling algorithm.Through experiments on STUDENT ALCOHOL CONSUMPTION data set,experimental results show that the proposed algorithm can improve efficiency and effectively fill the missing data at the same time.%通过对基于K-means聚类的缺失值填充算法的改进,文中提出了基于距离最大化和缺失数据聚类的填充算法.首先,针对原填充算法需要提前输入聚类个数这一缺点,设计了改进的K-means聚类算法:使用数据间的最大距离确定聚类中心,自动产生聚类个数,提高聚类效果;其次,对聚类的距离函数进行改进,采用部分距离度量方式,改进后的算法可以对含有缺失值的记录进行聚类,简化原填充算法步骤.通过对STUDENT ALCOHOL CONSUMPTION数据集的实验,结果证明了该算法能够在提高效率的同时,有效地填充缺失数据.

  11. Criteria of GenCall score to edit marker data and methods to handle missing markers have an influence on accuracy of genomic predictions

    DEFF Research Database (Denmark)

    Edriss, Vahid; Guldbrandtsen, Bernt; Lund, Mogens Sandø

    2013-01-01

    The aim of this study was to investigate the effect of different strategies for handling low-quality or missing data on prediction accuracy for direct genomic values of protein yield, mastitis and fertility using a Bayesian variable model and a GBLUP model in the Danish Jersey population. The data...... contained 1071 Jersey bulls that were genotyped with the Illumina Bovine 50K chip. After preliminary editing, 39227 SNP remained in the dataset. Four methods to handle missing genotypes were: 1) BEAGLE: missing markers were imputed using Beagle 3.3 software, 2) COMMON: missing genotypes at a locus were...

  12. One-Step Dynamic Classifier Ensemble Model for Customer Value Segmentation with Missing Values

    Directory of Open Access Journals (Sweden)

    Jin Xiao

    2014-01-01

    Full Text Available Scientific customer value segmentation (CVS is the base of efficient customer relationship management, and customer credit scoring, fraud detection, and churn prediction all belong to CVS. In real CVS, the customer data usually include lots of missing values, which may affect the performance of CVS model greatly. This study proposes a one-step dynamic classifier ensemble model for missing values (ODCEM model. On the one hand, ODCEM integrates the preprocess of missing values and the classification modeling into one step; on the other hand, it utilizes multiple classifiers ensemble technology in constructing the classification models. The empirical results in credit scoring dataset “German” from UCI and the real customer churn prediction dataset “China churn” show that the ODCEM outperforms four commonly used “two-step” models and the ensemble based model LMF and can provide better decision support for market managers.

  13. Combining item response theory with multiple imputation to equate health assessment questionnaires.

    Science.gov (United States)

    Gu, Chenyang; Gutman, Roee

    2017-09-01

    The assessment of patients' functional status across the continuum of care requires a common patient assessment tool. However, assessment tools that are used in various health care settings differ and cannot be easily contrasted. For example, the Functional Independence Measure (FIM) is used to evaluate the functional status of patients who stay in inpatient rehabilitation facilities, the Minimum Data Set (MDS) is collected for all patients who stay in skilled nursing facilities, and the Outcome and Assessment Information Set (OASIS) is collected if they choose home health care provided by home health agencies. All three instruments or questionnaires include functional status items, but the specific items, rating scales, and instructions for scoring different activities vary between the different settings. We consider equating different health assessment questionnaires as a missing data problem, and propose a variant of predictive mean matching method that relies on Item Response Theory (IRT) models to impute unmeasured item responses. Using real data sets, we simulated missing measurements and compared our proposed approach to existing methods for missing data imputation. We show that, for all of the estimands considered, and in most of the experimental conditions that were examined, the proposed approach provides valid inferences, and generally has better coverages, relatively smaller biases, and shorter interval estimates. The proposed method is further illustrated using a real data set. © 2016, The International Biometric Society.

  14. What are the appropriate methods for analyzing patient-reported outcomes in randomized trials when data are missing?

    Science.gov (United States)

    Hamel, J F; Sebille, V; Le Neel, T; Kubis, G; Boyer, F C; Hardouin, J B

    2017-12-01

    Subjective health measurements using Patient Reported Outcomes (PRO) are increasingly used in randomized trials, particularly for patient groups comparisons. Two main types of analytical strategies can be used for such data: Classical Test Theory (CTT) and Item Response Theory models (IRT). These two strategies display very similar characteristics when data are complete, but in the common case when data are missing, whether IRT or CTT would be the most appropriate remains unknown and was investigated using simulations. We simulated PRO data such as quality of life data. Missing responses to items were simulated as being completely random, depending on an observable covariate or on an unobserved latent trait. The considered CTT-based methods allowed comparing scores using complete-case analysis, personal mean imputations or multiple-imputations based on a two-way procedure. The IRT-based method was the Wald test on a Rasch model including a group covariate. The IRT-based method and the multiple-imputations-based method for CTT displayed the highest observed power and were the only unbiased method whatever the kind of missing data. Online software and Stata® modules compatibles with the innate mi impute suite are provided for performing such analyses. Traditional procedures (listwise deletion and personal mean imputations) should be avoided, due to inevitable problems of biases and lack of power.

  15. Improving accuracy of rare variant imputation with a two-step imputation approach

    DEFF Research Database (Denmark)

    Kreiner-Møller, Eskil; Medina-Gomez, Carolina; Uitterlinden, André G

    2015-01-01

    not being comprehensively scrutinized. Next-generation arrays ensuring sufficient coverage together with new reference panels, as the 1000 Genomes panel, are emerging to facilitate imputation of low frequent single-nucleotide polymorphisms (minor allele frequency (MAF) ... reference sample genotyped on a dense array and hereafter to the 1000 Genomes reference panel. We show that mean imputation quality, measured by the r(2) using this approach, increases by 28% for variants with a MAF between 1 and 5% as compared with direct imputation to 1000 Genomes reference. Similarly......Genotype imputation has been the pillar of the success of genome-wide association studies (GWAS) for identifying common variants associated with common diseases. However, most GWAS have been run using only 60 HapMap samples as reference for imputation, meaning less frequent and rare variants...

  16. Sensitivity analysis in multiple imputation in effectiveness studies of psychotherapy.

    Science.gov (United States)

    Crameri, Aureliano; von Wyl, Agnes; Koemeda, Margit; Schulthess, Peter; Tschuschke, Volker

    2015-01-01

    The importance of preventing and treating incomplete data in effectiveness studies is nowadays emphasized. However, most of the publications focus on randomized clinical trials (RCT). One flexible technique for statistical inference with missing data is multiple imputation (MI). Since methods such as MI rely on the assumption of missing data being at random (MAR), a sensitivity analysis for testing the robustness against departures from this assumption is required. In this paper we present a sensitivity analysis technique based on posterior predictive checking, which takes into consideration the concept of clinical significance used in the evaluation of intra-individual changes. We demonstrate the possibilities this technique can offer with the example of irregular longitudinal data collected with the Outcome Questionnaire-45 (OQ-45) and the Helping Alliance Questionnaire (HAQ) in a sample of 260 outpatients. The sensitivity analysis can be used to (1) quantify the degree of bias introduced by missing not at random data (MNAR) in a worst reasonable case scenario, (2) compare the performance of different analysis methods for dealing with missing data, or (3) detect the influence of possible violations to the model assumptions (e.g., lack of normality). Moreover, our analysis showed that ratings from the patient's and therapist's version of the HAQ could significantly improve the predictive value of the routine outcome monitoring based on the OQ-45. Since analysis dropouts always occur, repeated measurements with the OQ-45 and the HAQ analyzed with MI are useful to improve the accuracy of outcome estimates in quality assurance assessments and non-randomized effectiveness studies in the field of outpatient psychotherapy.

  17. How efficient is estimation with missing data?

    DEFF Research Database (Denmark)

    Karadogan, Seliz; Marchegiani, Letizia; Hansen, Lars Kai

    2011-01-01

    percentages (MDP) using a missing completely at random (MCAR) scheme. We compare three MDTs: pairwise deletion (PW), mean imputation (MI) and a maximum likelihood method that we call complete expectation maximization (CEM). We use a synthetic dataset, the Iris dataset and the Pima Indians Diabetes dataset. We...

  18. Prediction of regulatory gene pairs using dynamic time warping and gene ontology.

    Science.gov (United States)

    Yang, Andy C; Hsu, Hui-Huang; Lu, Ming-Da; Tseng, Vincent S; Shih, Timothy K

    2014-01-01

    Selecting informative genes is the most important task for data analysis on microarray gene expression data. In this work, we aim at identifying regulatory gene pairs from microarray gene expression data. However, microarray data often contain multiple missing expression values. Missing value imputation is thus needed before further processing for regulatory gene pairs becomes possible. We develop a novel approach to first impute missing values in microarray time series data by combining k-Nearest Neighbour (KNN), Dynamic Time Warping (DTW) and Gene Ontology (GO). After missing values are imputed, we then perform gene regulation prediction based on our proposed DTW-GO distance measurement of gene pairs. Experimental results show that our approach is more accurate when compared with existing missing value imputation methods on real microarray data sets. Furthermore, our approach can also discover more regulatory gene pairs that are known in the literature than other methods.

  19. A Time-Series Water Level Forecasting Model Based on Imputation and Variable Selection Method.

    Science.gov (United States)

    Yang, Jun-He; Cheng, Ching-Hsue; Chan, Chia-Pan

    2017-01-01

    Reservoirs are important for households and impact the national economy. This paper proposed a time-series forecasting model based on estimating a missing value followed by variable selection to forecast the reservoir's water level. This study collected data from the Taiwan Shimen Reservoir as well as daily atmospheric data from 2008 to 2015. The two datasets are concatenated into an integrated dataset based on ordering of the data as a research dataset. The proposed time-series forecasting model summarily has three foci. First, this study uses five imputation methods to directly delete the missing value. Second, we identified the key variable via factor analysis and then deleted the unimportant variables sequentially via the variable selection method. Finally, the proposed model uses a Random Forest to build the forecasting model of the reservoir's water level. This was done to compare with the listing method under the forecasting error. These experimental results indicate that the Random Forest forecasting model when applied to variable selection with full variables has better forecasting performance than the listing model. In addition, this experiment shows that the proposed variable selection can help determine five forecast methods used here to improve the forecasting capability.

  20. Towards a more efficient representation of imputation operators in TPOT

    OpenAIRE

    Garciarena, Unai; Mendiburu, Alexander; Santana, Roberto

    2018-01-01

    Automated Machine Learning encompasses a set of meta-algorithms intended to design and apply machine learning techniques (e.g., model selection, hyperparameter tuning, model assessment, etc.). TPOT, a software for optimizing machine learning pipelines based on genetic programming (GP), is a novel example of this kind of applications. Recently we have proposed a way to introduce imputation methods as part of TPOT. While our approach was able to deal with problems with missing data, it can prod...

  1. Hydrologic Response to Climate Change: Missing Precipitation Data Matters for Computed Timing Trends

    Science.gov (United States)

    Daniels, B.

    2016-12-01

    This work demonstrates the derivation of climate timing statistics and applying them to determine resulting hydroclimate impacts. Long-term daily precipitation observations from 50 California stations were used to compute climate trends of precipitation event Intensity, event Duration and Pause between events. Each precipitation event trend was then applied as input to a PRMS hydrology model which showed hydrology changes to recharge, baseflow, streamflow, etc. An important concern was precipitation uncertainty induced by missing observation values and causing errors in quantification of precipitation trends. Many standard statistical techniques such as ARIMA and simple endogenous or even exogenous imputation were applied but failed to help resolve these uncertainties. What helped resolve these uncertainties was use of multiple imputation techniques. This involved fitting of Weibull probability distributions to multiple imputed values for the three precipitation trends.Permutation resampling techniques using Monte Carlo processing were then applied to the multiple imputation values to derive significance p-values for each trend. Significance at the 95% level for Intensity was found for 11 of the 50 stations, Duration from 16 of the 50, and Pause from 19, of which 12 were 99% significant. The significance weighted trends for California are Intensity -4.61% per decade, Duration +3.49% per decade, and Pause +3.58% per decade. Two California basins with PRMS hydrologic models were studied: Feather River in the northern Sierra Nevada mountains and the central coast Soquel-Aptos. Each local trend was changed without changing the other trends or the total precipitation. Feather River Basin's critical supply to Lake Oroville and the State Water Project benefited from a total streamflow increase of 1.5%. The Soquel-Aptos Basin water supply was impacted by a total groundwater recharge decrease of -7.5% and streamflow decrease of -3.2%.

  2. Auxiliary variables in multiple imputation in regression with missing X: a warning against including too many in small sample research

    Directory of Open Access Journals (Sweden)

    Hardt Jochen

    2012-12-01

    Full Text Available Abstract Background Multiple imputation is becoming increasingly popular. Theoretical considerations as well as simulation studies have shown that the inclusion of auxiliary variables is generally of benefit. Methods A simulation study of a linear regression with a response Y and two predictors X1 and X2 was performed on data with n = 50, 100 and 200 using complete cases or multiple imputation with 0, 10, 20, 40 and 80 auxiliary variables. Mechanisms of missingness were either 100% MCAR or 50% MAR + 50% MCAR. Auxiliary variables had low (r=.10 vs. moderate correlations (r=.50 with X’s and Y. Results The inclusion of auxiliary variables can improve a multiple imputation model. However, inclusion of too many variables leads to downward bias of regression coefficients and decreases precision. When the correlations are low, inclusion of auxiliary variables is not useful. Conclusion More research on auxiliary variables in multiple imputation should be performed. A preliminary rule of thumb could be that the ratio of variables to cases with complete data should not go below 1 : 3.

  3. When and how should multiple imputation be used for handling missing data in randomised clinical trials - a practical guide with flowcharts

    DEFF Research Database (Denmark)

    Jakobsen, Janus Christian; Gluud, Christian; Wetterslev, Jørn

    2017-01-01

    the missingness. Therefore, the analysis of trial data with missing values requires careful planning and attention. METHODS: The authors had several meetings and discussions considering optimal ways of handling missing data to minimise the bias potential. We also searched PubMed (key words: missing data; randomi...

  4. Generalized canonical correlation analysis with missing values

    NARCIS (Netherlands)

    M. van de Velden (Michel); Y. Takane

    2009-01-01

    textabstractTwo new methods for dealing with missing values in generalized canonical correlation analysis are introduced. The first approach, which does not require iterations, is a generalization of the Test Equating method available for principal component analysis. In the second approach,

  5. Simultaneous Treatment of Missing Data and Measurement Error in HIV Research Using Multiple Overimputation.

    Science.gov (United States)

    Schomaker, Michael; Hogger, Sara; Johnson, Leigh F; Hoffmann, Christopher J; Bärnighausen, Till; Heumann, Christian

    2015-09-01

    Both CD4 count and viral load in HIV-infected persons are measured with error. There is no clear guidance on how to deal with this measurement error in the presence of missing data. We used multiple overimputation, a method recently developed in the political sciences, to account for both measurement error and missing data in CD4 count and viral load measurements from four South African cohorts of a Southern African HIV cohort collaboration. Our knowledge about the measurement error of ln CD4 and log10 viral load is part of an imputation model that imputes both missing and mismeasured data. In an illustrative example, we estimate the association of CD4 count and viral load with the hazard of death among patients on highly active antiretroviral therapy by means of a Cox model. Simulation studies evaluate the extent to which multiple overimputation is able to reduce bias in survival analyses. Multiple overimputation emphasizes more strongly the influence of having high baseline CD4 counts compared to both a complete case analysis and multiple imputation (hazard ratio for >200 cells/mm vs. <25 cells/mm: 0.21 [95% confidence interval: 0.18, 0.24] vs. 0.38 [0.29, 0.48], and 0.29 [0.25, 0.34], respectively). Similar results are obtained when varying assumptions about measurement error, when using p-splines, and when evaluating time-updated CD4 count in a longitudinal analysis. The estimates of the association with viral load are slightly more attenuated when using multiple imputation instead of multiple overimputation. Our simulation studies suggest that multiple overimputation is able to reduce bias and mean squared error in survival analyses. Multiple overimputation, which can be used with existing software, offers a convenient approach to account for both missing and mismeasured data in HIV research.

  6. Dealing with missing standard deviation and mean values in meta-analysis of continuous outcomes: a systematic review.

    Science.gov (United States)

    Weir, Christopher J; Butcher, Isabella; Assi, Valentina; Lewis, Stephanie C; Murray, Gordon D; Langhorne, Peter; Brady, Marian C

    2018-03-07

    Rigorous, informative meta-analyses rely on availability of appropriate summary statistics or individual participant data. For continuous outcomes, especially those with naturally skewed distributions, summary information on the mean or variability often goes unreported. While full reporting of original trial data is the ideal, we sought to identify methods for handling unreported mean or variability summary statistics in meta-analysis. We undertook two systematic literature reviews to identify methodological approaches used to deal with missing mean or variability summary statistics. Five electronic databases were searched, in addition to the Cochrane Colloquium abstract books and the Cochrane Statistics Methods Group mailing list archive. We also conducted cited reference searching and emailed topic experts to identify recent methodological developments. Details recorded included the description of the method, the information required to implement the method, any underlying assumptions and whether the method could be readily applied in standard statistical software. We provided a summary description of the methods identified, illustrating selected methods in example meta-analysis scenarios. For missing standard deviations (SDs), following screening of 503 articles, fifteen methods were identified in addition to those reported in a previous review. These included Bayesian hierarchical modelling at the meta-analysis level; summary statistic level imputation based on observed SD values from other trials in the meta-analysis; a practical approximation based on the range; and algebraic estimation of the SD based on other summary statistics. Following screening of 1124 articles for methods estimating the mean, one approximate Bayesian computation approach and three papers based on alternative summary statistics were identified. Illustrative meta-analyses showed that when replacing a missing SD the approximation using the range minimised loss of precision and generally

  7. Imputation of microsatellite alleles from dense SNP genotypes for parental verification

    Directory of Open Access Journals (Sweden)

    Matthew eMcclure

    2012-08-01

    Full Text Available Microsatellite (MS markers have recently been used for parental verification and are still the international standard despite higher cost, error rate, and turnaround time compared with Single Nucleotide Polymorphisms (SNP-based assays. Despite domestic and international interest from producers and research communities, no viable means currently exist to verify parentage for an individual unless all familial connections were analyzed using the same DNA marker type (MS or SNP. A simple and cost-effective method was devised to impute MS alleles from SNP haplotypes within breeds. For some MS, imputation results may allow inference across breeds. A total of 347 dairy cattle representing 4 dairy breeds (Brown Swiss, Guernsey, Holstein, and Jersey were used to generate reference haplotypes. This approach has been verified (>98% accurate for imputing the International Society of Animal Genetics (ISAG recommended panel of 12 MS for cattle parentage verification across a validation set of 1,307 dairy animals.. Implementation of this method will allow producers and breed associations to transition to SNP-based parentage verification utilizing MS genotypes from historical data on parents where SNP genotypes are missing. This approach may be applicable to additional cattle breeds and other species that wish to migrate from MS- to SNP- based parental verification.

  8. ParaHaplo 3.0: A program package for imputation and a haplotype-based whole-genome association study using hybrid parallel computing

    Directory of Open Access Journals (Sweden)

    Kamatani Naoyuki

    2011-05-01

    Full Text Available Abstract Background Use of missing genotype imputations and haplotype reconstructions are valuable in genome-wide association studies (GWASs. By modeling the patterns of linkage disequilibrium in a reference panel, genotypes not directly measured in the study samples can be imputed and used for GWASs. Since millions of single nucleotide polymorphisms need to be imputed in a GWAS, faster methods for genotype imputation and haplotype reconstruction are required. Results We developed a program package for parallel computation of genotype imputation and haplotype reconstruction. Our program package, ParaHaplo 3.0, is intended for use in workstation clusters using the Intel Message Passing Interface. We compared the performance of ParaHaplo 3.0 on the Japanese in Tokyo, Japan and Han Chinese in Beijing, and Chinese in the HapMap dataset. A parallel version of ParaHaplo 3.0 can conduct genotype imputation 20 times faster than a non-parallel version of ParaHaplo. Conclusions ParaHaplo 3.0 is an invaluable tool for conducting haplotype-based GWASs. The need for faster genotype imputation and haplotype reconstruction using parallel computing will become increasingly important as the data sizes of such projects continue to increase. ParaHaplo executable binaries and program sources are available at http://en.sourceforge.jp/projects/parallelgwas/releases/.

  9. A Time-Series Water Level Forecasting Model Based on Imputation and Variable Selection Method

    Directory of Open Access Journals (Sweden)

    Jun-He Yang

    2017-01-01

    Full Text Available Reservoirs are important for households and impact the national economy. This paper proposed a time-series forecasting model based on estimating a missing value followed by variable selection to forecast the reservoir’s water level. This study collected data from the Taiwan Shimen Reservoir as well as daily atmospheric data from 2008 to 2015. The two datasets are concatenated into an integrated dataset based on ordering of the data as a research dataset. The proposed time-series forecasting model summarily has three foci. First, this study uses five imputation methods to directly delete the missing value. Second, we identified the key variable via factor analysis and then deleted the unimportant variables sequentially via the variable selection method. Finally, the proposed model uses a Random Forest to build the forecasting model of the reservoir’s water level. This was done to compare with the listing method under the forecasting error. These experimental results indicate that the Random Forest forecasting model when applied to variable selection with full variables has better forecasting performance than the listing model. In addition, this experiment shows that the proposed variable selection can help determine five forecast methods used here to improve the forecasting capability.

  10. Multivariate missing data in hydrology - Review and applications

    Science.gov (United States)

    Ben Aissia, Mohamed-Aymen; Chebana, Fateh; Ouarda, Taha B. M. J.

    2017-12-01

    Water resources planning and management require complete data sets of a number of hydrological variables, such as flood peaks and volumes. However, hydrologists are often faced with the problem of missing data (MD) in hydrological databases. Several methods are used to deal with the imputation of MD. During the last decade, multivariate approaches have gained popularity in the field of hydrology, especially in hydrological frequency analysis (HFA). However, treating the MD remains neglected in the multivariate HFA literature whereas the focus has been mainly on the modeling component. For a complete analysis and in order to optimize the use of data, MD should also be treated in the multivariate setting prior to modeling and inference. Imputation of MD in the multivariate hydrological framework can have direct implications on the quality of the estimation. Indeed, the dependence between the series represents important additional information that can be included in the imputation process. The objective of the present paper is to highlight the importance of treating MD in multivariate hydrological frequency analysis by reviewing and applying multivariate imputation methods and by comparing univariate and multivariate imputation methods. An application is carried out for multiple flood attributes on three sites in order to evaluate the performance of the different methods based on the leave-one-out procedure. The results indicate that, the performance of imputation methods can be improved by adopting the multivariate setting, compared to mean substitution and interpolation methods, especially when using the copula-based approach.

  11. Effects of correcting missing daily feed intake values on the genetic parameters and estimated breeding values for feeding traits in pigs.

    Science.gov (United States)

    Ito, Tetsuya; Fukawa, Kazuo; Kamikawa, Mai; Nikaidou, Satoshi; Taniguchi, Masaaki; Arakawa, Aisaku; Tanaka, Genki; Mikawa, Satoshi; Furukawa, Tsutomu; Hirose, Kensuke

    2018-01-01

    Daily feed intake (DFI) is an important consideration for improving feed efficiency, but measurements using electronic feeder systems contain many missing and incorrect values. Therefore, we evaluated three methods for correcting missing DFI data (quadratic, orthogonal polynomial, and locally weighted (Loess) regression equations) and assessed the effects of these missing values on the genetic parameters and the estimated breeding values (EBV) for feeding traits. DFI records were obtained from 1622 Duroc pigs, comprising 902 individuals without missing DFI and 720 individuals with missing DFI. The Loess equation was the most suitable method for correcting the missing DFI values in 5-50% randomly deleted datasets among the three equations. Both variance components and heritability for the average DFI (ADFI) did not change because of the missing DFI proportion and Loess correction. In terms of rank correlation and information criteria, Loess correction improved the accuracy of EBV for ADFI compared to randomly deleted cases. These findings indicate that the Loess equation is useful for correcting missing DFI values for individual pigs and that the correction of missing DFI values could be effective for the estimation of breeding values and genetic improvement using EBV for feeding traits. © 2017 The Authors. Animal Science Journal published by John Wiley & Sons Australia, Ltd on behalf of Japanese Society of Animal Science.

  12. A Note on the Effect of Data Clustering on the Multiple-Imputation Variance Estimator: A Theoretical Addendum to the Lewis et al. article in JOS 2014

    Directory of Open Access Journals (Sweden)

    He Yulei

    2016-03-01

    Full Text Available Multiple imputation is a popular approach to handling missing data. Although it was originally motivated by survey nonresponse problems, it has been readily applied to other data settings. However, its general behavior still remains unclear when applied to survey data with complex sample designs, including clustering. Recently, Lewis et al. (2014 compared single- and multiple-imputation analyses for certain incomplete variables in the 2008 National Ambulatory Medicare Care Survey, which has a nationally representative, multistage, and clustered sampling design. Their study results suggested that the increase of the variance estimate due to multiple imputation compared with single imputation largely disappears for estimates with large design effects. We complement their empirical research by providing some theoretical reasoning. We consider data sampled from an equally weighted, single-stage cluster design and characterize the process using a balanced, one-way normal random-effects model. Assuming that the missingness is completely at random, we derive analytic expressions for the within- and between-multiple-imputation variance estimators for the mean estimator, and thus conveniently reveal the impact of design effects on these variance estimators. We propose approximations for the fraction of missing information in clustered samples, extending previous results for simple random samples. We discuss some generalizations of this research and its practical implications for data release by statistical agencies.

  13. Comparação de métodos de imputação única e múltipla usando como exemplo um modelo de risco para mortalidade cirúrgica Comparison of simple and multiple imputation methods using a risk model for surgical mortality as example

    Directory of Open Access Journals (Sweden)

    Luciana Neves Nunes

    2010-12-01

    Full Text Available INTRODUÇÃO: A perda de informações é um problema frequente em estudos realizados na área da Saúde. Na literatura essa perda é chamada de missing data ou dados faltantes. Através da imputação dos dados faltantes são criados conjuntos de dados artificialmente completos que podem ser analisados por técnicas estatísticas tradicionais. O objetivo desse artigo foi comparar, em um exemplo baseado em dados reais, a utilização de três técnicas de imputações diferentes. MÉTODO: Os dados utilizados referem-se a um estudo de desenvolvimento de modelo de risco cirúrgico, sendo que o tamanho da amostra foi de 450 pacientes. Os métodos de imputação empregados foram duas imputações únicas e uma imputação múltipla (IM, e a suposição sobre o mecanismo de não-resposta foi MAR (Missing at Random. RESULTADOS: A variável com dados faltantes foi a albumina sérica, com 27,1% de perda. Os modelos obtidos pelas imputações únicas foram semelhantes entre si, mas diferentes dos obtidos com os dados imputados pela IM quanto à inclusão de variáveis nos modelos. CONCLUSÕES: Os resultados indicam que faz diferença levar em conta a relação da albumina com outras variáveis observadas, pois foram obtidos modelos diferentes nas imputações única e múltipla. A imputação única subestima a variabilidade, gerando intervalos de confiança mais estreitos. É importante se considerar o uso de métodos de imputação quando há dados faltantes, especialmente a IM que leva em conta a variabilidade entre imputações para as estimativas do modelo.INTRODUCTION: It is common for studies in health to face problems with missing data. Through imputation, complete data sets are built artificially and can be analyzed by traditional statistical analysis. The objective of this paper is to compare three types of imputation based on real data. METHODS: The data used came from a study on the development of risk models for surgical mortality. The

  14. A New Missing Values Estimation Algorithm in Wireless Sensor Networks Based on Convolution

    Directory of Open Access Journals (Sweden)

    Feng Liu

    2013-04-01

    Full Text Available Nowadays, with the rapid development of Internet of Things (IoT applications, data missing phenomenon becomes very common in wireless sensor networks. This problem can greatly and directly threaten the stability and usability of the Internet of things applications which are constructed based on wireless sensor networks. How to estimate the missing value has attracted wide interest, and some solutions have been proposed. Different with the previous works, in this paper, we proposed a new convolution based missing value estimation algorithm. The convolution theory, which is usually used in the area of signal and image processing, can also be a practical and efficient way to estimate the missing sensor data. The results show that the proposed algorithm in this paper is practical and effective, and can estimate the missing value accurately.

  15. Estimated conditional score function for missing mechanism model with nonignorable nonresponse

    Institute of Scientific and Technical Information of China (English)

    CUI Xia; ZHOU Yong

    2017-01-01

    Missing data mechanism often depends on the values of the responses,which leads to nonignorable nonresponses.In such a situation,inference based on approaches that ignore the missing data mechanism could not be valid.A crucial step is to model the nature of missingness.We specify a parametric model for missingness mechanism,and then propose a conditional score function approach for estimation.This approach imputes the score function by taking the conditional expectation of the score function for the missing data given the available information.Inference procedure is then followed by replacing unknown terms with the related nonparametric estimators based on the observed data.The proposed score function does not suffer from the non-identifiability problem,and the proposed estimator is shown to be consistent and asymptotically normal.We also construct a confidence region for the parameter of interest using empirical likelihood method.Simulation studies demonstrate that the proposed inference procedure performs well in many settings.We apply the proposed method to a data set from research in a growth hormone and exercise intervention study.

  16. Statistical analysis with missing exposure data measured by proxy respondents: a misclassification problem within a missing-data problem.

    Science.gov (United States)

    Shardell, Michelle; Hicks, Gregory E

    2014-11-10

    In studies of older adults, researchers often recruit proxy respondents, such as relatives or caregivers, when study participants cannot provide self-reports (e.g., because of illness). Proxies are usually only sought to report on behalf of participants with missing self-reports; thus, either a participant self-report or proxy report, but not both, is available for each participant. Furthermore, the missing-data mechanism for participant self-reports is not identifiable and may be nonignorable. When exposures are binary and participant self-reports are conceptualized as the gold standard, substituting error-prone proxy reports for missing participant self-reports may produce biased estimates of outcome means. Researchers can handle this data structure by treating the problem as one of misclassification within the stratum of participants with missing self-reports. Most methods for addressing exposure misclassification require validation data, replicate data, or an assumption of nondifferential misclassification; other methods may result in an exposure misclassification model that is incompatible with the analysis model. We propose a model that makes none of the aforementioned requirements and still preserves model compatibility. Two user-specified tuning parameters encode the exposure misclassification model. Two proposed approaches estimate outcome means standardized for (potentially) high-dimensional covariates using multiple imputation followed by propensity score methods. The first method is parametric and uses maximum likelihood to estimate the exposure misclassification model (i.e., the imputation model) and the propensity score model (i.e., the analysis model); the second method is nonparametric and uses boosted classification and regression trees to estimate both models. We apply both methods to a study of elderly hip fracture patients. Copyright © 2014 John Wiley & Sons, Ltd.

  17. A new strategy for enhancing imputation quality of rare variants from next-generation sequencing data via combining SNP and exome chip data

    NARCIS (Netherlands)

    Y.J. Kim (Young Jin); J. Lee (Juyoung); B.-J. Kim (Bong-Jo); T. Park (Taesung); G.R. Abecasis (Gonçalo); M.A.A. De Almeida (Marcio); D. Altshuler (David); J.L. Asimit (Jennifer L.); G. Atzmon (Gil); M. Barber (Mathew); A. Barzilai (Ari); N.L. Beer (Nicola L.); G.I. Bell (Graeme I.); J. Below (Jennifer); T. Blackwell (Tom); J. Blangero (John); M. Boehnke (Michael); D.W. Bowden (Donald W.); N.P. Burtt (Noël); J.C. Chambers (John); H. Chen (Han); P. Chen (Ping); P.S. Chines (Peter); S. Choi (Sungkyoung); C. Churchhouse (Claire); P. Cingolani (Pablo); B.K. Cornes (Belinda); N.J. Cox (Nancy); A.G. Day-Williams (Aaron); A. Duggirala (Aparna); J. Dupuis (Josée); T. Dyer (Thomas); S. Feng (Shuang); J. Fernandez-Tajes (Juan); T. Ferreira (Teresa); T.E. Fingerlin (Tasha E.); J. Flannick (Jason); J.C. Florez (Jose); P. Fontanillas (Pierre); T.M. Frayling (Timothy); C. Fuchsberger (Christian); E. Gamazon (Eric); K. Gaulton (Kyle); S. Ghosh (Saurabh); B. Glaser (Benjamin); A.L. Gloyn (Anna); R.L. Grossman (Robert L.); J. Grundstad (Jason); C. Hanis (Craig); A. Heath (Allison); H. Highland (Heather); M. Horikoshi (Momoko); I.-S. Huh (Ik-Soo); J.R. Huyghe (Jeroen R.); M.K. Ikram (Kamran); K.A. Jablonski (Kathleen); Y. Jun (Yang); N. Kato (Norihiro); J. Kim (Jayoun); Y.J. Kim (Young Jin); B.-J. Kim (Bong-Jo); J. Lee (Juyoung); C.R. King (C. Ryan); J.S. Kooner (Jaspal S.); M.-S. Kwon (Min-Seok); H.K. Im (Hae Kyung); M. Laakso (Markku); K.K.-Y. Lam (Kevin Koi-Yau); J. Lee (Jaehoon); S. Lee (Selyeong); S. Lee (Sungyoung); D.M. Lehman (Donna M.); H. Li (Heng); C.M. Lindgren (Cecilia); X. Liu (Xuanyao); O.E. Livne (Oren E.); A.E. Locke (Adam E.); A. Mahajan (Anubha); J.B. Maller (Julian B.); A.K. Manning (Alisa K.); T.J. Maxwell (Taylor J.); A. Mazoure (Alexander); M.I. McCarthy (Mark); J.B. Meigs (James B.); B. Min (Byungju); K.L. Mohlke (Karen); A.P. Morris (Andrew); S. Musani (Solomon); Y. Nagai (Yoshihiko); M.C.Y. Ng (Maggie C.Y.); D. Nicolae (Dan); S. Oh (Sohee); N.D. Palmer (Nicholette); T. Park (Taesung); T.I. Pollin (Toni I.); I. Prokopenko (Inga); D. Reich (David); M.A. Rivas (Manuel); L.J. Scott (Laura); M. Seielstad (Mark); Y.S. Cho (Yoon Shin); X. Sim (Xueling); R. Sladek (Rob); P. Smith (Philip); I. Tachmazidou (Ioanna); E.S. Tai (Shyong); Y.Y. Teo (Yik Ying); T.M. Teslovich (Tanya M.); J. Torres (Jason); V. Trubetskoy (Vasily); S.M. Willems (Sara); A.L. Williams (Amy L.); J.G. Wilson (James); S. Wiltshire (Steven); S. Won (Sungho); A.R. Wood (Andrew); W. Xu (Wang); J. Yoon (Joon); M. Zawistowski (Matthew); E. Zeggini (Eleftheria); W. Zhang (Weihua); S. Zöllner (Sebastian)

    2015-01-01

    textabstractBackground: Rare variants have gathered increasing attention as a possible alternative source of missing heritability. Since next generation sequencing technology is not yet cost-effective for large-scale genomic studies, a widely used alternative approach is imputation. However, the

  18. Near Infrared Spectroscopy to Reduce the Prophylactic Fasciotomies for and Missed Cases of Acute Compartment Syndrome in Solders Injured in OEF/OIF

    Science.gov (United States)

    2015-11-01

    methods at each of the key clinical time points captured. Values of NIRS will be compared between limbs, compartments, and/or patient groups using... imputed to missing by hardcode programming after determination of start and stop times from graphics and/or inspection of raw data files. An example...sensor (Equanox, Nonin, Inc, Plymouth, MN) in diagnosing acute compartment syndrome following severe leg injury. Part 1 was a series of observational

  19. Normal Theory Two-Stage ML Estimator When Data Are Missing at the Item Level.

    Science.gov (United States)

    Savalei, Victoria; Rhemtulla, Mijke

    2017-08-01

    In many modeling contexts, the variables in the model are linear composites of the raw items measured for each participant; for instance, regression and path analysis models rely on scale scores, and structural equation models often use parcels as indicators of latent constructs. Currently, no analytic estimation method exists to appropriately handle missing data at the item level. Item-level multiple imputation (MI), however, can handle such missing data straightforwardly. In this article, we develop an analytic approach for dealing with item-level missing data-that is, one that obtains a unique set of parameter estimates directly from the incomplete data set and does not require imputations. The proposed approach is a variant of the two-stage maximum likelihood (TSML) methodology, and it is the analytic equivalent of item-level MI. We compare the new TSML approach to three existing alternatives for handling item-level missing data: scale-level full information maximum likelihood, available-case maximum likelihood, and item-level MI. We find that the TSML approach is the best analytic approach, and its performance is similar to item-level MI. We recommend its implementation in popular software and its further study.

  20. Multiple imputation using linked proxy outcome data resulted in important bias reduction and efficiency gains: a simulation study.

    Science.gov (United States)

    Cornish, R P; Macleod, J; Carpenter, J R; Tilling, K

    2017-01-01

    When an outcome variable is missing not at random (MNAR: probability of missingness depends on outcome values), estimates of the effect of an exposure on this outcome are often biased. We investigated the extent of this bias and examined whether the bias can be reduced through incorporating proxy outcomes obtained through linkage to administrative data as auxiliary variables in multiple imputation (MI). Using data from the Avon Longitudinal Study of Parents and Children (ALSPAC) we estimated the association between breastfeeding and IQ (continuous outcome), incorporating linked attainment data (proxies for IQ) as auxiliary variables in MI models. Simulation studies explored the impact of varying the proportion of missing data (from 20 to 80%), the correlation between the outcome and its proxy (0.1-0.9), the strength of the missing data mechanism, and having a proxy variable that was incomplete. Incorporating a linked proxy for the missing outcome as an auxiliary variable reduced bias and increased efficiency in all scenarios, even when 80% of the outcome was missing. Using an incomplete proxy was similarly beneficial. High correlations (> 0.5) between the outcome and its proxy substantially reduced the missing information. Consistent with this, ALSPAC analysis showed inclusion of a proxy reduced bias and improved efficiency. Gains with additional proxies were modest. In longitudinal studies with loss to follow-up, incorporating proxies for this study outcome obtained via linkage to external sources of data as auxiliary variables in MI models can give practically important bias reduction and efficiency gains when the study outcome is MNAR.

  1. Improved imputation accuracy of rare and low-frequency variants using population-specific high-coverage WGS-based imputation reference panel.

    Science.gov (United States)

    Mitt, Mario; Kals, Mart; Pärn, Kalle; Gabriel, Stacey B; Lander, Eric S; Palotie, Aarno; Ripatti, Samuli; Morris, Andrew P; Metspalu, Andres; Esko, Tõnu; Mägi, Reedik; Palta, Priit

    2017-06-01

    Genetic imputation is a cost-efficient way to improve the power and resolution of genome-wide association (GWA) studies. Current publicly accessible imputation reference panels accurately predict genotypes for common variants with minor allele frequency (MAF)≥5% and low-frequency variants (0.5≤MAF<5%) across diverse populations, but the imputation of rare variation (MAF<0.5%) is still rather limited. In the current study, we evaluate imputation accuracy achieved with reference panels from diverse populations with a population-specific high-coverage (30 ×) whole-genome sequencing (WGS) based reference panel, comprising of 2244 Estonian individuals (0.25% of adult Estonians). Although the Estonian-specific panel contains fewer haplotypes and variants, the imputation confidence and accuracy of imputed low-frequency and rare variants was significantly higher. The results indicate the utility of population-specific reference panels for human genetic studies.

  2. Multi-generational imputation of single nucleotide polymorphism marker genotypes and accuracy of genomic selection.

    Science.gov (United States)

    Toghiani, S; Aggrey, S E; Rekaya, R

    2016-07-01

    Availability of high-density single nucleotide polymorphism (SNP) genotyping platforms provided unprecedented opportunities to enhance breeding programmes in livestock, poultry and plant species, and to better understand the genetic basis of complex traits. Using this genomic information, genomic breeding values (GEBVs), which are more accurate than conventional breeding values. The superiority of genomic selection is possible only when high-density SNP panels are used to track genes and QTLs affecting the trait. Unfortunately, even with the continuous decrease in genotyping costs, only a small fraction of the population has been genotyped with these high-density panels. It is often the case that a larger portion of the population is genotyped with low-density and low-cost SNP panels and then imputed to a higher density. Accuracy of SNP genotype imputation tends to be high when minimum requirements are met. Nevertheless, a certain rate of genotype imputation errors is unavoidable. Thus, it is reasonable to assume that the accuracy of GEBVs will be affected by imputation errors; especially, their cumulative effects over time. To evaluate the impact of multi-generational selection on the accuracy of SNP genotypes imputation and the reliability of resulting GEBVs, a simulation was carried out under varying updating of the reference population, distance between the reference and testing sets, and the approach used for the estimation of GEBVs. Using fixed reference populations, imputation accuracy decayed by about 0.5% per generation. In fact, after 25 generations, the accuracy was only 7% lower than the first generation. When the reference population was updated by either 1% or 5% of the top animals in the previous generations, decay of imputation accuracy was substantially reduced. These results indicate that low-density panels are useful, especially when the generational interval between reference and testing population is small. As the generational interval

  3. A comparison of genomic selection models across time in interior spruce (Picea engelmannii × glauca) using unordered SNP imputation methods.

    Science.gov (United States)

    Ratcliffe, B; El-Dien, O G; Klápště, J; Porth, I; Chen, C; Jaquish, B; El-Kassaby, Y A

    2015-12-01

    Genomic selection (GS) potentially offers an unparalleled advantage over traditional pedigree-based selection (TS) methods by reducing the time commitment required to carry out a single cycle of tree improvement. This quality is particularly appealing to tree breeders, where lengthy improvement cycles are the norm. We explored the prospect of implementing GS for interior spruce (Picea engelmannii × glauca) utilizing a genotyped population of 769 trees belonging to 25 open-pollinated families. A series of repeated tree height measurements through ages 3-40 years permitted the testing of GS methods temporally. The genotyping-by-sequencing (GBS) platform was used for single nucleotide polymorphism (SNP) discovery in conjunction with three unordered imputation methods applied to a data set with 60% missing information. Further, three diverse GS models were evaluated based on predictive accuracy (PA), and their marker effects. Moderate levels of PA (0.31-0.55) were observed and were of sufficient capacity to deliver improved selection response over TS. Additionally, PA varied substantially through time accordingly with spatial competition among trees. As expected, temporal PA was well correlated with age-age genetic correlation (r=0.99), and decreased substantially with increasing difference in age between the training and validation populations (0.04-0.47). Moreover, our imputation comparisons indicate that k-nearest neighbor and singular value decomposition yielded a greater number of SNPs and gave higher predictive accuracies than imputing with the mean. Furthermore, the ridge regression (rrBLUP) and BayesCπ (BCπ) models both yielded equal, and better PA than the generalized ridge regression heteroscedastic effect model for the traits evaluated.

  4. Investigating Factorial Invariance of Latent Variables Across Populations When Manifest Variables Are Missing Completely.

    Science.gov (United States)

    Widaman, Keith F; Grimm, Kevin J; Early, Dawnté R; Robins, Richard W; Conger, Rand D

    2013-07-01

    Difficulties arise in multiple-group evaluations of factorial invariance if particular manifest variables are missing completely in certain groups. Ad hoc analytic alternatives can be used in such situations (e.g., deleting manifest variables), but some common approaches, such as multiple imputation, are not viable. At least 3 solutions to this problem are viable: analyzing differing sets of variables across groups, using pattern mixture approaches, and a new method using random number generation. The latter solution, proposed in this article, is to generate pseudo-random normal deviates for all observations for manifest variables that are missing completely in a given sample and then to specify multiple-group models in a way that respects the random nature of these values. An empirical example is presented in detail comparing the 3 approaches. The proposed solution can enable quantitative comparisons at the latent variable level between groups using programs that require the same number of manifest variables in each group.

  5. Missing data in randomized clinical trials for weight loss: scope of the problem, state of the field, and performance of statistical methods.

    Directory of Open Access Journals (Sweden)

    Mai A Elobeid

    2009-08-01

    Full Text Available Dropouts and missing data are nearly-ubiquitous in obesity randomized controlled trails, threatening validity and generalizability of conclusions. Herein, we meta-analytically evaluate the extent of missing data, the frequency with which various analytic methods are employed to accommodate dropouts, and the performance of multiple statistical methods.We searched PubMed and Cochrane databases (2000-2006 for articles published in English and manually searched bibliographic references. Articles of pharmaceutical randomized controlled trials with weight loss or weight gain prevention as major endpoints were included. Two authors independently reviewed each publication for inclusion. 121 articles met the inclusion criteria. Two authors independently extracted treatment, sample size, drop-out rates, study duration, and statistical method used to handle missing data from all articles and resolved disagreements by consensus. In the meta-analysis, drop-out rates were substantial with the survival (non-dropout rates being approximated by an exponential decay curve (e(-lambdat where lambda was estimated to be .0088 (95% bootstrap confidence interval: .0076 to .0100 and t represents time in weeks. The estimated drop-out rate at 1 year was 37%. Most studies used last observation carried forward as the primary analytic method to handle missing data. We also obtained 12 raw obesity randomized controlled trial datasets for empirical analyses. Analyses of raw randomized controlled trial data suggested that both mixed models and multiple imputation performed well, but that multiple imputation may be more robust when missing data are extensive.Our analysis offers an equation for predictions of dropout rates useful for future study planning. Our raw data analyses suggests that multiple imputation is better than other methods for handling missing data in obesity randomized controlled trials, followed closely by mixed models. We suggest these methods supplant last

  6. A methodology for treating missing data applied to daily rainfall data in the Candelaro River Basin (Italy).

    Science.gov (United States)

    Lo Presti, Rossella; Barca, Emanuele; Passarella, Giuseppe

    2010-01-01

    Environmental time series are often affected by the "presence" of missing data, but when dealing statistically with data, the need to fill in the gaps estimating the missing values must be considered. At present, a large number of statistical techniques are available to achieve this objective; they range from very simple methods, such as using the sample mean, to very sophisticated ones, such as multiple imputation. A brand new methodology for missing data estimation is proposed, which tries to merge the obvious advantages of the simplest techniques (e.g. their vocation to be easily implemented) with the strength of the newest techniques. The proposed method consists in the application of two consecutive stages: once it has been ascertained that a specific monitoring station is affected by missing data, the "most similar" monitoring stations are identified among neighbouring stations on the basis of a suitable similarity coefficient; in the second stage, a regressive method is applied in order to estimate the missing data. In this paper, four different regressive methods are applied and compared, in order to determine which is the most reliable for filling in the gaps, using rainfall data series measured in the Candelaro River Basin located in South Italy.

  7. Estimating the accuracy of geographical imputation

    Directory of Open Access Journals (Sweden)

    Boscoe Francis P

    2008-01-01

    Full Text Available Abstract Background To reduce the number of non-geocoded cases researchers and organizations sometimes include cases geocoded to postal code centroids along with cases geocoded with the greater precision of a full street address. Some analysts then use the postal code to assign information to the cases from finer-level geographies such as a census tract. Assignment is commonly completed using either a postal centroid or by a geographical imputation method which assigns a location by using both the demographic characteristics of the case and the population characteristics of the postal delivery area. To date no systematic evaluation of geographical imputation methods ("geo-imputation" has been completed. The objective of this study was to determine the accuracy of census tract assignment using geo-imputation. Methods Using a large dataset of breast, prostate and colorectal cancer cases reported to the New Jersey Cancer Registry, we determined how often cases were assigned to the correct census tract using alternate strategies of demographic based geo-imputation, and using assignments obtained from postal code centroids. Assignment accuracy was measured by comparing the tract assigned with the tract originally identified from the full street address. Results Assigning cases to census tracts using the race/ethnicity population distribution within a postal code resulted in more correctly assigned cases than when using postal code centroids. The addition of age characteristics increased the match rates even further. Match rates were highly dependent on both the geographic distribution of race/ethnicity groups and population density. Conclusion Geo-imputation appears to offer some advantages and no serious drawbacks as compared with the alternative of assigning cases to census tracts based on postal code centroids. For a specific analysis, researchers will still need to consider the potential impact of geocoding quality on their results and evaluate

  8. Missing data in forest ecology and management: advances in quantitative methods [Preface

    Science.gov (United States)

    Tara Barrett; Matti Maltomo

    2012-01-01

    In recent years, substantial progress has been made for handling missing data issues for applications in forest ecology and management, particularly in the area of imputation techniques. A session on this topic was held at the XXlll IUFRO World Congress in Seoul, South Korea, on August 23-28, 2010, resulting in this special issue of six papers that address recent...

  9. Villa Marie Nursing Home, Grange, Templemore Road, Roscrea, Tipperary.

    LENUS (Irish Health Repository)

    Hardouin, Jean-Benoit

    2011-07-14

    Abstract Background Nowadays, more and more clinical scales consisting in responses given by the patients to some items (Patient Reported Outcomes - PRO), are validated with models based on Item Response Theory, and more specifically, with a Rasch model. In the validation sample, presence of missing data is frequent. The aim of this paper is to compare sixteen methods for handling the missing data (mainly based on simple imputation) in the context of psychometric validation of PRO by a Rasch model. The main indexes used for validation by a Rasch model are compared. Methods A simulation study was performed allowing to consider several cases, notably the possibility for the missing values to be informative or not and the rate of missing data. Results Several imputations methods produce bias on psychometrical indexes (generally, the imputation methods artificially improve the psychometric qualities of the scale). In particular, this is the case with the method based on the Personal Mean Score (PMS) which is the most commonly used imputation method in practice. Conclusions Several imputation methods should be avoided, in particular PMS imputation. From a general point of view, it is important to use an imputation method that considers both the ability of the patient (measured for example by his\\/her score), and the difficulty of the item (measured for example by its rate of favourable responses). Another recommendation is to always consider the addition of a random process in the imputation method, because such a process allows reducing the bias. Last, the analysis realized without imputation of the missing data (available case analyses) is an interesting alternative to the simple imputation in this context.

  10. Building Chaotic Model From Incomplete Time Series

    Science.gov (United States)

    Siek, Michael; Solomatine, Dimitri

    2010-05-01

    This paper presents a number of novel techniques for building a predictive chaotic model from incomplete time series. A predictive chaotic model is built by reconstructing the time-delayed phase space from observed time series and the prediction is made by a global model or adaptive local models based on the dynamical neighbors found in the reconstructed phase space. In general, the building of any data-driven models depends on the completeness and quality of the data itself. However, the completeness of the data availability can not always be guaranteed since the measurement or data transmission is intermittently not working properly due to some reasons. We propose two main solutions dealing with incomplete time series: using imputing and non-imputing methods. For imputing methods, we utilized the interpolation methods (weighted sum of linear interpolations, Bayesian principle component analysis and cubic spline interpolation) and predictive models (neural network, kernel machine, chaotic model) for estimating the missing values. After imputing the missing values, the phase space reconstruction and chaotic model prediction are executed as a standard procedure. For non-imputing methods, we reconstructed the time-delayed phase space from observed time series with missing values. This reconstruction results in non-continuous trajectories. However, the local model prediction can still be made from the other dynamical neighbors reconstructed from non-missing values. We implemented and tested these methods to construct a chaotic model for predicting storm surges at Hoek van Holland as the entrance of Rotterdam Port. The hourly surge time series is available for duration of 1990-1996. For measuring the performance of the proposed methods, a synthetic time series with missing values generated by a particular random variable to the original (complete) time series is utilized. There exist two main performance measures used in this work: (1) error measures between the actual

  11. The comparison of cardiovascular risk scores using two methods of substituting missing risk factor data in patient medical records

    Directory of Open Access Journals (Sweden)

    Andrew Dalton

    2011-07-01

    Conclusions A simple method of substituting missing risk factor data can produce reliable estimates of CVD risk scores. Targeted screening for high CVD risk, using pre-existing electronic medical record data, does not require multiple imputation methods in risk estimation.

  12. mvp - an open-source preprocessor for cleaning duplicate records and missing values in mass spectrometry data.

    Science.gov (United States)

    Lee, Geunho; Lee, Hyun Beom; Jung, Byung Hwa; Nam, Hojung

    2017-07-01

    Mass spectrometry (MS) data are used to analyze biological phenomena based on chemical species. However, these data often contain unexpected duplicate records and missing values due to technical or biological factors. These 'dirty data' problems increase the difficulty of performing MS analyses because they lead to performance degradation when statistical or machine-learning tests are applied to the data. Thus, we have developed missing values preprocessor (mvp), an open-source software for preprocessing data that might include duplicate records and missing values. mvp uses the property of MS data in which identical chemical species present the same or similar values for key identifiers, such as the mass-to-charge ratio and intensity signal, and forms cliques via graph theory to process dirty data. We evaluated the validity of the mvp process via quantitative and qualitative analyses and compared the results from a statistical test that analyzed the original and mvp-applied data. This analysis showed that using mvp reduces problems associated with duplicate records and missing values. We also examined the effects of using unprocessed data in statistical tests and examined the improved statistical test results obtained with data preprocessed using mvp.

  13. Imputation and quality control steps for combining multiple genome-wide datasets

    Directory of Open Access Journals (Sweden)

    Shefali S Verma

    2014-12-01

    Full Text Available The electronic MEdical Records and GEnomics (eMERGE network brings together DNA biobanks linked to electronic health records (EHRs from multiple institutions. Approximately 52,000 DNA samples from distinct individuals have been genotyped using genome-wide SNP arrays across the nine sites of the network. The eMERGE Coordinating Center and the Genomics Workgroup developed a pipeline to impute and merge genomic data across the different SNP arrays to maximize sample size and power to detect associations with a variety of clinical endpoints. The 1000 Genomes cosmopolitan reference panel was used for imputation. Imputation results were evaluated using the following metrics: accuracy of imputation, allelic R2 (estimated correlation between the imputed and true genotypes, and the relationship between allelic R2 and minor allele frequency. Computation time and memory resources required by two different software packages (BEAGLE and IMPUTE2 were also evaluated. A number of challenges were encountered due to the complexity of using two different imputation software packages, multiple ancestral populations, and many different genotyping platforms. We present lessons learned and describe the pipeline implemented here to impute and merge genomic data sets. The eMERGE imputed dataset will serve as a valuable resource for discovery, leveraging the clinical data that can be mined from the EHR.

  14. Results of Database Studies in Spine Surgery Can Be Influenced by Missing Data.

    Science.gov (United States)

    Basques, Bryce A; McLynn, Ryan P; Fice, Michael P; Samuel, Andre M; Lukasiewicz, Adam M; Bohl, Daniel D; Ahn, Junyoung; Singh, Kern; Grauer, Jonathan N

    2017-12-01

    was computed and imputed. In the third regression, any variables with > 10% rate of missing data were removed from the regression; among variables with ≤ 10% missing data, individual cases with missing values were excluded. The results of these regressions were compared to determine how the different treatments of missing data could affect the results of spine studies using the ACS-NSQIP database. Of the 88,471 patients, as many as 4441 (5%) had missing elements among demographic data, 69,184 (72%) among comorbidities, 70,892 (80%) among preoperative laboratory values, and 56,551 (64%) among operating room times. Considering the three different treatments of missing data, we found different risk factors for adverse events. Of 44 risk factors found to be associated with adverse events in any analysis, only 15 (34%) of these risk factors were common among the three regressions. The second treatment of missing data (assuming "normal" value) found the most risk factors (40) to be associated with any adverse event, whereas the first treatment (deleting patients with missing data) found the fewest associations at 20. Among the risk factors associated with any adverse event, the 10 with the greatest effect size (odds ratio) by each regression were ranked. Of the 15 variables in the top 10 for any regression, six of these were common among all three lists. Differing treatments of missing data can influence the results of spine studies using the ACS-NSQIP. The current study highlights the importance of considering how such missing data are handled. Until there are better guidelines on the best approaches to handle missing data, investigators should report how missing data were handled to increase the quality and transparency of orthopaedic database research. Readers of large database studies should note whether handling of missing data was addressed and consider potential bias with high rates or unspecified or weak methods for handling missing data.

  15. GACT: a Genome build and Allele definition Conversion Tool for SNP imputation and meta-analysis in genetic association studies.

    Science.gov (United States)

    Sulovari, Arvis; Li, Dawei

    2014-07-19

    Genome-wide association studies (GWAS) have successfully identified genes associated with complex human diseases. Although much of the heritability remains unexplained, combining single nucleotide polymorphism (SNP) genotypes from multiple studies for meta-analysis will increase the statistical power to identify new disease-associated variants. Meta-analysis requires same allele definition (nomenclature) and genome build among individual studies. Similarly, imputation, commonly-used prior to meta-analysis, requires the same consistency. However, the genotypes from various GWAS are generated using different genotyping platforms, arrays or SNP-calling approaches, resulting in use of different genome builds and allele definitions. Incorrect assumptions of identical allele definition among combined GWAS lead to a large portion of discarded genotypes or incorrect association findings. There is no published tool that predicts and converts among all major allele definitions. In this study, we have developed a tool, GACT, which stands for Genome build and Allele definition Conversion Tool, that predicts and inter-converts between any of the common SNP allele definitions and between the major genome builds. In addition, we assessed several factors that may affect imputation quality, and our results indicated that inclusion of singletons in the reference had detrimental effects while ambiguous SNPs had no measurable effect. Unexpectedly, exclusion of genotypes with missing rate > 0.001 (40% of study SNPs) showed no significant decrease of imputation quality (even significantly higher when compared to the imputation with singletons in the reference), especially for rare SNPs. GACT is a new, powerful, and user-friendly tool with both command-line and interactive online versions that can accurately predict, and convert between any of the common allele definitions and between genome builds for genome-wide meta-analysis and imputation of genotypes from SNP-arrays or deep

  16. Performance of bias-correction methods for exposure measurement error using repeated measurements with and without missing data.

    Science.gov (United States)

    Batistatou, Evridiki; McNamee, Roseanne

    2012-12-10

    It is known that measurement error leads to bias in assessing exposure effects, which can however, be corrected if independent replicates are available. For expensive replicates, two-stage (2S) studies that produce data 'missing by design', may be preferred over a single-stage (1S) study, because in the second stage, measurement of replicates is restricted to a sample of first-stage subjects. Motivated by an occupational study on the acute effect of carbon black exposure on respiratory morbidity, we compare the performance of several bias-correction methods for both designs in a simulation study: an instrumental variable method (EVROS IV) based on grouping strategies, which had been recommended especially when measurement error is large, the regression calibration and the simulation extrapolation methods. For the 2S design, either the problem of 'missing' data was ignored or the 'missing' data were imputed using multiple imputations. Both in 1S and 2S designs, in the case of small or moderate measurement error, regression calibration was shown to be the preferred approach in terms of root mean square error. For 2S designs, regression calibration as implemented by Stata software is not recommended in contrast to our implementation of this method; the 'problematic' implementation of regression calibration although substantially improved with use of multiple imputations. The EVROS IV method, under a good/fairly good grouping, outperforms the regression calibration approach in both design scenarios when exposure mismeasurement is severe. Both in 1S and 2S designs with moderate or large measurement error, simulation extrapolation severely failed to correct for bias. Copyright © 2012 John Wiley & Sons, Ltd.

  17. Amplification of DNA mixtures--Missing data approach

    DEFF Research Database (Denmark)

    Tvedebrink, Torben; Eriksen, Poul Svante; Mogensen, Helle Smidt

    2008-01-01

    This paper presents a model for the interpretation of results of STR typing of DNA mixtures based on a multivariate normal distribution of peak areas. From previous analyses of controlled experiments with mixed DNA samples, we exploit the linear relationship between peak heights and peak areas...... DNA samples, it is only possible to observe the cumulative peak heights and areas. Complying with this latent structure, we use the EM-algorithm to impute the missing variables based on a compound symmetry model. That is the measurements are subject to intra- and inter-loci correlations not depending...... on the actual alleles of the DNA profiles. Due to factorization of the likelihood, properties of the normal distribution and use of auxiliary variables, an ordinary implementation of the EM-algorithm solves the missing data problem. We estimate the parameters in the model based on a training data set. In order...

  18. Principle Component Analysis with Incomplete Data: A simulation of R pcaMethods package in Constructing an Environmental Quality Index with Missing Data

    Science.gov (United States)

    Missing data is a common problem in the application of statistical techniques. In principal component analysis (PCA), a technique for dimensionality reduction, incomplete data points are either discarded or imputed using interpolation methods. Such approaches are less valid when ...

  19. Assessing accuracy of genotype imputation in American Indians.

    Directory of Open Access Journals (Sweden)

    Alka Malhotra

    Full Text Available Genotype imputation is commonly used in genetic association studies to test untyped variants using information on linkage disequilibrium (LD with typed markers. Imputing genotypes requires a suitable reference population in which the LD pattern is known, most often one selected from HapMap. However, some populations, such as American Indians, are not represented in HapMap. In the present study, we assessed accuracy of imputation using HapMap reference populations in a genome-wide association study in Pima Indians.Data from six randomly selected chromosomes were used. Genotypes in the study population were masked (either 1% or 20% of SNPs available for a given chromosome. The masked genotypes were then imputed using the software Markov Chain Haplotyping Algorithm. Using four HapMap reference populations, average genotype error rates ranged from 7.86% for Mexican Americans to 22.30% for Yoruba. In contrast, use of the original Pima Indian data as a reference resulted in an average error rate of 1.73%.Our results suggest that the use of HapMap reference populations results in substantial inaccuracy in the imputation of genotypes in American Indians. A possible solution would be to densely genotype or sequence a reference American Indian population.

  20. Highly accurate sequence imputation enables precise QTL mapping in Brown Swiss cattle.

    Science.gov (United States)

    Frischknecht, Mirjam; Pausch, Hubert; Bapst, Beat; Signer-Hasler, Heidi; Flury, Christine; Garrick, Dorian; Stricker, Christian; Fries, Ruedi; Gredler-Grandl, Birgit

    2017-12-29

    Within the last few years a large amount of genomic information has become available in cattle. Densities of genomic information vary from a few thousand variants up to whole genome sequence information. In order to combine genomic information from different sources and infer genotypes for a common set of variants, genotype imputation is required. In this study we evaluated the accuracy of imputation from high density chips to whole genome sequence data in Brown Swiss cattle. Using four popular imputation programs (Beagle, FImpute, Impute2, Minimac) and various compositions of reference panels, the accuracy of the imputed sequence variant genotypes was high and differences between the programs and scenarios were small. We imputed sequence variant genotypes for more than 1600 Brown Swiss bulls and performed genome-wide association studies for milk fat percentage at two stages of lactation. We found one and three quantitative trait loci for early and late lactation fat content, respectively. Known causal variants that were imputed from the sequenced reference panel were among the most significantly associated variants of the genome-wide association study. Our study demonstrates that whole-genome sequence information can be imputed at high accuracy in cattle populations. Using imputed sequence variant genotypes in genome-wide association studies may facilitate causal variant detection.

  1. Missing data in trial-based cost-effectiveness analysis: An incomplete journey.

    Science.gov (United States)

    Leurent, Baptiste; Gomes, Manuel; Carpenter, James R

    2018-06-01

    Cost-effectiveness analyses (CEA) conducted alongside randomised trials provide key evidence for informing healthcare decision making, but missing data pose substantive challenges. Recently, there have been a number of developments in methods and guidelines addressing missing data in trials. However, it is unclear whether these developments have permeated CEA practice. This paper critically reviews the extent of and methods used to address missing data in recently published trial-based CEA. Issues of the Health Technology Assessment journal from 2013 to 2015 were searched. Fifty-two eligible studies were identified. Missing data were very common; the median proportion of trial participants with complete cost-effectiveness data was 63% (interquartile range: 47%-81%). The most common approach for the primary analysis was to restrict analysis to those with complete data (43%), followed by multiple imputation (30%). Half of the studies conducted some sort of sensitivity analyses, but only 2 (4%) considered possible departures from the missing-at-random assumption. Further improvements are needed to address missing data in cost-effectiveness analyses conducted alongside randomised trials. These should focus on limiting the extent of missing data, choosing an appropriate method for the primary analysis that is valid under contextually plausible assumptions, and conducting sensitivity analyses to departures from the missing-at-random assumption. © 2018 The Authors Health Economics published by John Wiley & Sons Ltd.

  2. Managing missing scores on the Roland Morris Disability Questionnaire

    DEFF Research Database (Denmark)

    Lauridsen, Henrik Hein

    2009-01-01

    systematically dropped from each person’s raw scores and the standardized score was proportionally recalculated. This process was repeated until 6 questions had been dropped from each person’s questionnaire. ·         The error (absolute and percentage) introduced by each level of dropped question was calculated......MANAGING MISSING SCORES ON THE ROLAND MORRIS DISABILITY QUESTIONNAIRE  Peter Kenta and Henrik Hein Lauridsenb  aBack Research Centre and bInstitute of Sports Science and Clinical Biomechanics, University of Southern Denmark Background There is no standard method to calculate Roland Morris...... Disability Questionnaire (RMDQ) sum scores when one or more questions have not been answered. However, missing data are common on the RMDQ and the current options are: calculate a sum score regardless of unanswered questions, reject all data containing unanswered questions, or to impute scores. Other...

  3. Tensor Completion for Estimating Missing Values in Visual Data

    KAUST Repository

    Liu, Ji

    2012-01-25

    In this paper, we propose an algorithm to estimate missing values in tensors of visual data. The values can be missing due to problems in the acquisition process or because the user manually identified unwanted outliers. Our algorithm works even with a small amount of samples and it can propagate structure to fill larger missing regions. Our methodology is built on recent studies about matrix completion using the matrix trace norm. The contribution of our paper is to extend the matrix case to the tensor case by proposing the first definition of the trace norm for tensors and then by building a working algorithm. First, we propose a definition for the tensor trace norm that generalizes the established definition of the matrix trace norm. Second, similarly to matrix completion, the tensor completion is formulated as a convex optimization problem. Unfortunately, the straightforward problem extension is significantly harder to solve than the matrix case because of the dependency among multiple constraints. To tackle this problem, we developed three algorithms: simple low rank tensor completion (SiLRTC), fast low rank tensor completion (FaLRTC), and high accuracy low rank tensor completion (HaLRTC). The SiLRTC algorithm is simple to implement and employs a relaxation technique to separate the dependant relationships and uses the block coordinate descent (BCD) method to achieve a globally optimal solution; the FaLRTC algorithm utilizes a smoothing scheme to transform the original nonsmooth problem into a smooth one and can be used to solve a general tensor trace norm minimization problem; the HaLRTC algorithm applies the alternating direction method of multipliers (ADMMs) to our problem. Our experiments show potential applications of our algorithms and the quantitative evaluation indicates that our methods are more accurate and robust than heuristic approaches. The efficiency comparison indicates that FaLTRC and HaLRTC are more efficient than SiLRTC and between Fa

  4. Tensor Completion for Estimating Missing Values in Visual Data

    KAUST Repository

    Liu, Ji; Musialski, Przemyslaw; Wonka, Peter; Ye, Jieping

    2012-01-01

    In this paper, we propose an algorithm to estimate missing values in tensors of visual data. The values can be missing due to problems in the acquisition process or because the user manually identified unwanted outliers. Our algorithm works even with a small amount of samples and it can propagate structure to fill larger missing regions. Our methodology is built on recent studies about matrix completion using the matrix trace norm. The contribution of our paper is to extend the matrix case to the tensor case by proposing the first definition of the trace norm for tensors and then by building a working algorithm. First, we propose a definition for the tensor trace norm that generalizes the established definition of the matrix trace norm. Second, similarly to matrix completion, the tensor completion is formulated as a convex optimization problem. Unfortunately, the straightforward problem extension is significantly harder to solve than the matrix case because of the dependency among multiple constraints. To tackle this problem, we developed three algorithms: simple low rank tensor completion (SiLRTC), fast low rank tensor completion (FaLRTC), and high accuracy low rank tensor completion (HaLRTC). The SiLRTC algorithm is simple to implement and employs a relaxation technique to separate the dependant relationships and uses the block coordinate descent (BCD) method to achieve a globally optimal solution; the FaLRTC algorithm utilizes a smoothing scheme to transform the original nonsmooth problem into a smooth one and can be used to solve a general tensor trace norm minimization problem; the HaLRTC algorithm applies the alternating direction method of multipliers (ADMMs) to our problem. Our experiments show potential applications of our algorithms and the quantitative evaluation indicates that our methods are more accurate and robust than heuristic approaches. The efficiency comparison indicates that FaLTRC and HaLRTC are more efficient than SiLRTC and between Fa

  5. Tensor completion for estimating missing values in visual data.

    Science.gov (United States)

    Liu, Ji; Musialski, Przemyslaw; Wonka, Peter; Ye, Jieping

    2013-01-01

    In this paper, we propose an algorithm to estimate missing values in tensors of visual data. The values can be missing due to problems in the acquisition process or because the user manually identified unwanted outliers. Our algorithm works even with a small amount of samples and it can propagate structure to fill larger missing regions. Our methodology is built on recent studies about matrix completion using the matrix trace norm. The contribution of our paper is to extend the matrix case to the tensor case by proposing the first definition of the trace norm for tensors and then by building a working algorithm. First, we propose a definition for the tensor trace norm that generalizes the established definition of the matrix trace norm. Second, similarly to matrix completion, the tensor completion is formulated as a convex optimization problem. Unfortunately, the straightforward problem extension is significantly harder to solve than the matrix case because of the dependency among multiple constraints. To tackle this problem, we developed three algorithms: simple low rank tensor completion (SiLRTC), fast low rank tensor completion (FaLRTC), and high accuracy low rank tensor completion (HaLRTC). The SiLRTC algorithm is simple to implement and employs a relaxation technique to separate the dependent relationships and uses the block coordinate descent (BCD) method to achieve a globally optimal solution; the FaLRTC algorithm utilizes a smoothing scheme to transform the original nonsmooth problem into a smooth one and can be used to solve a general tensor trace norm minimization problem; the HaLRTC algorithm applies the alternating direction method of multipliers (ADMMs) to our problem. Our experiments show potential applications of our algorithms and the quantitative evaluation indicates that our methods are more accurate and robust than heuristic approaches. The efficiency comparison indicates that FaLTRC and HaLRTC are more efficient than SiLRTC and between FaLRTC an

  6. Sensitivity Analysis for Not-at-Random Missing Data in Trial-Based Cost-Effectiveness Analysis: A Tutorial.

    Science.gov (United States)

    Leurent, Baptiste; Gomes, Manuel; Faria, Rita; Morris, Stephen; Grieve, Richard; Carpenter, James R

    2018-04-20

    Cost-effectiveness analyses (CEA) of randomised controlled trials are a key source of information for health care decision makers. Missing data are, however, a common issue that can seriously undermine their validity. A major concern is that the chance of data being missing may be directly linked to the unobserved value itself [missing not at random (MNAR)]. For example, patients with poorer health may be less likely to complete quality-of-life questionnaires. However, the extent to which this occurs cannot be ascertained from the data at hand. Guidelines recommend conducting sensitivity analyses to assess the robustness of conclusions to plausible MNAR assumptions, but this is rarely done in practice, possibly because of a lack of practical guidance. This tutorial aims to address this by presenting an accessible framework and practical guidance for conducting sensitivity analysis for MNAR data in trial-based CEA. We review some of the methods for conducting sensitivity analysis, but focus on one particularly accessible approach, where the data are multiply-imputed and then modified to reflect plausible MNAR scenarios. We illustrate the implementation of this approach on a weight-loss trial, providing the software code. We then explore further issues around its use in practice.

  7. Moderation analysis with missing data in the predictors.

    Science.gov (United States)

    Zhang, Qian; Wang, Lijuan

    2017-12-01

    The most widely used statistical model for conducting moderation analysis is the moderated multiple regression (MMR) model. In MMR modeling, missing data could pose a challenge, mainly because the interaction term is a product of two or more variables and thus is a nonlinear function of the involved variables. In this study, we consider a simple MMR model, where the effect of the focal predictor X on the outcome Y is moderated by a moderator U. The primary interest is to find ways of estimating and testing the moderation effect with the existence of missing data in X. We mainly focus on cases when X is missing completely at random (MCAR) and missing at random (MAR). Three methods are compared: (a) Normal-distribution-based maximum likelihood estimation (NML); (b) Normal-distribution-based multiple imputation (NMI); and (c) Bayesian estimation (BE). Via simulations, we found that NML and NMI could lead to biased estimates of moderation effects under MAR missingness mechanism. The BE method outperformed NMI and NML for MMR modeling with missing data in the focal predictor, missingness depending on the moderator and/or auxiliary variables, and correctly specified distributions for the focal predictor. In addition, more robust BE methods are needed in terms of the distribution mis-specification problem of the focal predictor. An empirical example was used to illustrate the applications of the methods with a simple sensitivity analysis. (PsycINFO Database Record (c) 2017 APA, all rights reserved).

  8. Use of missing data methods in longitudinal studies: the persistence of bad practices in developmental psychology.

    Science.gov (United States)

    Jelicić, Helena; Phelps, Erin; Lerner, Richard M

    2009-07-01

    Developmental science rests on describing, explaining, and optimizing intraindividual changes and, hence, empirically requires longitudinal research. Problems of missing data arise in most longitudinal studies, thus creating challenges for interpreting the substance and structure of intraindividual change. Using a sample of reports of longitudinal studies obtained from three flagship developmental journals-Child Development, Developmental Psychology, and Journal of Research on Adolescence-we examined the number of longitudinal studies reporting missing data and the missing data techniques used. Of the 100 longitudinal studies sampled, 57 either reported having missing data or had discrepancies in sample sizes reported for different analyses. The majority of these studies (82%) used missing data techniques that are statistically problematic (either listwise deletion or pairwise deletion) and not among the methods recommended by statisticians (i.e., the direct maximum likelihood method and the multiple imputation method). Implications of these results for developmental theory and application, and the need for understanding the consequences of using statistically inappropriate missing data techniques with actual longitudinal data sets, are discussed.

  9. Performance of genotype imputation for low frequency and rare variants from the 1000 genomes.

    Science.gov (United States)

    Zheng, Hou-Feng; Rong, Jing-Jing; Liu, Ming; Han, Fang; Zhang, Xing-Wei; Richards, J Brent; Wang, Li

    2015-01-01

    Genotype imputation is now routinely applied in genome-wide association studies (GWAS) and meta-analyses. However, most of the imputations have been run using HapMap samples as reference, imputation of low frequency and rare variants (minor allele frequency (MAF) 1000 Genomes panel) are available to facilitate imputation of these variants. Therefore, in order to estimate the performance of low frequency and rare variants imputation, we imputed 153 individuals, each of whom had 3 different genotype array data including 317k, 610k and 1 million SNPs, to three different reference panels: the 1000 Genomes pilot March 2010 release (1KGpilot), the 1000 Genomes interim August 2010 release (1KGinterim), and the 1000 Genomes phase1 November 2010 and May 2011 release (1KGphase1) by using IMPUTE version 2. The differences between these three releases of the 1000 Genomes data are the sample size, ancestry diversity, number of variants and their frequency spectrum. We found that both reference panel and GWAS chip density affect the imputation of low frequency and rare variants. 1KGphase1 outperformed the other 2 panels, at higher concordance rate, higher proportion of well-imputed variants (info>0.4) and higher mean info score in each MAF bin. Similarly, 1M chip array outperformed 610K and 317K. However for very rare variants (MAF ≤ 0.3%), only 0-1% of the variants were well imputed. We conclude that the imputation of low frequency and rare variants improves with larger reference panels and higher density of genome-wide genotyping arrays. Yet, despite a large reference panel size and dense genotyping density, very rare variants remain difficult to impute.

  10. Public Undertakings and Imputability

    DEFF Research Database (Denmark)

    Ølykke, Grith Skovgaard

    2013-01-01

    In this article, the issue of impuability to the State of public undertakings’ decision-making is analysed and discussed in the context of the DSBFirst case. DSBFirst is owned by the independent public undertaking DSB and the private undertaking FirstGroup plc and won the contracts in the 2008...... Oeresund tender for the provision of passenger transport by railway. From the start, the services were provided at a loss, and in the end a part of DSBFirst was wound up. In order to frame the problems illustrated by this case, the jurisprudence-based imputability requirement in the definition of State aid...... in Article 107(1) TFEU is analysed. It is concluded that where the public undertaking transgresses the control system put in place by the State, conditions for imputability are not fulfilled, and it is argued that in the current state of law, there is no conditional link between the level of control...

  11. Missing data in trauma registries: A systematic review.

    Science.gov (United States)

    Shivasabesan, Gowri; Mitra, Biswadev; O'Reilly, Gerard M

    2018-03-30

    Trauma registries play an integral role in trauma systems but their valid use hinges on data quality. The aim of this study was to determine, among contemporary publications using trauma registry data, the level of reporting of data completeness and the methods used to deal with missing data. A systematic review was conducted of all trauma registry-based manuscripts published from 01 January 2015 to current date (17 March 2017). Studies were identified by searching MEDLINE, EMBASE, and CINAHL using relevant subject headings and keywords. Included manuscripts were evaluated based on previously published recommendations regarding the reporting and discussion of missing data. Manuscripts were graded on their degree of characterization of such observations. In addition, the methods used to manage missing data were examined. There were 539 manuscripts that met inclusion criteria. Among these, 208 (38.6%) manuscripts did not mention data completeness and 88 (16.3%) mentioned missing data but did not quantify the extent. Only a handful (n = 26; 4.8%) quantified the 'missingness' of all variables. Most articles (n = 477; 88.5%) contained no details such as a comparison between patient characteristics in cohorts with and without missing data. Of the 331 articles which made at least some mention of data completeness, the method of managing missing data was unknown in 34 (10.3%). When method(s) to handle missing data were identified, 234 (78.8%) manuscripts used complete case analysis only, 18 (6.1%) used multiple imputation only and 34 (11.4%) used a combination of these. Most manuscripts using trauma registry data did not quantify the extent of missing data for any variables and contained minimal discussion regarding missingness. Out of the studies which identified a method of managing missing data, most used complete case analysis, a method that may bias results. The lack of standardization in the reporting and management of missing data questions the validity of

  12. Evaluation and application of summary statistic imputation to discover new height-associated loci.

    Science.gov (United States)

    Rüeger, Sina; McDaid, Aaron; Kutalik, Zoltán

    2018-05-01

    As most of the heritability of complex traits is attributed to common and low frequency genetic variants, imputing them by combining genotyping chips and large sequenced reference panels is the most cost-effective approach to discover the genetic basis of these traits. Association summary statistics from genome-wide meta-analyses are available for hundreds of traits. Updating these to ever-increasing reference panels is very cumbersome as it requires reimputation of the genetic data, rerunning the association scan, and meta-analysing the results. A much more efficient method is to directly impute the summary statistics, termed as summary statistics imputation, which we improved to accommodate variable sample size across SNVs. Its performance relative to genotype imputation and practical utility has not yet been fully investigated. To this end, we compared the two approaches on real (genotyped and imputed) data from 120K samples from the UK Biobank and show that, genotype imputation boasts a 3- to 5-fold lower root-mean-square error, and better distinguishes true associations from null ones: We observed the largest differences in power for variants with low minor allele frequency and low imputation quality. For fixed false positive rates of 0.001, 0.01, 0.05, using summary statistics imputation yielded a decrease in statistical power by 9, 43 and 35%, respectively. To test its capacity to discover novel associations, we applied summary statistics imputation to the GIANT height meta-analysis summary statistics covering HapMap variants, and identified 34 novel loci, 19 of which replicated using data in the UK Biobank. Additionally, we successfully replicated 55 out of the 111 variants published in an exome chip study. Our study demonstrates that summary statistics imputation is a very efficient and cost-effective way to identify and fine-map trait-associated loci. Moreover, the ability to impute summary statistics is important for follow-up analyses, such as Mendelian

  13. SOFTWARE EFFORT PREDICTION: AN EMPIRICAL EVALUATION OF METHODS TO TREAT MISSING VALUES WITH RAPIDMINER ®

    OpenAIRE

    OLGA FEDOTOVA; GLADYS CASTILLO; LEONOR TEIXEIRA; HELENA ALVELOS

    2011-01-01

    Missing values is a common problem in the data analysis in all areas, being software engineering not an exception. Particularly, missing data is a widespread phenomenon observed during the elaboration of effort prediction models (EPMs) required for budget, time and functionalities planning. Current work presents the results of a study carried out on a Portuguese medium-sized software development organization in order to obtain a formal method for EPMs elicitation in development processes. Thi...

  14. Imputed prices of greenhouse gases and land forests

    International Nuclear Information System (INIS)

    Uzawa, Hirofumi

    1993-01-01

    The theory of dynamic optimum formulated by Maeler gives us the basic theoretical framework within which it is possible to analyse the economic and, possibly, political circumstances under which the phenomenon of global warming occurs, and to search for the policy and institutional arrangements whereby it would be effectively arrested. The analysis developed here is an application of Maeler's theory to atmospheric quality. In the analysis a central role is played by the concept of imputed price in the dynamic context. Our determination of imputed prices of atmospheric carbon dioxide and land forests takes into account the difference in the stages of economic development. Indeed, the ratios of the imputed prices of atmospheric carbon dioxide and land forests over the per capita level of real national income are identical for all countries involved. (3 figures, 2 tables) (Author)

  15. Imputation of Baseline LDL Cholesterol Concentration in Patients with Familial Hypercholesterolemia on Statins or Ezetimibe.

    Science.gov (United States)

    Ruel, Isabelle; Aljenedil, Sumayah; Sadri, Iman; de Varennes, Émilie; Hegele, Robert A; Couture, Patrick; Bergeron, Jean; Wanneh, Eric; Baass, Alexis; Dufour, Robert; Gaudet, Daniel; Brisson, Diane; Brunham, Liam R; Francis, Gordon A; Cermakova, Lubomira; Brophy, James M; Ryomoto, Arnold; Mancini, G B John; Genest, Jacques

    2018-02-01

    Familial hypercholesterolemia (FH) is the most frequent genetic disorder seen clinically and is characterized by increased LDL cholesterol (LDL-C) (>95th percentile), family history of increased LDL-C, premature atherosclerotic cardiovascular disease (ASCVD) in the patient or in first-degree relatives, presence of tendinous xanthomas or premature corneal arcus, or presence of a pathogenic mutation in the LDLR , PCSK9 , or APOB genes. A diagnosis of FH has important clinical implications with respect to lifelong risk of ASCVD and requirement for intensive pharmacological therapy. The concentration of baseline LDL-C (untreated) is essential for the diagnosis of FH but is often not available because the individual is already on statin therapy. To validate a new algorithm to impute baseline LDL-C, we examined 1297 patients. The baseline LDL-C was compared with the imputed baseline obtained within 18 months of the initiation of therapy. We compared the percent reduction in LDL-C on treatment from baseline with the published percent reductions. After eliminating individuals with missing data, nonstandard doses of statins, or medications other than statins or ezetimibe, we provide data on 951 patients. The mean ± SE baseline LDL-C was 243.0 (2.2) mg/dL [6.28 (0.06) mmol/L], and the mean ± SE imputed baseline LDL-C was 244.2 (2.6) mg/dL [6.31 (0.07) mmol/L] ( P = 0.48). There was no difference in response according to the patient's sex or in percent reduction between observed and expected for individual doses or types of statin or ezetimibe. We provide a validated estimation of baseline LDL-C for patients with FH that may help clinicians in making a diagnosis. © 2017 American Association for Clinical Chemistry.

  16. 48 CFR 1830.7002-4 - Determining imputed cost of money.

    Science.gov (United States)

    2010-10-01

    ... money. 1830.7002-4 Section 1830.7002-4 Federal Acquisition Regulations System NATIONAL AERONAUTICS AND... Determining imputed cost of money. (a) Determine the imputed cost of money for an asset under construction, fabrication, or development by applying a cost of money rate (see 1830.7002-2) to the representative...

  17. Mapping change of older forest with nearest-neighbor imputation and Landsat time-series

    Science.gov (United States)

    Janet L. Ohmann; Matthew J. Gregory; Heather M. Roberts; Warren B. Cohen; Robert E. Kennedy; Zhiqiang. Yang

    2012-01-01

    The Northwest Forest Plan (NWFP), which aims to conserve late-successional and old-growth forests (older forests) and associated species, established new policies on federal lands in the Pacific Northwest USA. As part of monitoring for the NWFP, we tested nearest-neighbor imputation for mapping change in older forest, defined by threshold values for forest attributes...

  18. Genotype Imputation for Latinos Using the HapMap and 1000 Genomes Project Reference Panels

    Directory of Open Access Journals (Sweden)

    Xiaoyi eGao

    2012-06-01

    Full Text Available Genotype imputation is a vital tool in genome-wide association studies (GWAS and meta-analyses of multiple GWAS results. Imputation enables researchers to increase genomic coverage and to pool data generated using different genotyping platforms. HapMap samples are often employed as the reference panel. More recently, the 1000 Genomes Project resource is becoming the primary source for reference panels. Multiple GWAS and meta-analyses are targeting Latinos, the most populous and fastest growing minority group in the US. However, genotype imputation resources for Latinos are rather limited compared to individuals of European ancestry at present, largely because of the lack of good reference data. One choice of reference panel for Latinos is one derived from the population of Mexican individuals in Los Angeles contained in the HapMap Phase 3 project and the 1000 Genomes Project. However, a detailed evaluation of the quality of the imputed genotypes derived from the public reference panels has not yet been reported. Using simulation studies, the Illumina OmniExpress GWAS data from the Los Angles Latino Eye Study and the MACH software package, we evaluated the accuracy of genotype imputation in Latinos. Our results show that the 1000 Genomes Project AMR+CEU+YRI reference panel provides the highest imputation accuracy for Latinos, and that also including Asian samples in the panel can reduce imputation accuracy. We also provide the imputation accuracy for each autosomal chromosome using the 1000 Genomes Project panel for Latinos. Our results serve as a guide to future imputation-based analysis in Latinos.

  19. Imputation of genotypes in Danish two-way crossbred pigs using low density panels

    DEFF Research Database (Denmark)

    Xiang, Tao; Christensen, Ole Fredslund; Legarra, Andres

    Genotype imputation is commonly used as an initial step of genomic selection. Studies on humans, plants and ruminants suggested many factors would affect the performance of imputation. However, studies rarely investigated pigs, especially crossbred pigs. In this study, different scenarios...... of imputation from 5K SNPs to 7K SNPs on Danish Landrace, Yorkshire, and crossbred Landrace-Yorkshire were compared. In conclusion, genotype imputation on crossbreds performs equally well as in purebreds, when parental breeds are used as the reference panel. When the size of reference is considerably large...... SNPs. This dataset will be analyzed for genomic selection in a future study...

  20. Analyzing the Impacts of Alternated Number of Iterations in Multiple Imputation Method on Explanatory Factor Analysis

    Directory of Open Access Journals (Sweden)

    Duygu KOÇAK

    2017-11-01

    Full Text Available The study aims to identify the effects of iteration numbers used in multiple iteration method, one of the methods used to cope with missing values, on the results of factor analysis. With this aim, artificial datasets of different sample sizes were created. Missing values at random and missing values at complete random were created in various ratios by deleting data. For the data in random missing values, a second variable was iterated at ordinal scale level and datasets with different ratios of missing values were obtained based on the levels of this variable. The data were generated using “psych” program in R software, while “dplyr” program was used to create codes that would delete values according to predetermined conditions of missing value mechanism. Different datasets were generated by applying different iteration numbers. Explanatory factor analysis was conducted on the datasets completed and the factors and total explained variances are presented. These values were first evaluated based on the number of factors and total variance explained of the complete datasets. The results indicate that multiple iteration method yields a better performance in cases of missing values at random compared to datasets with missing values at complete random. Also, it was found that increasing the number of iterations in both missing value datasets decreases the difference in the results obtained from complete datasets.

  1. Imputing amino acid polymorphisms in human leukocyte antigens.

    Directory of Open Access Journals (Sweden)

    Xiaoming Jia

    Full Text Available DNA sequence variation within human leukocyte antigen (HLA genes mediate susceptibility to a wide range of human diseases. The complex genetic structure of the major histocompatibility complex (MHC makes it difficult, however, to collect genotyping data in large cohorts. Long-range linkage disequilibrium between HLA loci and SNP markers across the major histocompatibility complex (MHC region offers an alternative approach through imputation to interrogate HLA variation in existing GWAS data sets. Here we describe a computational strategy, SNP2HLA, to impute classical alleles and amino acid polymorphisms at class I (HLA-A, -B, -C and class II (-DPA1, -DPB1, -DQA1, -DQB1, and -DRB1 loci. To characterize performance of SNP2HLA, we constructed two European ancestry reference panels, one based on data collected in HapMap-CEPH pedigrees (90 individuals and another based on data collected by the Type 1 Diabetes Genetics Consortium (T1DGC, 5,225 individuals. We imputed HLA alleles in an independent data set from the British 1958 Birth Cohort (N = 918 with gold standard four-digit HLA types and SNPs genotyped using the Affymetrix GeneChip 500 K and Illumina Immunochip microarrays. We demonstrate that the sample size of the reference panel, rather than SNP density of the genotyping platform, is critical to achieve high imputation accuracy. Using the larger T1DGC reference panel, the average accuracy at four-digit resolution is 94.7% using the low-density Affymetrix GeneChip 500 K, and 96.7% using the high-density Illumina Immunochip. For amino acid polymorphisms within HLA genes, we achieve 98.6% and 99.3% accuracy using the Affymetrix GeneChip 500 K and Illumina Immunochip, respectively. Finally, we demonstrate how imputation and association testing at amino acid resolution can facilitate fine-mapping of primary MHC association signals, giving a specific example from type 1 diabetes.

  2. Using ecological propensity score to adjust for missing confounders in small area studies.

    Science.gov (United States)

    Wang, Yingbo; Pirani, Monica; Hansell, Anna L; Richardson, Sylvia; Blangiardo, Marta

    2017-11-09

    Small area ecological studies are commonly used in epidemiology to assess the impact of area level risk factors on health outcomes when data are only available in an aggregated form. However, the resulting estimates are often biased due to unmeasured confounders, which typically are not available from the standard administrative registries used for these studies. Extra information on confounders can be provided through external data sets such as surveys or cohorts, where the data are available at the individual level rather than at the area level; however, such data typically lack the geographical coverage of administrative registries. We develop a framework of analysis which combines ecological and individual level data from different sources to provide an adjusted estimate of area level risk factors which is less biased. Our method (i) summarizes all available individual level confounders into an area level scalar variable, which we call ecological propensity score (EPS), (ii) implements a hierarchical structured approach to impute the values of EPS whenever they are missing, and (iii) includes the estimated and imputed EPS into the ecological regression linking the risk factors to the health outcome. Through a simulation study, we show that integrating individual level data into small area analyses via EPS is a promising method to reduce the bias intrinsic in ecological studies due to unmeasured confounders; we also apply the method to a real case study to evaluate the effect of air pollution on coronary heart disease hospital admissions in Greater London. © The Author 2017. Published by Oxford University Press.

  3. Imputation across genotyping arrays for genome-wide association studies: assessment of bias and a correction strategy.

    Science.gov (United States)

    Johnson, Eric O; Hancock, Dana B; Levy, Joshua L; Gaddis, Nathan C; Saccone, Nancy L; Bierut, Laura J; Page, Grier P

    2013-05-01

    A great promise of publicly sharing genome-wide association data is the potential to create composite sets of controls. However, studies often use different genotyping arrays, and imputation to a common set of SNPs has shown substantial bias: a problem which has no broadly applicable solution. Based on the idea that using differing genotyped SNP sets as inputs creates differential imputation errors and thus bias in the composite set of controls, we examined the degree to which each of the following occurs: (1) imputation based on the union of genotyped SNPs (i.e., SNPs available on one or more arrays) results in bias, as evidenced by spurious associations (type 1 error) between imputed genotypes and arbitrarily assigned case/control status; (2) imputation based on the intersection of genotyped SNPs (i.e., SNPs available on all arrays) does not evidence such bias; and (3) imputation quality varies by the size of the intersection of genotyped SNP sets. Imputations were conducted in European Americans and African Americans with reference to HapMap phase II and III data. Imputation based on the union of genotyped SNPs across the Illumina 1M and 550v3 arrays showed spurious associations for 0.2 % of SNPs: ~2,000 false positives per million SNPs imputed. Biases remained problematic for very similar arrays (550v1 vs. 550v3) and were substantial for dissimilar arrays (Illumina 1M vs. Affymetrix 6.0). In all instances, imputing based on the intersection of genotyped SNPs (as few as 30 % of the total SNPs genotyped) eliminated such bias while still achieving good imputation quality.

  4. Multiple Imputation of a Randomly Censored Covariate Improves Logistic Regression Analysis.

    Science.gov (United States)

    Atem, Folefac D; Qian, Jing; Maye, Jacqueline E; Johnson, Keith A; Betensky, Rebecca A

    2016-01-01

    Randomly censored covariates arise frequently in epidemiologic studies. The most commonly used methods, including complete case and single imputation or substitution, suffer from inefficiency and bias. They make strong parametric assumptions or they consider limit of detection censoring only. We employ multiple imputation, in conjunction with semi-parametric modeling of the censored covariate, to overcome these shortcomings and to facilitate robust estimation. We develop a multiple imputation approach for randomly censored covariates within the framework of a logistic regression model. We use the non-parametric estimate of the covariate distribution or the semiparametric Cox model estimate in the presence of additional covariates in the model. We evaluate this procedure in simulations, and compare its operating characteristics to those from the complete case analysis and a survival regression approach. We apply the procedures to an Alzheimer's study of the association between amyloid positivity and maternal age of onset of dementia. Multiple imputation achieves lower standard errors and higher power than the complete case approach under heavy and moderate censoring and is comparable under light censoring. The survival regression approach achieves the highest power among all procedures, but does not produce interpretable estimates of association. Multiple imputation offers a favorable alternative to complete case analysis and ad hoc substitution methods in the presence of randomly censored covariates within the framework of logistic regression.

  5. Accuracy of genome-wide imputation of untyped markers and impacts on statistical power for association studies

    Directory of Open Access Journals (Sweden)

    McElwee Joshua

    2009-06-01

    Full Text Available Abstract Background Although high-throughput genotyping arrays have made whole-genome association studies (WGAS feasible, only a small proportion of SNPs in the human genome are actually surveyed in such studies. In addition, various SNP arrays assay different sets of SNPs, which leads to challenges in comparing results and merging data for meta-analyses. Genome-wide imputation of untyped markers allows us to address these issues in a direct fashion. Methods 384 Caucasian American liver donors were genotyped using Illumina 650Y (Ilmn650Y arrays, from which we also derived genotypes from the Ilmn317K array. On these data, we compared two imputation methods: MACH and BEAGLE. We imputed 2.5 million HapMap Release22 SNPs, and conducted GWAS on ~40,000 liver mRNA expression traits (eQTL analysis. In addition, 200 Caucasian American and 200 African American subjects were genotyped using the Affymetrix 500 K array plus a custom 164 K fill-in chip. We then imputed the HapMap SNPs and quantified the accuracy by randomly masking observed SNPs. Results MACH and BEAGLE perform similarly with respect to imputation accuracy. The Ilmn650Y results in excellent imputation performance, and it outperforms Affx500K or Ilmn317K sets. For Caucasian Americans, 90% of the HapMap SNPs were imputed at 98% accuracy. As expected, imputation of poorly tagged SNPs (untyped SNPs in weak LD with typed markers was not as successful. It was more challenging to impute genotypes in the African American population, given (1 shorter LD blocks and (2 admixture with Caucasian populations in this population. To address issue (2, we pooled HapMap CEU and YRI data as an imputation reference set, which greatly improved overall performance. The approximate 40,000 phenotypes scored in these populations provide a path to determine empirically how the power to detect associations is affected by the imputation procedures. That is, at a fixed false discovery rate, the number of cis

  6. Whole-Genome Sequencing Coupled to Imputation Discovers Genetic Signals for Anthropometric Traits

    NARCIS (Netherlands)

    I. Tachmazidou (Ioanna); Süveges, D. (Dániel); J. Min (Josine); G.R.S. Ritchie (Graham R.S.); Steinberg, J. (Julia); K. Walter (Klaudia); V. Iotchkova (Valentina); J.A. Schwartzentruber (Jeremy); J. Huang (Jian); Y. Memari (Yasin); McCarthy, S. (Shane); Crawford, A.A. (Andrew A.); C. Bombieri (Cristina); M. Cocca (Massimiliano); A.-E. Farmaki (Aliki-Eleni); T.R. Gaunt (Tom); P. Jousilahti (Pekka); M.N. Kooijman (Marjolein ); Lehne, B. (Benjamin); G. Malerba (Giovanni); S. Männistö (Satu); A. Matchan (Angela); M.C. Medina-Gomez (Carolina); S. Metrustry (Sarah); A. Nag (Abhishek); I. Ntalla (Ioanna); L. Paternoster (Lavinia); N.W. Rayner (Nigel William); C. Sala (Cinzia); W.R. Scott (William R.); H.A. Shihab (Hashem A.); L. Southam (Lorraine); B. St Pourcain (Beate); M. Traglia (Michela); K. Trajanoska (Katerina); Zaza, G. (Gialuigi); W. Zhang (Weihua); M.S. Artigas; Bansal, N. (Narinder); M. Benn (Marianne); Chen, Z. (Zhongsheng); P. Danecek (Petr); Lin, W.-Y. (Wei-Yu); A. Locke (Adam); J. Luan (Jian'An); A.K. Manning (Alisa); Mulas, A. (Antonella); C. Sidore (Carlo); A. Tybjaerg-Hansen; A. Varbo (Anette); M. Zoledziewska (Magdalena); C. Finan (Chris); Hatzikotoulas, K. (Konstantinos); A.E. Hendricks (Audrey E.); J.P. Kemp (John); A. Moayyeri (Alireza); Panoutsopoulou, K. (Kalliope); Szpak, M. (Michal); S.G. Wilson (Scott); M. Boehnke (Michael); F. Cucca (Francesco); Di Angelantonio, E. (Emanuele); C. Langenberg (Claudia); C.M. Lindgren (Cecilia M.); McCarthy, M.I. (Mark I.); A.P. Morris (Andrew); B.G. Nordestgaard (Børge); R.A. Scott (Robert); M.D. Tobin (Martin); N.J. Wareham (Nick); P.R. Burton (Paul); J.C. Chambers (John); Smith, G.D. (George Davey); G.V. Dedoussis (George); J.F. Felix (Janine); O.H. Franco (Oscar); Gambaro, G. (Giovanni); P. Gasparini (Paolo); C.J. Hammond (Christopher J.); A. Hofman (Albert); V.W.V. Jaddoe (Vincent); M.E. Kleber (Marcus); J.S. Kooner (Jaspal S.); M. Perola (Markus); C.L. Relton (Caroline); S.M. Ring (Susan); F. Rivadeneira Ramirez (Fernando); V. Salomaa (Veikko); T.D. Spector (Timothy); O. Stegle (Oliver); D. Toniolo (Daniela); A.G. Uitterlinden (André); I.E. Barroso (Inês); C.M.T. Greenwood (Celia); Perry, J.R.B. (John R.B.); Walker, B.R. (Brian R.); A.S. Butterworth (Adam); Y. Xue (Yali); R. Durbin (Richard); K.S. Small (Kerrin); N. Soranzo (Nicole); N.J. Timpson (Nicholas); E. Zeggini (Eleftheria)

    2016-01-01

    textabstractDeep sequence-based imputation can enhance the discovery power of genome-wide association studies by assessing previously unexplored variation across the common- and low-frequency spectra. We applied a hybrid whole-genome sequencing (WGS) and deep imputation approach to examine the

  7. Whole-Genome Sequencing Coupled to Imputation Discovers Genetic Signals for Anthropometric Traits

    DEFF Research Database (Denmark)

    Tachmazidou, Ioanna; Süveges, Dániel; Min, Josine L

    2017-01-01

    Deep sequence-based imputation can enhance the discovery power of genome-wide association studies by assessing previously unexplored variation across the common- and low-frequency spectra. We applied a hybrid whole-genome sequencing (WGS) and deep imputation approach to examine the broader alleli...

  8. Application of pattern mixture models to address missing data in longitudinal data analysis using SPSS.

    Science.gov (United States)

    Son, Heesook; Friedmann, Erika; Thomas, Sue A

    2012-01-01

    Longitudinal studies are used in nursing research to examine changes over time in health indicators. Traditional approaches to longitudinal analysis of means, such as analysis of variance with repeated measures, are limited to analyzing complete cases. This limitation can lead to biased results due to withdrawal or data omission bias or to imputation of missing data, which can lead to bias toward the null if data are not missing completely at random. Pattern mixture models are useful to evaluate the informativeness of missing data and to adjust linear mixed model (LMM) analyses if missing data are informative. The aim of this study was to provide an example of statistical procedures for applying a pattern mixture model to evaluate the informativeness of missing data and conduct analyses of data with informative missingness in longitudinal studies using SPSS. The data set from the Patients' and Families' Psychological Response to Home Automated External Defibrillator Trial was used as an example to examine informativeness of missing data with pattern mixture models and to use a missing data pattern in analysis of longitudinal data. Prevention of withdrawal bias, omitted data bias, and bias toward the null in longitudinal LMMs requires the assessment of the informativeness of the occurrence of missing data. Missing data patterns can be incorporated as fixed effects into LMMs to evaluate the contribution of the presence of informative missingness to and control for the effects of missingness on outcomes. Pattern mixture models are a useful method to address the presence and effect of informative missingness in longitudinal studies.

  9. Combining Fourier and lagged k-nearest neighbor imputation for biomedical time series data

    OpenAIRE

    Rahman, Shah Atiqur; Huang, Yuxiao; Claassen, Jan; Heintzman, Nathaniel; Kleinberg, Samantha

    2015-01-01

    Most clinical and biomedical data contain missing values. A patient’s record may be split across multiple institutions, devices may fail, and sensors may not be worn at all times. While these missing values are often ignored, this can lead to bias and error when the data are mined. Further, the data are not simply missing at random. Instead the measurement of a variable such as blood glucose may depend on its prior values as well as that of other variables. These dependencies exist across tim...

  10. Fully conditional specification in multivariate imputation

    NARCIS (Netherlands)

    van Buuren, S.; Brand, J. P.L.; Groothuis-Oudshoorn, C. G.M.; Rubin, D. B.

    2006-01-01

    The use of the Gibbs sampler with fully conditionally specified models, where the distribution of each variable given the other variables is the starting point, has become a popular method to create imputations in incomplete multivariate data. The theoretical weakness of this approach is that the

  11. A web-based approach to data imputation

    KAUST Repository

    Li, Zhixu; Sharaf, Mohamed Abdel Fattah; Sitbon, Laurianne; Sadiq, Shazia Wasim; Indulska, Marta; Zhou, Xiaofang

    2013-01-01

    principle. Moreover, WebPut extends effective Information Extraction (IE) methods for the purpose of formulating web search queries that are capable of effectively retrieving missing values with high accuracy. WebPut employs a confidence-based scheme

  12. How shared preferences in music create bonds between people: values as the missing link.

    Science.gov (United States)

    Boer, Diana; Fischer, Ronald; Strack, Micha; Bond, Michael H; Lo, Eva; Lam, Jason

    2011-09-01

    How can shared music preferences create social bonds between people? A process model is developed in which music preferences as value-expressive attitudes create social bonds via conveyed value similarity. The musical bonding model links two research streams: (a) music preferences as indicators of similarity in value orientations and (b) similarity in value orientations leading to social attraction. Two laboratory experiments and one dyadic field study demonstrated that music can create interpersonal bonds between young people because music preferences can be cues for similar or dissimilar value orientations, with similarity in values then contributing to social attraction. One study tested and ruled out an alternative explanation (via personality similarity), illuminating the differential impact of perceived value similarity versus personality similarity on social attraction. Value similarity is the missing link in explaining the musical bonding phenomenon, which seems to hold for Western and non-Western samples and in experimental and natural settings.

  13. Substituting missing data in compositional analysis

    Energy Technology Data Exchange (ETDEWEB)

    Real, Carlos, E-mail: carlos.real@usc.es [Area de Ecologia, Departamento de Biologia Celular y Ecologia, Escuela Politecnica Superior, Universidad de Santiago de Compostela, 27002 Lugo (Spain); Angel Fernandez, J.; Aboal, Jesus R.; Carballeira, Alejo [Area de Ecologia, Departamento de Biologia Celular y Ecologia, Facultad de Biologia, Universidad de Santiago de Compostela, 15782 Santiago de Compostela (Spain)

    2011-10-15

    Multivariate analysis of environmental data sets requires the absence of missing values or their substitution by small values. However, if the data is transformed logarithmically prior to the analysis, this solution cannot be applied because the logarithm of a small value might become an outlier. Several methods for substituting the missing values can be found in the literature although none of them guarantees that no distortion of the structure of the data set is produced. We propose a method for the assessment of these distortions which can be used for deciding whether to retain or not the samples or variables containing missing values and for the investigation of the performance of different substitution techniques. The method analyzes the structure of the distances among samples using Mantel tests. We present an application of the method to PCDD/F data measured in samples of terrestrial moss as part of a biomonitoring study. - Highlights: > Missing values in multivariate data sets must be substituted prior to analysis. > The substituted values can modify the structure of the data set. > We developed a method to estimate the magnitude of the alterations. > The method is simple and based on the Mantel test. > The method allowed the identification of problematic variables in a sample data set. - A method is presented for the assessment of the possible distortions in multivariate analysis caused by the substitution of missing values.

  14. Core Modular Blood and Brain Biomarkers in Social Defeat Mouse Model for Post Traumatic Stress Disorder

    Science.gov (United States)

    2013-08-20

    been used to induce anxiety, depression-like and avoidance symptoms, which are the most prominent psychiatric features of PTSD and common co...consideration. We then imputed missing values using the k-nearest neighbor imputation method. To avoid incurring a bias in favor of genes represented by a...Horvath S, Geschwind DH: Divergence of human and mouse brain transcriptome highlights Alzheimer disease pathways. PNAS 2010, 107(28):12698– 12703. 30

  15. Measurement of the 12C(e,e'p)11B Two-Body Breakup Reaction at High Missing Momentum Values

    Energy Technology Data Exchange (ETDEWEB)

    Monaghan, P; Shneor, R; Subedi, R; Anderson, B D; Aniol, K; Annand, J; Arrington, J; Benaoum, H; Benmokhtar, F; Bertin, P; Bertozzi, W; Boeglin, W; Chen, J P; Choi, Seonho; Chudakov, E; Ciofi degli-Atti, C; Cisbani, E; Cosyn, W; Craver, B; de Jager, C W; Feuerbach, R J; Folts, E; Frullani, S; Garibaldi, F; Gayou, O; Gilad, S; Gilman, R; Glamazdin, O; Gomez, J; Hansen, O; Higinbotham, D W; Holmstrom, T; Ibrahim, H; Igarashi, R; Jans, E; Jiang, X; Jiang, Y; Kaufman, L; Kelleher, A; Kolarkar, A; Kuchina, E; Kumbartzki, G; LeRose, J J; Lindgren, R; Liyanage, N; Margaziotis, D J; Markowitz, P; Marrone, S; Mazouz, M; Meekins, D; Michaels, R; Moffit, B; Morita, H; Nanda, S; Perdrisat, C F; Piasetzky, E; Potokar, M; Punjabi, V; Qiang, Y; Reinhold, J; Reitz, B; Ron, G; Rosner, G; Ryckebusch, J; Saha, A; Sawatzky, B; Segal, J; Shahinyan, A; Sirca, S; Slifer, K; Solvignon, P; Sulkosky, V; Thompson, N; Ulmer, P E; Urciuoli, G M; Voutier, E; Wang, K; Watson, J W; Weinstein, L B; Wojtsekhowski, B; Wood, S; Yao, H; Zheng, X; Zhu, L

    2014-08-01

    The five-fold differential cross section for the 12C(e,e'p)11B reaction was determined over a missing momentum range of 200-400 MeV/c, in a kinematics regime with Bjorken x > 1 and Q2 = 2.0 (GeV/c)2. A comparison of the results and theoretical models and previous lower missing momentum data is shown. The theoretical calculations agree well with the data up to a missing momentum value of 325 MeV/c and then diverge for larger missing momenta. The extracted distorted momentum distribution is shown to be consistent with previous data and extends the range of available data up to 400 MeV/c.

  16. Software for handling and replacement of missing data

    Directory of Open Access Journals (Sweden)

    Mayer, Benjamin

    2009-10-01

    Full Text Available In medical research missing values often arise in the course of a data analysis. This fact constitutes a problem for different reasons, so e.g. standard methods for analyzing data lead to biased estimates and a loss of statistical power due to missing values, since those methods require complete data sets and therefore omit incomplete cases for the analyses. Furthermore missing values imply a certain loss of information for what reason the validity of results of a study with missing values has to be rated less than in a case where all data had been available. For years there are methods for replacement of missing values (Rubin, Schafer to tackle these problems and solve them in parts. Hence in this article we want to present the existing software to handle and replace missing values on the one hand and give an outline about the available options to get information on the other hand. The methodological aspects of the replacement strategies are delineated just briefly in this article.

  17. PRIMAL: Fast and accurate pedigree-based imputation from sequence data in a founder population.

    Directory of Open Access Journals (Sweden)

    Oren E Livne

    2015-03-01

    Full Text Available Founder populations and large pedigrees offer many well-known advantages for genetic mapping studies, including cost-efficient study designs. Here, we describe PRIMAL (PedigRee IMputation ALgorithm, a fast and accurate pedigree-based phasing and imputation algorithm for founder populations. PRIMAL incorporates both existing and original ideas, such as a novel indexing strategy of Identity-By-Descent (IBD segments based on clique graphs. We were able to impute the genomes of 1,317 South Dakota Hutterites, who had genome-wide genotypes for ~300,000 common single nucleotide variants (SNVs, from 98 whole genome sequences. Using a combination of pedigree-based and LD-based imputation, we were able to assign 87% of genotypes with >99% accuracy over the full range of allele frequencies. Using the IBD cliques we were also able to infer the parental origin of 83% of alleles, and genotypes of deceased recent ancestors for whom no genotype information was available. This imputed data set will enable us to better study the relative contribution of rare and common variants on human phenotypes, as well as parental origin effect of disease risk alleles in >1,000 individuals at minimal cost.

  18. Substituting missing data in compositional analysis

    International Nuclear Information System (INIS)

    Real, Carlos; Angel Fernandez, J.; Aboal, Jesus R.; Carballeira, Alejo

    2011-01-01

    Multivariate analysis of environmental data sets requires the absence of missing values or their substitution by small values. However, if the data is transformed logarithmically prior to the analysis, this solution cannot be applied because the logarithm of a small value might become an outlier. Several methods for substituting the missing values can be found in the literature although none of them guarantees that no distortion of the structure of the data set is produced. We propose a method for the assessment of these distortions which can be used for deciding whether to retain or not the samples or variables containing missing values and for the investigation of the performance of different substitution techniques. The method analyzes the structure of the distances among samples using Mantel tests. We present an application of the method to PCDD/F data measured in samples of terrestrial moss as part of a biomonitoring study. - Highlights: → Missing values in multivariate data sets must be substituted prior to analysis. → The substituted values can modify the structure of the data set. → We developed a method to estimate the magnitude of the alterations. → The method is simple and based on the Mantel test. → The method allowed the identification of problematic variables in a sample data set. - A method is presented for the assessment of the possible distortions in multivariate analysis caused by the substitution of missing values.

  19. Sequence imputation of HPV16 genomes for genetic association studies.

    Directory of Open Access Journals (Sweden)

    Benjamin Smith

    Full Text Available Human Papillomavirus type 16 (HPV16 causes over half of all cervical cancer and some HPV16 variants are more oncogenic than others. The genetic basis for the extraordinary oncogenic properties of HPV16 compared to other HPVs is unknown. In addition, we neither know which nucleotides vary across and within HPV types and lineages, nor which of the single nucleotide polymorphisms (SNPs determine oncogenicity.A reference set of 62 HPV16 complete genome sequences was established and used to examine patterns of evolutionary relatedness amongst variants using a pairwise identity heatmap and HPV16 phylogeny. A BLAST-based algorithm was developed to impute complete genome data from partial sequence information using the reference database. To interrogate the oncogenic risk of determined and imputed HPV16 SNPs, odds-ratios for each SNP were calculated in a case-control viral genome-wide association study (VWAS using biopsy confirmed high-grade cervix neoplasia and self-limited HPV16 infections from Guanacaste, Costa Rica.HPV16 variants display evolutionarily stable lineages that contain conserved diagnostic SNPs. The imputation algorithm indicated that an average of 97.5±1.03% of SNPs could be accurately imputed. The VWAS revealed specific HPV16 viral SNPs associated with variant lineages and elevated odds ratios; however, individual causal SNPs could not be distinguished with certainty due to the nature of HPV evolution.Conserved and lineage-specific SNPs can be imputed with a high degree of accuracy from limited viral polymorphic data due to the lack of recombination and the stochastic mechanism of variation accumulation in the HPV genome. However, to determine the role of novel variants or non-lineage-specific SNPs by VWAS will require direct sequence analysis. The investigation of patterns of genetic variation and the identification of diagnostic SNPs for lineages of HPV16 variants provides a valuable resource for future studies of HPV16

  20. Imputation Accuracy from Low to Moderate Density Single Nucleotide Polymorphism Chips in a Thai Multibreed Dairy Cattle Population

    Directory of Open Access Journals (Sweden)

    Danai Jattawa

    2016-04-01

    Full Text Available The objective of this study was to investigate the accuracy of imputation from low density (LDC to moderate density SNP chips (MDC in a Thai Holstein-Other multibreed dairy cattle population. Dairy cattle with complete pedigree information (n = 1,244 from 145 dairy farms were genotyped with GeneSeek GGP20K (n = 570, GGP26K (n = 540 and GGP80K (n = 134 chips. After checking for single nucleotide polymorphism (SNP quality, 17,779 SNP markers in common between the GGP20K, GGP26K, and GGP80K were used to represent MDC. Animals were divided into two groups, a reference group (n = 912 and a test group (n = 332. The SNP markers chosen for the test group were those located in positions corresponding to GeneSeek GGP9K (n = 7,652. The LDC to MDC genotype imputation was carried out using three different software packages, namely Beagle 3.3 (population-based algorithm, FImpute 2.2 (combined family- and population-based algorithms and Findhap 4 (combined family- and population-based algorithms. Imputation accuracies within and across chromosomes were calculated as ratios of correctly imputed SNP markers to overall imputed SNP markers. Imputation accuracy for the three software packages ranged from 76.79% to 93.94%. FImpute had higher imputation accuracy (93.94% than Findhap (84.64% and Beagle (76.79%. Imputation accuracies were similar and consistent across chromosomes for FImpute, but not for Findhap and Beagle. Most chromosomes that showed either high (73% or low (80% imputation accuracies were the same chromosomes that had above and below average linkage disequilibrium (LD; defined here as the correlation between pairs of adjacent SNP within chromosomes less than or equal to 1 Mb apart. Results indicated that FImpute was more suitable than Findhap and Beagle for genotype imputation in this Thai multibreed population. Perhaps additional increments in imputation accuracy could be achieved by increasing the completeness of pedigree information.

  1. The utility of imputed matched sets. Analyzing probabilistically linked databases in a low information setting.

    Science.gov (United States)

    Thomas, A M; Cook, L J; Dean, J M; Olson, L M

    2014-01-01

    To compare results from high probability matched sets versus imputed matched sets across differing levels of linkage information. A series of linkages with varying amounts of available information were performed on two simulated datasets derived from multiyear motor vehicle crash (MVC) and hospital databases, where true matches were known. Distributions of high probability and imputed matched sets were compared against the true match population for occupant age, MVC county, and MVC hour. Regression models were fit to simulated log hospital charges and hospitalization status. High probability and imputed matched sets were not significantly different from occupant age, MVC county, and MVC hour in high information settings (p > 0.999). In low information settings, high probability matched sets were significantly different from occupant age and MVC county (p sets were not (p > 0.493). High information settings saw no significant differences in inference of simulated log hospital charges and hospitalization status between the two methods. High probability and imputed matched sets were significantly different from the outcomes in low information settings; however, imputed matched sets were more robust. The level of information available to a linkage is an important consideration. High probability matched sets are suitable for high to moderate information settings and for situations involving case-specific analysis. Conversely, imputed matched sets are preferable for low information settings when conducting population-based analyses.

  2. An Imputation Model for Dropouts in Unemployment Data

    Directory of Open Access Journals (Sweden)

    Nilsson Petra

    2016-09-01

    Full Text Available Incomplete unemployment data is a fundamental problem when evaluating labour market policies in several countries. Many unemployment spells end for unknown reasons; in the Swedish Public Employment Service’s register as many as 20 percent. This leads to an ambiguity regarding destination states (employment, unemployment, retired, etc.. According to complete combined administrative data, the employment rate among dropouts was close to 50 for the years 1992 to 2006, but from 2007 the employment rate has dropped to 40 or less. This article explores an imputation approach. We investigate imputation models estimated both on survey data from 2005/2006 and on complete combined administrative data from 2005/2006 and 2011/2012. The models are evaluated in terms of their ability to make correct predictions. The models have relatively high predictive power.

  3. Flexible Modeling of Survival Data with Covariates Subject to Detection Limits via Multiple Imputation.

    Science.gov (United States)

    Bernhardt, Paul W; Wang, Huixia Judy; Zhang, Daowen

    2014-01-01

    Models for survival data generally assume that covariates are fully observed. However, in medical studies it is not uncommon for biomarkers to be censored at known detection limits. A computationally-efficient multiple imputation procedure for modeling survival data with covariates subject to detection limits is proposed. This procedure is developed in the context of an accelerated failure time model with a flexible seminonparametric error distribution. The consistency and asymptotic normality of the multiple imputation estimator are established and a consistent variance estimator is provided. An iterative version of the proposed multiple imputation algorithm that approximates the EM algorithm for maximum likelihood is also suggested. Simulation studies demonstrate that the proposed multiple imputation methods work well while alternative methods lead to estimates that are either biased or more variable. The proposed methods are applied to analyze the dataset from a recently-conducted GenIMS study.

  4. Intention-to-treat analyses and missing data approaches in pharmacotherapy trials for alcohol use disorders.

    Science.gov (United States)

    Del Re, A C; Maisel, Natalya C; Blodgett, Janet C; Finney, John W

    2013-11-12

    Intention to treat (ITT) is an analytic strategy for reducing potential bias in treatment effects arising from missing data in randomised controlled trials (RCTs). Currently, no universally accepted definition of ITT exists, although many researchers consider it to require either no attrition or a strategy to handle missing data. Using the reports of a large pool of RCTs, we examined discrepancies between the types of analyses that alcohol pharmacotherapy researchers stated they used versus those they actually used. We also examined the linkage between analytic strategy (ie, ITT or not) and how missing data on outcomes were handled (if at all), and whether data analytic and missing data strategies have changed over time. Descriptive statistics were generated for reported and actual data analytic strategy and for missing data strategy. In addition, generalised linear models determined changes over time in the use of ITT analyses and missing data strategies. 165 RCTs of pharmacotherapy for alcohol use disorders. Of the 165 studies, 74 reported using an ITT strategy. However, less than 40% of the studies actually conducted ITT according to the rigorous definition above. Whereas no change in the use of ITT analyses over time was found, censored (last follow-up completed) and imputed missing data strategies have increased over time, while analyses of data only for the sample actually followed have decreased. Discrepancies in reporting versus actually conducting ITT analyses were found in this body of RCTs. Lack of clarity regarding the missing data strategy used was common. Consensus on a definition of ITT is important for an adequate understanding of research findings. Clearer reporting standards for analyses and the handling of missing data in pharmacotherapy trials and other intervention studies are needed.

  5. An efficient method to transcription factor binding sites imputation via simultaneous completion of multiple matrices with positional consistency.

    Science.gov (United States)

    Guo, Wei-Li; Huang, De-Shuang

    2017-08-22

    Transcription factors (TFs) are DNA-binding proteins that have a central role in regulating gene expression. Identification of DNA-binding sites of TFs is a key task in understanding transcriptional regulation, cellular processes and disease. Chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq) enables genome-wide identification of in vivo TF binding sites. However, it is still difficult to map every TF in every cell line owing to cost and biological material availability, which poses an enormous obstacle for integrated analysis of gene regulation. To address this problem, we propose a novel computational approach, TFBSImpute, for predicting additional TF binding profiles by leveraging information from available ChIP-seq TF binding data. TFBSImpute fuses the dataset to a 3-mode tensor and imputes missing TF binding signals via simultaneous completion of multiple TF binding matrices with positional consistency. We show that signals predicted by our method achieve overall similarity with experimental data and that TFBSImpute significantly outperforms baseline approaches, by assessing the performance of imputation methods against observed ChIP-seq TF binding profiles. Besides, motif analysis shows that TFBSImpute preforms better in capturing binding motifs enriched in observed data compared with baselines, indicating that the higher performance of TFBSImpute is not simply due to averaging related samples. We anticipate that our approach will constitute a useful complement to experimental mapping of TF binding, which is beneficial for further study of regulation mechanisms and disease.

  6. Inclusion of Population-specific Reference Panel from India to the 1000 Genomes Phase 3 Panel Improves Imputation Accuracy.

    Science.gov (United States)

    Ahmad, Meraj; Sinha, Anubhav; Ghosh, Sreya; Kumar, Vikrant; Davila, Sonia; Yajnik, Chittaranjan S; Chandak, Giriraj R

    2017-07-27

    Imputation is a computational method based on the principle of haplotype sharing allowing enrichment of genome-wide association study datasets. It depends on the haplotype structure of the population and density of the genotype data. The 1000 Genomes Project led to the generation of imputation reference panels which have been used globally. However, recent studies have shown that population-specific panels provide better enrichment of genome-wide variants. We compared the imputation accuracy using 1000 Genomes phase 3 reference panel and a panel generated from genome-wide data on 407 individuals from Western India (WIP). The concordance of imputed variants was cross-checked with next-generation re-sequencing data on a subset of genomic regions. Further, using the genome-wide data from 1880 individuals, we demonstrate that WIP works better than the 1000 Genomes phase 3 panel and when merged with it, significantly improves the imputation accuracy throughout the minor allele frequency range. We also show that imputation using only South Asian component of the 1000 Genomes phase 3 panel works as good as the merged panel, making it computationally less intensive job. Thus, our study stresses that imputation accuracy using 1000 Genomes phase 3 panel can be further improved by including population-specific reference panels from South Asia.

  7. Increasing imputation and prediction accuracy for Chinese Holsteins using joint Chinese-Nordic reference population

    DEFF Research Database (Denmark)

    Ma, Peipei; Lund, Mogens Sandø; Ding, X

    2015-01-01

    This study investigated the effect of including Nordic Holsteins in the reference population on the imputation accuracy and prediction accuracy for Chinese Holsteins. The data used in this study include 85 Chinese Holstein bulls genotyped with both 54K chip and 777K (HD) chip, 2862 Chinese cows...... was improved slightly when using the marker data imputed based on the combined HD reference data, compared with using the marker data imputed based on the Chinese HD reference data only. On the other hand, when using the combined reference population including 4398 Nordic Holstein bulls, the accuracy...... to increase reference population rather than increasing marker density...

  8. Imputation of the rare HOXB13 G84E mutation and cancer risk in a large population-based cohort.

    Directory of Open Access Journals (Sweden)

    Thomas J Hoffmann

    2015-01-01

    Full Text Available An efficient approach to characterizing the disease burden of rare genetic variants is to impute them into large well-phenotyped cohorts with existing genome-wide genotype data using large sequenced referenced panels. The success of this approach hinges on the accuracy of rare variant imputation, which remains controversial. For example, a recent study suggested that one cannot adequately impute the HOXB13 G84E mutation associated with prostate cancer risk (carrier frequency of 0.0034 in European ancestry participants in the 1000 Genomes Project. We show that by utilizing the 1000 Genomes Project data plus an enriched reference panel of mutation carriers we were able to accurately impute the G84E mutation into a large cohort of 83,285 non-Hispanic White participants from the Kaiser Permanente Research Program on Genes, Environment and Health Genetic Epidemiology Research on Adult Health and Aging cohort. Imputation authenticity was confirmed via a novel classification and regression tree method, and then empirically validated analyzing a subset of these subjects plus an additional 1,789 men from Kaiser specifically genotyped for the G84E mutation (r2 = 0.57, 95% CI = 0.37–0.77. We then show the value of this approach by using the imputed data to investigate the impact of the G84E mutation on age-specific prostate cancer risk and on risk of fourteen other cancers in the cohort. The age-specific risk of prostate cancer among G84E mutation carriers was higher than among non-carriers. Risk estimates from Kaplan-Meier curves were 36.7% versus 13.6% by age 72, and 64.2% versus 24.2% by age 80, for G84E mutation carriers and non-carriers, respectively (p = 3.4x10-12. The G84E mutation was also associated with an increase in risk for the fourteen other most common cancers considered collectively (p = 5.8x10-4 and more so in cases diagnosed with multiple cancer types, both those including and not including prostate cancer, strongly suggesting

  9. Comparison of different methods for imputing genome-wide marker genotypes in Swedish and Finnish Red Cattle

    DEFF Research Database (Denmark)

    Ma, Peipei; Brøndum, Rasmus Froberg; Qin, Zahng

    2013-01-01

    This study investigated the imputation accuracy of different methods, considering both the minor allele frequency and relatedness between individuals in the reference and test data sets. Two data sets from the combined population of Swedish and Finnish Red Cattle were used to test the influence...... coefficient was lower when the minor allele frequency was lower. The results indicate that Beagle and IMPUTE2 provide the most robust and accurate imputation accuracies, but considering computing time and memory usage, FImpute is another alternative method....

  10. Imputing forest carbon stock estimates from inventory plots to a nationally continuous coverage

    Directory of Open Access Journals (Sweden)

    Wilson Barry Tyler

    2013-01-01

    Full Text Available Abstract The U.S. has been providing national-scale estimates of forest carbon (C stocks and stock change to meet United Nations Framework Convention on Climate Change (UNFCCC reporting requirements for years. Although these currently are provided as national estimates by pool and year to meet greenhouse gas monitoring requirements, there is growing need to disaggregate these estimates to finer scales to enable strategic forest management and monitoring activities focused on various ecosystem services such as C storage enhancement. Through application of a nearest-neighbor imputation approach, spatially extant estimates of forest C density were developed for the conterminous U.S. using the U.S.’s annual forest inventory. Results suggest that an existing forest inventory plot imputation approach can be readily modified to provide raster maps of C density across a range of pools (e.g., live tree to soil organic carbon and spatial scales (e.g., sub-county to biome. Comparisons among imputed maps indicate strong regional differences across C pools. The C density of pools closely related to detrital input (e.g., dead wood is often highest in forests suffering from recent mortality events such as those in the northern Rocky Mountains (e.g., beetle infestations. In contrast, live tree carbon density is often highest on the highest quality forest sites such as those found in the Pacific Northwest. Validation results suggest strong agreement between the estimates produced from the forest inventory plots and those from the imputed maps, particularly when the C pool is closely associated with the imputation model (e.g., aboveground live biomass and live tree basal area, with weaker agreement for detrital pools (e.g., standing dead trees. Forest inventory imputed plot maps provide an efficient and flexible approach to monitoring diverse C pools at national (e.g., UNFCCC and regional scales (e.g., Reducing Emissions from Deforestation and Forest

  11. Hypertension in Pregnancy among Rural Women in Katsina State ...

    African Journals Online (AJOL)

    2017-06-30

    Jun 30, 2017 ... studies to the best of our search were in southern Nigeria. [7,12 ... Nigeria has the worst statistics of maternal death in ... data on socio-demographic, medical history and risk ... multiple imputations for a few missing values.

  12. Improving accuracy of genomic prediction in Brangus cattle by adding animals with imputed low-density SNP genotypes.

    Science.gov (United States)

    Lopes, F B; Wu, X-L; Li, H; Xu, J; Perkins, T; Genho, J; Ferretti, R; Tait, R G; Bauck, S; Rosa, G J M

    2018-02-01

    Reliable genomic prediction of breeding values for quantitative traits requires the availability of sufficient number of animals with genotypes and phenotypes in the training set. As of 31 October 2016, there were 3,797 Brangus animals with genotypes and phenotypes. These Brangus animals were genotyped using different commercial SNP chips. Of them, the largest group consisted of 1,535 animals genotyped by the GGP-LDV4 SNP chip. The remaining 2,262 genotypes were imputed to the SNP content of the GGP-LDV4 chip, so that the number of animals available for training the genomic prediction models was more than doubled. The present study showed that the pooling of animals with both original or imputed 40K SNP genotypes substantially increased genomic prediction accuracies on the ten traits. By supplementing imputed genotypes, the relative gains in genomic prediction accuracies on estimated breeding values (EBV) were from 12.60% to 31.27%, and the relative gain in genomic prediction accuracies on de-regressed EBV was slightly small (i.e. 0.87%-18.75%). The present study also compared the performance of five genomic prediction models and two cross-validation methods. The five genomic models predicted EBV and de-regressed EBV of the ten traits similarly well. Of the two cross-validation methods, leave-one-out cross-validation maximized the number of animals at the stage of training for genomic prediction. Genomic prediction accuracy (GPA) on the ten quantitative traits was validated in 1,106 newly genotyped Brangus animals based on the SNP effects estimated in the previous set of 3,797 Brangus animals, and they were slightly lower than GPA in the original data. The present study was the first to leverage currently available genotype and phenotype resources in order to harness genomic prediction in Brangus beef cattle. © 2018 Blackwell Verlag GmbH.

  13. Comparing strategies for selection of low-density SNPs for imputation-mediated genomic prediction in U. S. Holsteins.

    Science.gov (United States)

    He, Jun; Xu, Jiaqi; Wu, Xiao-Lin; Bauck, Stewart; Lee, Jungjae; Morota, Gota; Kachman, Stephen D; Spangler, Matthew L

    2018-04-01

    SNP chips are commonly used for genotyping animals in genomic selection but strategies for selecting low-density (LD) SNPs for imputation-mediated genomic selection have not been addressed adequately. The main purpose of the present study was to compare the performance of eight LD (6K) SNP panels, each selected by a different strategy exploiting a combination of three major factors: evenly-spaced SNPs, increased minor allele frequencies, and SNP-trait associations either for single traits independently or for all the three traits jointly. The imputation accuracies from 6K to 80K SNP genotypes were between 96.2 and 98.2%. Genomic prediction accuracies obtained using imputed 80K genotypes were between 0.817 and 0.821 for daughter pregnancy rate, between 0.838 and 0.844 for fat yield, and between 0.850 and 0.863 for milk yield. The two SNP panels optimized on the three major factors had the highest genomic prediction accuracy (0.821-0.863), and these accuracies were very close to those obtained using observed 80K genotypes (0.825-0.868). Further exploration of the underlying relationships showed that genomic prediction accuracies did not respond linearly to imputation accuracies, but were significantly affected by genotype (imputation) errors of SNPs in association with the traits to be predicted. SNPs optimal for map coverage and MAF were favorable for obtaining accurate imputation of genotypes whereas trait-associated SNPs improved genomic prediction accuracies. Thus, optimal LD SNP panels were the ones that combined both strengths. The present results have practical implications on the design of LD SNP chips for imputation-enabled genomic prediction.

  14. Design of a bovine low-density SNP array optimized for imputation.

    Directory of Open Access Journals (Sweden)

    Didier Boichard

    Full Text Available The Illumina BovineLD BeadChip was designed to support imputation to higher density genotypes in dairy and beef breeds by including single-nucleotide polymorphisms (SNPs that had a high minor allele frequency as well as uniform spacing across the genome except at the ends of the chromosome where densities were increased. The chip also includes SNPs on the Y chromosome and mitochondrial DNA loci that are useful for determining subspecies classification and certain paternal and maternal breed lineages. The total number of SNPs was 6,909. Accuracy of imputation to Illumina BovineSNP50 genotypes using the BovineLD chip was over 97% for most dairy and beef populations. The BovineLD imputations were about 3 percentage points more accurate than those from the Illumina GoldenGate Bovine3K BeadChip across multiple populations. The improvement was greatest when neither parent was genotyped. The minor allele frequencies were similar across taurine beef and dairy breeds as was the proportion of SNPs that were polymorphic. The new BovineLD chip should facilitate low-cost genomic selection in taurine beef and dairy cattle.

  15. Using the Superpopulation Model for Imputations and Variance Computation in Survey Sampling

    Directory of Open Access Journals (Sweden)

    Petr Novák

    2012-03-01

    Full Text Available This study is aimed at variance computation techniques for estimates of population characteristics based on survey sampling and imputation. We use the superpopulation regression model, which means that the target variable values for each statistical unit are treated as random realizations of a linear regression model with weighted variance. We focus on regression models with one auxiliary variable and no intercept, which have many applications and straightforward interpretation in business statistics. Furthermore, we deal with caseswhere the estimates are not independent and thus the covariance must be computed. We also consider chained regression models with auxiliary variables as random variables instead of constants.

  16. Exploring Missing Values on Responses to Experienced and Labeled Event as Harassment in 2004 Reserves Data

    Science.gov (United States)

    2008-07-01

    Personal Experiences of Sexual Harassment and Missing Values on Sexual Harassment Questions by Perceptions of Sexism in a Unit (Quartiles... sexism in a unit). The “worst” category indicates units with the highest levels of reported sexist behavior, and the “best” category indicates the...Education and Prevention, 19 (6), 519–530. Harris, R. J., & Firestone, J. M., (1997). Subtle sexism in the U.S. Military: Individual responses to

  17. RIDDLE: Race and ethnicity Imputation from Disease history with Deep LEarning.

    Directory of Open Access Journals (Sweden)

    Ji-Sung Kim

    2018-04-01

    Full Text Available Anonymized electronic medical records are an increasingly popular source of research data. However, these datasets often lack race and ethnicity information. This creates problems for researchers modeling human disease, as race and ethnicity are powerful confounders for many health exposures and treatment outcomes; race and ethnicity are closely linked to population-specific genetic variation. We showed that deep neural networks generate more accurate estimates for missing racial and ethnic information than competing methods (e.g., logistic regression, random forest, support vector machines, and gradient-boosted decision trees. RIDDLE yielded significantly better classification performance across all metrics that were considered: accuracy, cross-entropy loss (error, precision, recall, and area under the curve for receiver operating characteristic plots (all p < 10-9. We made specific efforts to interpret the trained neural network models to identify, quantify, and visualize medical features which are predictive of race and ethnicity. We used these characterizations of informative features to perform a systematic comparison of differential disease patterns by race and ethnicity. The fact that clinical histories are informative for imputing race and ethnicity could reflect (1 a skewed distribution of blue- and white-collar professions across racial and ethnic groups, (2 uneven accessibility and subjective importance of prophylactic health, (3 possible variation in lifestyle, such as dietary habits, and (4 differences in background genetic variation which predispose to diseases.

  18. RIDDLE: Race and ethnicity Imputation from Disease history with Deep LEarning

    KAUST Repository

    Kim, Ji-Sung

    2018-04-26

    Anonymized electronic medical records are an increasingly popular source of research data. However, these datasets often lack race and ethnicity information. This creates problems for researchers modeling human disease, as race and ethnicity are powerful confounders for many health exposures and treatment outcomes; race and ethnicity are closely linked to population-specific genetic variation. We showed that deep neural networks generate more accurate estimates for missing racial and ethnic information than competing methods (e.g., logistic regression, random forest, support vector machines, and gradient-boosted decision trees). RIDDLE yielded significantly better classification performance across all metrics that were considered: accuracy, cross-entropy loss (error), precision, recall, and area under the curve for receiver operating characteristic plots (all p < 10-9). We made specific efforts to interpret the trained neural network models to identify, quantify, and visualize medical features which are predictive of race and ethnicity. We used these characterizations of informative features to perform a systematic comparison of differential disease patterns by race and ethnicity. The fact that clinical histories are informative for imputing race and ethnicity could reflect (1) a skewed distribution of blue- and white-collar professions across racial and ethnic groups, (2) uneven accessibility and subjective importance of prophylactic health, (3) possible variation in lifestyle, such as dietary habits, and (4) differences in background genetic variation which predispose to diseases.

  19. An Improved DINEOF Algorithm for Filling Missing Values in Spatio-Temporal Sea Surface Temperature Data.

    Science.gov (United States)

    Ping, Bo; Su, Fenzhen; Meng, Yunshan

    2016-01-01

    In this study, an improved Data INterpolating Empirical Orthogonal Functions (DINEOF) algorithm for determination of missing values in a spatio-temporal dataset is presented. Compared with the ordinary DINEOF algorithm, the iterative reconstruction procedure until convergence based on every fixed EOF to determine the optimal EOF mode is not necessary and the convergence criterion is only reached once in the improved DINEOF algorithm. Moreover, in the ordinary DINEOF algorithm, after optimal EOF mode determination, the initial matrix with missing data will be iteratively reconstructed based on the optimal EOF mode until the reconstruction is convergent. However, the optimal EOF mode may be not the best EOF for some reconstructed matrices generated in the intermediate steps. Hence, instead of using asingle EOF to fill in the missing data, in the improved algorithm, the optimal EOFs for reconstruction are variable (because the optimal EOFs are variable, the improved algorithm is called VE-DINEOF algorithm in this study). To validate the accuracy of the VE-DINEOF algorithm, a sea surface temperature (SST) data set is reconstructed by using the DINEOF, I-DINEOF (proposed in 2015) and VE-DINEOF algorithms. Four parameters (Pearson correlation coefficient, signal-to-noise ratio, root-mean-square error, and mean absolute difference) are used as a measure of reconstructed accuracy. Compared with the DINEOF and I-DINEOF algorithms, the VE-DINEOF algorithm can significantly enhance the accuracy of reconstruction and shorten the computational time.

  20. Inflation in Africa, 1960-2015

    NARCIS (Netherlands)

    Ph.H.B.F. Franses (Philip Hans); E. Janssens (Eva)

    2017-01-01

    markdownabstractWe present various stylized facts about annual CPI based inflation in 47 African countries. Some stylized facts concern time series properties for each of the series but also across series. To achieve a useful and relevant dataset, we impute all missing values in the sample

  1. A spatial haplotype copying model with applications to genotype imputation.

    Science.gov (United States)

    Yang, Wen-Yun; Hormozdiari, Farhad; Eskin, Eleazar; Pasaniuc, Bogdan

    2015-05-01

    Ever since its introduction, the haplotype copy model has proven to be one of the most successful approaches for modeling genetic variation in human populations, with applications ranging from ancestry inference to genotype phasing and imputation. Motivated by coalescent theory, this approach assumes that any chromosome (haplotype) can be modeled as a mosaic of segments copied from a set of chromosomes sampled from the same population. At the core of the model is the assumption that any chromosome from the sample is equally likely to contribute a priori to the copying process. Motivated by recent works that model genetic variation in a geographic continuum, we propose a new spatial-aware haplotype copy model that jointly models geography and the haplotype copying process. We extend hidden Markov models of haplotype diversity such that at any given location, haplotypes that are closest in the genetic-geographic continuum map are a priori more likely to contribute to the copying process than distant ones. Through simulations starting from the 1000 Genomes data, we show that our model achieves superior accuracy in genotype imputation over the standard spatial-unaware haplotype copy model. In addition, we show the utility of our model in selecting a small personalized reference panel for imputation that leads to both improved accuracy as well as to a lower computational runtime than the standard approach. Finally, we show our proposed model can be used to localize individuals on the genetic-geographical map on the basis of their genotype data.

  2. Comparison of population-averaged and cluster-specific models for the analysis of cluster randomized trials with missing binary outcomes: a simulation study

    Directory of Open Access Journals (Sweden)

    Ma Jinhui

    2013-01-01

    Full Text Available Abstracts Background The objective of this simulation study is to compare the accuracy and efficiency of population-averaged (i.e. generalized estimating equations (GEE and cluster-specific (i.e. random-effects logistic regression (RELR models for analyzing data from cluster randomized trials (CRTs with missing binary responses. Methods In this simulation study, clustered responses were generated from a beta-binomial distribution. The number of clusters per trial arm, the number of subjects per cluster, intra-cluster correlation coefficient, and the percentage of missing data were allowed to vary. Under the assumption of covariate dependent missingness, missing outcomes were handled by complete case analysis, standard multiple imputation (MI and within-cluster MI strategies. Data were analyzed using GEE and RELR. Performance of the methods was assessed using standardized bias, empirical standard error, root mean squared error (RMSE, and coverage probability. Results GEE performs well on all four measures — provided the downward bias of the standard error (when the number of clusters per arm is small is adjusted appropriately — under the following scenarios: complete case analysis for CRTs with a small amount of missing data; standard MI for CRTs with variance inflation factor (VIF 50. RELR performs well only when a small amount of data was missing, and complete case analysis was applied. Conclusion GEE performs well as long as appropriate missing data strategies are adopted based on the design of CRTs and the percentage of missing data. In contrast, RELR does not perform well when either standard or within-cluster MI strategy is applied prior to the analysis.

  3. 40 CFR 98.405 - Procedures for estimating missing data.

    Science.gov (United States)

    2010-07-01

    ... meter malfunctions), a substitute data value for the missing quantity measurement must be used in the... period for any reason, the reporter shall use either its delivering pipeline measurements or the default... § 98.405 Procedures for estimating missing data. (a) Whenever a quality-assured value of the quantity...

  4. Reading Profiles in Multi-Site Data With Missingness.

    Science.gov (United States)

    Eckert, Mark A; Vaden, Kenneth I; Gebregziabher, Mulugeta

    2018-01-01

    Children with reading disability exhibit varied deficits in reading and cognitive abilities that contribute to their reading comprehension problems. Some children exhibit primary deficits in phonological processing, while others can exhibit deficits in oral language and executive functions that affect comprehension. This behavioral heterogeneity is problematic when missing data prevent the characterization of different reading profiles, which often occurs in retrospective data sharing initiatives without coordinated data collection. Here we show that reading profiles can be reliably identified based on Random Forest classification of incomplete behavioral datasets, after the missForest method is used to multiply impute missing values. Results from simulation analyses showed that reading profiles could be accurately classified across degrees of missingness (e.g., ∼5% classification error for 30% missingness across the sample). The application of missForest to a real multi-site dataset with missingness ( n = 924) showed that reading disability profiles significantly and consistently differed in reading and cognitive abilities for cases with and without missing data. The results of validation analyses indicated that the reading profiles (cases with and without missing data) exhibited significant differences for an independent set of behavioral variables that were not used to classify reading profiles. Together, the results show how multiple imputation can be applied to the classification of cases with missing data and can increase the integrity of results from multi-site open access datasets.

  5. Reading Profiles in Multi-Site Data With Missingness

    Directory of Open Access Journals (Sweden)

    Mark A. Eckert

    2018-05-01

    Full Text Available Children with reading disability exhibit varied deficits in reading and cognitive abilities that contribute to their reading comprehension problems. Some children exhibit primary deficits in phonological processing, while others can exhibit deficits in oral language and executive functions that affect comprehension. This behavioral heterogeneity is problematic when missing data prevent the characterization of different reading profiles, which often occurs in retrospective data sharing initiatives without coordinated data collection. Here we show that reading profiles can be reliably identified based on Random Forest classification of incomplete behavioral datasets, after the missForest method is used to multiply impute missing values. Results from simulation analyses showed that reading profiles could be accurately classified across degrees of missingness (e.g., ∼5% classification error for 30% missingness across the sample. The application of missForest to a real multi-site dataset with missingness (n = 924 showed that reading disability profiles significantly and consistently differed in reading and cognitive abilities for cases with and without missing data. The results of validation analyses indicated that the reading profiles (cases with and without missing data exhibited significant differences for an independent set of behavioral variables that were not used to classify reading profiles. Together, the results show how multiple imputation can be applied to the classification of cases with missing data and can increase the integrity of results from multi-site open access datasets.

  6. Customer value - the missing link

    International Nuclear Information System (INIS)

    Pierce, A.R.; Martin, M.G.; Wagner, V.E.

    1993-01-01

    For many years electric utilities found it easy to provide value to their shareholders. With a monopoly service and decreasing costs it was easy to sell 70% more electricity each year and earn attractive returns. In the last 20 years electric utilities have teamed that it is not possible to provide value to their shareholders without providing value to their customers. Detroit Edison is learning that customer value is not always what the utility thinks it is. There is no better way to find out what customers value than to ask them. Detroit Edison has done a lot of direct asking in the last couple of years, through market research and individual interviews, and has learned indirectly from customers when a particular program does not succeed as we thought it should. Two areas where more has been learned about customer value are Demand Side Management (DSM) and Power Quality

  7. Handling missing data in transmission disequilibrium test in nuclear families with one affected offspring.

    Directory of Open Access Journals (Sweden)

    Gulhan Bourget

    Full Text Available The Transmission Disequilibrium Test (TDT compares frequencies of transmission of two alleles from heterozygote parents to an affected offspring. This test requires all genotypes to be known from all members of the nuclear families. However, obtaining all genotypes in a study might not be possible for some families, in which case, a data set results in missing genotypes. There are many techniques of handling missing genotypes in parents but only a few in offspring. The robust TDT (rTDT is one of the methods that handles missing genotypes for all members of nuclear families [with one affected offspring]. Even though all family members can be imputed, the rTDT is a conservative test with low power. We propose a new method, Mendelian Inheritance TDT (MITDT-ONE, that controls type I error and has high power. The MITDT-ONE uses Mendelian Inheritance properties, and takes population frequencies of the disease allele and marker allele into account in the rTDT method. One of the advantages of using the MITDT-ONE is that the MITDT-ONE can identify additional significant genes that are not found by the rTDT. We demonstrate the performances of both tests along with Sib-TDT (S-TDT in Monte Carlo simulation studies. Moreover, we apply our method to the type 1 diabetes data from the Warren families in the United Kingdom to identify significant genes that are related to type 1 diabetes.

  8. AUTOMATIC CLASSIFICATION OF VARIABLE STARS IN CATALOGS WITH MISSING DATA

    International Nuclear Information System (INIS)

    Pichara, Karim; Protopapas, Pavlos

    2013-01-01

    We present an automatic classification method for astronomical catalogs with missing data. We use Bayesian networks and a probabilistic graphical model that allows us to perform inference to predict missing values given observed data and dependency relationships between variables. To learn a Bayesian network from incomplete data, we use an iterative algorithm that utilizes sampling methods and expectation maximization to estimate the distributions and probabilistic dependencies of variables from data with missing values. To test our model, we use three catalogs with missing data (SAGE, Two Micron All Sky Survey, and UBVI) and one complete catalog (MACHO). We examine how classification accuracy changes when information from missing data catalogs is included, how our method compares to traditional missing data approaches, and at what computational cost. Integrating these catalogs with missing data, we find that classification of variable objects improves by a few percent and by 15% for quasar detection while keeping the computational cost the same

  9. AUTOMATIC CLASSIFICATION OF VARIABLE STARS IN CATALOGS WITH MISSING DATA

    Energy Technology Data Exchange (ETDEWEB)

    Pichara, Karim [Computer Science Department, Pontificia Universidad Católica de Chile, Santiago (Chile); Protopapas, Pavlos [Institute for Applied Computational Science, Harvard University, Cambridge, MA (United States)

    2013-11-10

    We present an automatic classification method for astronomical catalogs with missing data. We use Bayesian networks and a probabilistic graphical model that allows us to perform inference to predict missing values given observed data and dependency relationships between variables. To learn a Bayesian network from incomplete data, we use an iterative algorithm that utilizes sampling methods and expectation maximization to estimate the distributions and probabilistic dependencies of variables from data with missing values. To test our model, we use three catalogs with missing data (SAGE, Two Micron All Sky Survey, and UBVI) and one complete catalog (MACHO). We examine how classification accuracy changes when information from missing data catalogs is included, how our method compares to traditional missing data approaches, and at what computational cost. Integrating these catalogs with missing data, we find that classification of variable objects improves by a few percent and by 15% for quasar detection while keeping the computational cost the same.

  10. RIDDLE: Race and ethnicity Imputation from Disease history with Deep LEarning

    KAUST Repository

    Kim, Ji-Sung; Gao, Xin; Rzhetsky, Andrey

    2018-01-01

    are predictive of race and ethnicity. We used these characterizations of informative features to perform a systematic comparison of differential disease patterns by race and ethnicity. The fact that clinical histories are informative for imputing race

  11. Nonparametric autocovariance estimation from censored time series by Gaussian imputation.

    Science.gov (United States)

    Park, Jung Wook; Genton, Marc G; Ghosh, Sujit K

    2009-02-01

    One of the most frequently used methods to model the autocovariance function of a second-order stationary time series is to use the parametric framework of autoregressive and moving average models developed by Box and Jenkins. However, such parametric models, though very flexible, may not always be adequate to model autocovariance functions with sharp changes. Furthermore, if the data do not follow the parametric model and are censored at a certain value, the estimation results may not be reliable. We develop a Gaussian imputation method to estimate an autocovariance structure via nonparametric estimation of the autocovariance function in order to address both censoring and incorrect model specification. We demonstrate the effectiveness of the technique in terms of bias and efficiency with simulations under various rates of censoring and underlying models. We describe its application to a time series of silicon concentrations in the Arctic.

  12. Addressing the challenges of obtaining functional outcomes in traumatic brain injury research: missing data patterns, timing of follow-up, and three prognostic models.

    Science.gov (United States)

    Zelnick, Leila R; Morrison, Laurie J; Devlin, Sean M; Bulger, Eileen M; Brasel, Karen J; Sheehan, Kellie; Minei, Joseph P; Kerby, Jeffrey D; Tisherman, Samuel A; Rizoli, Sandro; Karmy-Jones, Riyad; van Heest, Rardi; Newgard, Craig D

    2014-06-01

    Traumatic brain injury (TBI) is common and debilitating. Randomized trials of interventions for TBI ideally assess effectiveness by using long-term functional neurological outcomes, but such outcomes are difficult to obtain and costly. If there is little change between functional status at hospital discharge versus 6 months, then shorter-term outcomes may be adequate for use in future clinical trials. Using data from a previously published multi-center, randomized, placebo-controlled TBI clinical trial, we evaluated patterns of missing outcome data, changes in functional status between hospital discharge and 6 months, and three prognostic models to predict long-term functional outcome from covariates available at hospital discharge (functional measures, demographics, and injury characteristics). The Resuscitation Outcomes Consortium Hypertonic Saline trial enrolled 1282 TBI patients, obtaining the primary outcome of 6-month Glasgow Outcome Score Extended (GOSE) for 85% of patients, but missing the primary outcome for the remaining 15%. Patients with missing outcomes had less-severe injuries, higher neurological function at discharge (GOSE), and shorter hospital stays than patients whose GOSE was obtained. Of 1066 (83%) patients whose GOSE was obtained both at hospital discharge and at 6-months, 71% of patients had the same dichotomized functional status (severe disability/death vs. moderate/no disability) after 6 months as at discharge, 28% had an improved functional status, and 1% had worsened. Performance was excellent (C-statistic between 0.88 and 0.91) for all three prognostic models and calibration adequate for two models (p values, 0.22 and 0.85). Our results suggest that multiple imputation of the standard 6-month GOSE may be reasonable in TBI research when the primary outcome cannot be obtained through other means.

  13. Exploring the Interplay between Rescue Drugs, Data Imputation, and Study Outcomes: Conceptual Review and Qualitative Analysis of an Acute Pain Data Set.

    Science.gov (United States)

    Singla, Neil K; Meske, Diana S; Desjardins, Paul J

    2017-12-01

    In placebo-controlled acute surgical pain studies, provisions must be made for study subjects to receive adequate analgesic therapy. As such, most protocols allow study subjects to receive a pre-specified regimen of open-label analgesic drugs (rescue drugs) as needed. The selection of an appropriate rescue regimen is a critical experimental design choice. We hypothesized that a rescue regimen that is too liberal could lead to all study arms receiving similar levels of pain relief (thereby confounding experimental results), while a regimen that is too stringent could lead to a high subject dropout rate (giving rise to a preponderance of missing data). Despite the importance of rescue regimen as a study design feature, there exist no published review articles or meta-analysis focusing on the impact of rescue therapy on experimental outcomes. Therefore, when selecting a rescue regimen, researchers must rely on clinical factors (what analgesics do patients usually receive in similar surgical scenarios) and/or anecdotal evidence. In the following article, we attempt to bridge this gap by reviewing and discussing the experimental impacts of rescue therapy on a common acute surgical pain population: first metatarsal bunionectomy. The function of this analysis is to (1) create a framework for discussion and future exploration of rescue as a methodological study design feature, (2) discuss the interplay between data imputation techniques and rescue drugs, and (3) inform the readership regarding the impact of data imputation techniques on the validity of study conclusions. Our findings indicate that liberal rescue may degrade assay sensitivity, while stringent rescue may lead to unacceptably high dropout rates.

  14. Construction and application of a Korean reference panel for imputing classical alleles and amino acids of human leukocyte antigen genes.

    Science.gov (United States)

    Kim, Kwangwoo; Bang, So-Young; Lee, Hye-Soon; Bae, Sang-Cheol

    2014-01-01

    Genetic variations of human leukocyte antigen (HLA) genes within the major histocompatibility complex (MHC) locus are strongly associated with disease susceptibility and prognosis for many diseases, including many autoimmune diseases. In this study, we developed a Korean HLA reference panel for imputing classical alleles and amino acid residues of several HLA genes. An HLA reference panel has potential for use in identifying and fine-mapping disease associations with the MHC locus in East Asian populations, including Koreans. A total of 413 unrelated Korean subjects were analyzed for single nucleotide polymorphisms (SNPs) at the MHC locus and six HLA genes, including HLA-A, -B, -C, -DRB1, -DPB1, and -DQB1. The HLA reference panel was constructed by phasing the 5,858 MHC SNPs, 233 classical HLA alleles, and 1,387 amino acid residue markers from 1,025 amino acid positions as binary variables. The imputation accuracy of the HLA reference panel was assessed by measuring concordance rates between imputed and genotyped alleles of the HLA genes from a subset of the study subjects and East Asian HapMap individuals. Average concordance rates were 95.6% and 91.1% at 2-digit and 4-digit allele resolutions, respectively. The imputation accuracy was minimally affected by SNP density of a test dataset for imputation. In conclusion, the Korean HLA reference panel we developed was highly suitable for imputing HLA alleles and amino acids from MHC SNPs in East Asians, including Koreans.

  15. Construction and application of a Korean reference panel for imputing classical alleles and amino acids of human leukocyte antigen genes.

    Directory of Open Access Journals (Sweden)

    Kwangwoo Kim

    Full Text Available Genetic variations of human leukocyte antigen (HLA genes within the major histocompatibility complex (MHC locus are strongly associated with disease susceptibility and prognosis for many diseases, including many autoimmune diseases. In this study, we developed a Korean HLA reference panel for imputing classical alleles and amino acid residues of several HLA genes. An HLA reference panel has potential for use in identifying and fine-mapping disease associations with the MHC locus in East Asian populations, including Koreans. A total of 413 unrelated Korean subjects were analyzed for single nucleotide polymorphisms (SNPs at the MHC locus and six HLA genes, including HLA-A, -B, -C, -DRB1, -DPB1, and -DQB1. The HLA reference panel was constructed by phasing the 5,858 MHC SNPs, 233 classical HLA alleles, and 1,387 amino acid residue markers from 1,025 amino acid positions as binary variables. The imputation accuracy of the HLA reference panel was assessed by measuring concordance rates between imputed and genotyped alleles of the HLA genes from a subset of the study subjects and East Asian HapMap individuals. Average concordance rates were 95.6% and 91.1% at 2-digit and 4-digit allele resolutions, respectively. The imputation accuracy was minimally affected by SNP density of a test dataset for imputation. In conclusion, the Korean HLA reference panel we developed was highly suitable for imputing HLA alleles and amino acids from MHC SNPs in East Asians, including Koreans.

  16. Analyzing the changing gender wage gap based on multiply inputed right censored wages

    OpenAIRE

    Gartner, Hermann; Rässler, Susanne

    2005-01-01

    In order to analyze the gender wage gap with the German IAB-employment sample we have to solve the problem of censored wages at the upper limit of the social security system. We treat this problem as a missing data problem. We regard the missingness mechanism as not missing at random (NMAR, according to Little and Rubin, 1987, 2002) as well as missing by design. The censored wages are multiply imputed by draws of a random variable from a truncated distribution. The multiple imputation is base...

  17. Using mi impute chained to fit ANCOVA models in randomized trials with censored dependent and independent variables

    DEFF Research Database (Denmark)

    Andersen, Andreas; Rieckmann, Andreas

    2016-01-01

    In this article, we illustrate how to use mi impute chained with intreg to fit an analysis of covariance analysis of censored and nondetectable immunological concentrations measured in a randomized pretest–posttest design.......In this article, we illustrate how to use mi impute chained with intreg to fit an analysis of covariance analysis of censored and nondetectable immunological concentrations measured in a randomized pretest–posttest design....

  18. Robust K-Median and K-Means Clustering Algorithms for Incomplete Data

    Directory of Open Access Journals (Sweden)

    Jinhua Li

    2016-01-01

    Full Text Available Incomplete data with missing feature values are prevalent in clustering problems. Traditional clustering methods first estimate the missing values by imputation and then apply the classical clustering algorithms for complete data, such as K-median and K-means. However, in practice, it is often hard to obtain accurate estimation of the missing values, which deteriorates the performance of clustering. To enhance the robustness of clustering algorithms, this paper represents the missing values by interval data and introduces the concept of robust cluster objective function. A minimax robust optimization (RO formulation is presented to provide clustering results, which are insensitive to estimation errors. To solve the proposed RO problem, we propose robust K-median and K-means clustering algorithms with low time and space complexity. Comparisons and analysis of experimental results on both artificially generated and real-world incomplete data sets validate the robustness and effectiveness of the proposed algorithms.

  19. METHODS FOR CLUSTERING TIME SERIES DATA ACQUIRED FROM MOBILE HEALTH APPS.

    Science.gov (United States)

    Tignor, Nicole; Wang, Pei; Genes, Nicholas; Rogers, Linda; Hershman, Steven G; Scott, Erick R; Zweig, Micol; Yvonne Chan, Yu-Feng; Schadt, Eric E

    2017-01-01

    In our recent Asthma Mobile Health Study (AMHS), thousands of asthma patients across the country contributed medical data through the iPhone Asthma Health App on a daily basis for an extended period of time. The collected data included daily self-reported asthma symptoms, symptom triggers, and real time geographic location information. The AMHS is just one of many studies occurring in the context of now many thousands of mobile health apps aimed at improving wellness and better managing chronic disease conditions, leveraging the passive and active collection of data from mobile, handheld smart devices. The ability to identify patient groups or patterns of symptoms that might predict adverse outcomes such as asthma exacerbations or hospitalizations from these types of large, prospectively collected data sets, would be of significant general interest. However, conventional clustering methods cannot be applied to these types of longitudinally collected data, especially survey data actively collected from app users, given heterogeneous patterns of missing values due to: 1) varying survey response rates among different users, 2) varying survey response rates over time of each user, and 3) non-overlapping periods of enrollment among different users. To handle such complicated missing data structure, we proposed a probability imputation model to infer missing data. We also employed a consensus clustering strategy in tandem with the multiple imputation procedure. Through simulation studies under a range of scenarios reflecting real data conditions, we identified favorable performance of the proposed method over other strategies that impute the missing value through low-rank matrix completion. When applying the proposed new method to study asthma triggers and symptoms collected as part of the AMHS, we identified several patient groups with distinct phenotype patterns. Further validation of the methods described in this paper might be used to identify clinically important

  20. Missed rib fractures on evaluation of initial chest CT for trauma patients: pattern analysis and diagnostic value of coronal multiplanar reconstruction images with multidetector row CT.

    Science.gov (United States)

    Cho, S H; Sung, Y M; Kim, M S

    2012-10-01

    The objective of this study was to review the prevalence and radiological features of rib fractures missed on initial chest CT evaluation, and to examine the diagnostic value of additional coronal images in a large series of trauma patients. 130 patients who presented to an emergency room for blunt chest trauma underwent multidetector row CT of the thorax within the first hour during their stay, and had follow-up CT or bone scans as diagnostic gold standards. Images were evaluated on two separate occasions: once with axial images and once with both axial and coronal images. The detection rates of missed rib fractures were compared between readings using a non-parametric method of clustered data. In the cases of missed rib fractures, the shapes, locations and associated fractures were evaluated. 58 rib fractures were missed with axial images only and 52 were missed with both axial and coronal images (p=0.088). The most common shape of missed rib fractures was buckled (56.9%), and the anterior arc (55.2%) was most commonly involved. 21 (36.2%) missed rib fractures had combined fractures on the same ribs, and 38 (65.5%) were accompanied by fracture on neighbouring ribs. Missed rib fractures are not uncommon, and radiologists should be familiar with buckle fractures, which are frequently missed. Additional coronal imagescan be helpful in the diagnosis of rib fractures that are not seen on axial images.

  1. Monthly water quality forecasting and uncertainty assessment via bootstrapped wavelet neural networks under missing data for Harbin, China.

    Science.gov (United States)

    Wang, Yi; Zheng, Tong; Zhao, Ying; Jiang, Jiping; Wang, Yuanyuan; Guo, Liang; Wang, Peng

    2013-12-01

    In this paper, bootstrapped wavelet neural network (BWNN) was developed for predicting monthly ammonia nitrogen (NH(4+)-N) and dissolved oxygen (DO) in Harbin region, northeast of China. The Morlet wavelet basis function (WBF) was employed as a nonlinear activation function of traditional three-layer artificial neural network (ANN) structure. Prediction intervals (PI) were constructed according to the calculated uncertainties from the model structure and data noise. Performance of BWNN model was also compared with four different models: traditional ANN, WNN, bootstrapped ANN, and autoregressive integrated moving average model. The results showed that BWNN could handle the severely fluctuating and non-seasonal time series data of water quality, and it produced better performance than the other four models. The uncertainty from data noise was smaller than that from the model structure for NH(4+)-N; conversely, the uncertainty from data noise was larger for DO series. Besides, total uncertainties in the low-flow period were the biggest due to complicated processes during the freeze-up period of the Songhua River. Further, a data missing-refilling scheme was designed, and better performances of BWNNs for structural data missing (SD) were observed than incidental data missing (ID). For both ID and SD, temporal method was satisfactory for filling NH(4+)-N series, whereas spatial imputation was fit for DO series. This filling BWNN forecasting method was applied to other areas suffering "real" data missing, and the results demonstrated its efficiency. Thus, the methods introduced here will help managers to obtain informed decisions.

  2. Partial imputation to improve predictive modelling in insurance risk classification using a hybrid positive selection algorithm and correlation-based feature selection

    CSIR Research Space (South Africa)

    Duma, M

    2013-09-01

    Full Text Available of missing data, with a decline in performance as the amount of missing data increases. Wagner et al.18 presented a study aimed at constructing a multimodal, ensemble of classifiers for emotion recog- nition with missing values in one or multiple... classification accuracies of 55%, which includes certain generic fusion schemes and emotion adapted strategies like arousal, valence and cross-axis. There are four kinds of missing data mechanisms found in the literature, namely missing at random (MAR), miss...

  3. A new effective method for estimating missing values in the sequence data prior to phylogenetic analysis

    Directory of Open Access Journals (Sweden)

    Abdoulaye Baniré Diallo

    2006-01-01

    Full Text Available In this article we address the problem of phylogenetic inference from nucleic acid data containing missing bases. We introduce a new effective approach, called “Probabilistic estimation of missing values” (PEMV, allowing one to estimate unknown nucleotides prior to computing the evolutionary distances between them. We show that the new method improves the accuracy of phylogenetic inference compared to the existing methods “Ignoring Missing Sites” (IMS, “Proportional Distribution of Missing and Ambiguous Bases” (PDMAB included in the PAUP software [26]. The proposed strategy for estimating missing nucleotides is based on probabilistic formulae developed in the framework of the Jukes-Cantor [10] and Kimura 2-parameter [11] models. The relative performances of the new method were assessed through simulations carried out with the SeqGen program [20], for data generation, and the BioNJ method [7], for inferring phylogenies. We also compared the new method to the DNAML program [5] and “Matrix Representation using Parsimony” (MRP [13], [19] considering an example of 66 eutherian mammals originally analyzed in [17].

  4. Improved Correction of Misclassification Bias With Bootstrap Imputation.

    Science.gov (United States)

    van Walraven, Carl

    2018-07-01

    Diagnostic codes used in administrative database research can create bias due to misclassification. Quantitative bias analysis (QBA) can correct for this bias, requires only code sensitivity and specificity, but may return invalid results. Bootstrap imputation (BI) can also address misclassification bias but traditionally requires multivariate models to accurately estimate disease probability. This study compared misclassification bias correction using QBA and BI. Serum creatinine measures were used to determine severe renal failure status in 100,000 hospitalized patients. Prevalence of severe renal failure in 86 patient strata and its association with 43 covariates was determined and compared with results in which renal failure status was determined using diagnostic codes (sensitivity 71.3%, specificity 96.2%). Differences in results (misclassification bias) were then corrected with QBA or BI (using progressively more complex methods to estimate disease probability). In total, 7.4% of patients had severe renal failure. Imputing disease status with diagnostic codes exaggerated prevalence estimates [median relative change (range), 16.6% (0.8%-74.5%)] and its association with covariates [median (range) exponentiated absolute parameter estimate difference, 1.16 (1.01-2.04)]. QBA produced invalid results 9.3% of the time and increased bias in estimates of both disease prevalence and covariate associations. BI decreased misclassification bias with increasingly accurate disease probability estimates. QBA can produce invalid results and increase misclassification bias. BI avoids invalid results and can importantly decrease misclassification bias when accurate disease probability estimates are used.

  5. Estimating Potential Evapotranspiration by Missing Temperature Data Reconstruction

    Directory of Open Access Journals (Sweden)

    Eladio Delgadillo-Ruiz

    2015-01-01

    Full Text Available This work studies the statistical characteristics of potential evapotranspiration calculations and their relevance within the water balance used to determine water availability in hydrological basins. The purpose of this study was as follows: first, to apply a missing data reconstruction scheme in weather stations of the Rio Queretaro basin; second, to reduce the generated uncertainty of temperature data: mean, minimum, and maximum values in the evapotranspiration calculation which has a paramount importance in the manner of obtaining the water balance at any hydrological basin. The reconstruction of missing data was carried out in three steps: (1 application of a 4-parameter sinusoidal type regression to temperature data, (2 linear regression to residuals to obtain a regional behavior, and (3 estimation of missing temperature values for a certain year and during a certain season within the basin under study; estimated and observed temperature values were compared. Finally, using the obtained temperature values, the methods of Hamon, Papadakis, Blaney and Criddle, Thornthwaite, and Hargreaves were employed to calculate potential evapotranspiration that was compared to the real observed values in weather stations. With the results obtained from the application of this procedure, the surface water balance was corrected for the case study.

  6. PREDICT: Pattern Representation and Evaluation of Data through Integration, Correlation, and Transformation

    Science.gov (United States)

    2015-10-01

    existing matrix factorization methods to adapt them to time series and potentially identify spurious outliers simultaneously. The best procedure we...unstable patients using linear projections of multivariate time series coupled with a simple color-coded scheme. It is unknown if this particular method ...interpolation and matrix factorization methods for imputing missing values in the multidimensional setting. We also explored custom extensions to

  7. Using imputed genotype data in the joint score tests for genetic association and gene-environment interactions in case-control studies.

    Science.gov (United States)

    Song, Minsun; Wheeler, William; Caporaso, Neil E; Landi, Maria Teresa; Chatterjee, Nilanjan

    2018-03-01

    Genome-wide association studies (GWAS) are now routinely imputed for untyped single nucleotide polymorphisms (SNPs) based on various powerful statistical algorithms for imputation trained on reference datasets. The use of predicted allele counts for imputed SNPs as the dosage variable is known to produce valid score test for genetic association. In this paper, we investigate how to best handle imputed SNPs in various modern complex tests for genetic associations incorporating gene-environment interactions. We focus on case-control association studies where inference for an underlying logistic regression model can be performed using alternative methods that rely on varying degree on an assumption of gene-environment independence in the underlying population. As increasingly large-scale GWAS are being performed through consortia effort where it is preferable to share only summary-level information across studies, we also describe simple mechanisms for implementing score tests based on standard meta-analysis of "one-step" maximum-likelihood estimates across studies. Applications of the methods in simulation studies and a dataset from GWAS of lung cancer illustrate ability of the proposed methods to maintain type-I error rates for the underlying testing procedures. For analysis of imputed SNPs, similar to typed SNPs, the retrospective methods can lead to considerable efficiency gain for modeling of gene-environment interactions under the assumption of gene-environment independence. Methods are made available for public use through CGEN R software package. © 2017 WILEY PERIODICALS, INC.

  8. The impact of different strategies to handle missing data on both precision and bias in a drug safety study: a multidatabase multinational population-based cohort study

    Science.gov (United States)

    Martín-Merino, Elisa; Calderón-Larrañaga, Amaia; Hawley, Samuel; Poblador-Plou, Beatriz; Llorente-García, Ana; Petersen, Irene; Prieto-Alhambra, Daniel

    2018-01-01

    Background Missing data are often an issue in electronic medical records (EMRs) research. However, there are many ways that people deal with missing data in drug safety studies. Aim To compare the risk estimates resulting from different strategies for the handling of missing data in the study of venous thromboembolism (VTE) risk associated with antiosteoporotic medications (AOM). Methods New users of AOM (alendronic acid, other bisphosphonates, strontium ranelate, selective estrogen receptor modulators, teriparatide, or denosumab) aged ≥50 years during 1998–2014 were identified in two Spanish (the Base de datos para la Investigación Farmacoepidemiológica en Atención Primaria [BIFAP] and EpiChron cohort) and one UK (Clinical Practice Research Datalink [CPRD]) EMR. Hazard ratios (HRs) according to AOM (with alendronic acid as reference) were calculated adjusting for VTE risk factors, body mass index (that was missing in 61% of patients included in the three databases), and smoking (that was missing in 23% of patients) in the year of AOM therapy initiation. HRs and standard errors obtained using cross-sectional multiple imputation (MI) (reference method) were compared to complete case (CC) analysis – using only patients with complete data – and longitudinal MI – adding to the cross-sectional MI model the body mass index/smoking values as recorded in the year before and after therapy initiation. Results Overall, 422/95,057 (0.4%), 19/12,688 (0.1%), and 2,051/161,202 (1.3%) VTE cases/participants were seen in BIFAP, EpiChron, and CPRD, respectively. HRs moved from 100.00% underestimation to 40.31% overestimation in CC compared with cross-sectional MI, while longitudinal MI methods provided similar risk estimates compared with cross-sectional MI. Precision for HR improved in cross-sectional MI versus CC by up to 160.28%, while longitudinal MI improved precision (compared with cross-sectional) only minimally (up to 0.80%). Conclusion CC may substantially

  9. Partial F-tests with multiply imputed data in the linear regression framework via coefficient of determination.

    Science.gov (United States)

    Chaurasia, Ashok; Harel, Ofer

    2015-02-10

    Tests for regression coefficients such as global, local, and partial F-tests are common in applied research. In the framework of multiple imputation, there are several papers addressing tests for regression coefficients. However, for simultaneous hypothesis testing, the existing methods are computationally intensive because they involve calculation with vectors and (inversion of) matrices. In this paper, we propose a simple method based on the scalar entity, coefficient of determination, to perform (global, local, and partial) F-tests with multiply imputed data. The proposed method is evaluated using simulated data and applied to suicide prevention data. Copyright © 2014 John Wiley & Sons, Ltd.

  10. Analysis of longitudinal data from animals with missing values using SPSS.

    Science.gov (United States)

    Duricki, Denise A; Soleman, Sara; Moon, Lawrence D F

    2016-06-01

    Testing of therapies for disease or injury often involves the analysis of longitudinal data from animals. Modern analytical methods have advantages over conventional methods (particularly when some data are missing), yet they are not used widely by preclinical researchers. Here we provide an easy-to-use protocol for the analysis of longitudinal data from animals, and we present a click-by-click guide for performing suitable analyses using the statistical package IBM SPSS Statistics software (SPSS). We guide readers through the analysis of a real-life data set obtained when testing a therapy for brain injury (stroke) in elderly rats. If a few data points are missing, as in this example data set (for example, because of animal dropout), repeated-measures analysis of covariance may fail to detect a treatment effect. An alternative analysis method, such as the use of linear models (with various covariance structures), and analysis using restricted maximum likelihood estimation (to include all available data) can be used to better detect treatment effects. This protocol takes 2 h to carry out.

  11. Implicit Valuation of the Near-Miss is Dependent on Outcome Context.

    Science.gov (United States)

    Banks, Parker J; Tata, Matthew S; Bennett, Patrick J; Sekuler, Allison B; Gruber, Aaron J

    2018-03-01

    Gambling studies have described a "near-miss effect" wherein the experience of almost winning increases gambling persistence. The near-miss has been proposed to inflate the value of preceding actions through its perceptual similarity to wins. We demonstrate here, however, that it acts as a conditioned stimulus to positively or negatively influence valuation, dependent on reward expectation and cognitive engagement. When subjects are asked to choose between two simulated slot machines, near-misses increase valuation of machines with a low payout rate, whereas they decrease valuation of high payout machines. This contextual effect impairs decisions and persists regardless of manipulations to outcome feedback or financial incentive provided for good performance. It is consistent with proposals that near-misses cause frustration when wins are expected, and we propose that it increases choice stochasticity and overrides avoidance of low-valued options. Intriguingly, the near-miss effect disappears when subjects are required to explicitly value machines by placing bets, rather than choosing between them. We propose that this task increases cognitive engagement and recruits participation of brain regions involved in cognitive processing, causing inhibition of otherwise dominant systems of decision-making. Our results reveal that only implicit, rather than explicit strategies of decision-making are affected by near-misses, and that the brain can fluidly shift between these strategies according to task demands.

  12. Applying an efficient K-nearest neighbor search to forest attribute imputation

    Science.gov (United States)

    Andrew O. Finley; Ronald E. McRoberts; Alan R. Ek

    2006-01-01

    This paper explores the utility of an efficient nearest neighbor (NN) search algorithm for applications in multi-source kNN forest attribute imputation. The search algorithm reduces the number of distance calculations between a given target vector and each reference vector, thereby, decreasing the time needed to discover the NN subset. Results of five trials show gains...

  13. Local exome sequences facilitate imputation of less common variants and increase power of genome wide association studies.

    Directory of Open Access Journals (Sweden)

    Peter K Joshi

    Full Text Available The analysis of less common variants in genome-wide association studies promises to elucidate complex trait genetics but is hampered by low power to reliably detect association. We show that addition of population-specific exome sequence data to global reference data allows more accurate imputation, particularly of less common SNPs (minor allele frequency 1-10% in two very different European populations. The imputation improvement corresponds to an increase in effective sample size of 28-38%, for SNPs with a minor allele frequency in the range 1-3%.

  14. KONVERGENSI ESTIMATOR DALAM MODEL MIXTURE BERBASIS MISSING DATA

    Directory of Open Access Journals (Sweden)

    N Dwidayati

    2014-06-01

    Full Text Available Abstrak __________________________________________________________________________________________ Model mixture dapat mengestimasi proporsi pasien yang sembuh (cured dan fungsi survival pasien tak sembuh (uncured. Pada kajian ini, model mixture dikembangkan untuk  analisis cure rate berbasis missing data. Ada beberapa metode yang dapat digunakan untuk analisis missing data. Salah satu metode yang dapat digunakan adalah Algoritma EM, Metode ini didasarkan pada 2 (dua langkah, yaitu: (1 Expectation Step dan (2 Maximization Step. Algoritma EM merupakan pendekatan iterasi untuk mempelajari model dari data dengan nilai hilang melalui 4 (empat langkah, yaitu(1 pilih himpunan inisial dari parameter untuk sebuah model, (2 tentukan nilai ekspektasi untuk data hilang, (3 buat induksi parameter model baru dari gabungan nilai ekspekstasi dan data asli, dan (4 jika parameter tidak converged, ulangi langkah 2 menggunakan model baru. Berdasar kajian yang dilakukan dapat ditunjukkan bahwa pada algoritma EM, log-likelihood untuk missing data mengalami kenaikan setelah dilakukan setiap iterasi dari algoritmanya. Dengan demikian berdasar algoritma EM, barisan likelihood konvergen jika likelihood terbatas ke bawah.   Abstract __________________________________________________________________________________________ Model mixture can estimate proportion of recovering patient  and function of patient survival do not recover. At this study, model mixture developed to analyse cure rate bases on missing data. There are some method which applicable to analyse missing data. One of method which can be applied is Algoritma EM, This method based on 2 ( two step, that is: ( 1 Expectation Step and ( 2 Maximization Step. EM Algorithm is approach of iteration to study model from data with value loses through 4 ( four step, yaitu(1 select;chooses initial gathering from parameter for a model, ( 2 determines expectation value for data to lose, ( 3 induce newfangled parameter

  15. Random Forest as an Imputation Method for Education and Psychology Research: Its Impact on Item Fit and Difficulty of the Rasch Model

    Science.gov (United States)

    Golino, Hudson F.; Gomes, Cristiano M. A.

    2016-01-01

    This paper presents a non-parametric imputation technique, named random forest, from the machine learning field. The random forest procedure has two main tuning parameters: the number of trees grown in the prediction and the number of predictors used. Fifty experimental conditions were created in the imputation procedure, with different…

  16. Family-based Association Analyses of Imputed Genotypes Reveal Genome-Wide Significant Association of Alzheimer’s disease with OSBPL6, PTPRG and PDCL3

    Science.gov (United States)

    Herold, Christine; Hooli, Basavaraj V.; Mullin, Kristina; Liu, Tian; Roehr, Johannes T; Mattheisen, Manuel; Parrado, Antonio R.; Bertram, Lars; Lange, Christoph; Tanzi, Rudolph E.

    2015-01-01

    The genetic basis of Alzheimer's disease (AD) is complex and heterogeneous. Over 200 highly penetrant pathogenic variants in the genes APP, PSEN1 and PSEN2 cause a subset of early-onset familial Alzheimer's disease (EOFAD). On the other hand, susceptibility to late-onset forms of AD (LOAD) is indisputably associated to the ε4 allele in the gene APOE, and more recently to variants in more than two-dozen additional genes identified in the large-scale genome-wide association studies (GWAS) and meta-analyses reports. Taken together however, although the heritability in AD is estimated to be as high as 80%, a large proportion of the underlying genetic factors still remain to be elucidated. In this study we performed a systematic family-based genome-wide association and meta-analysis on close to 15 million imputed variants from three large collections of AD families (~3,500 subjects from 1,070 families). Using a multivariate phenotype combining affection status and onset age, meta-analysis of the association results revealed three single nucleotide polymorphisms (SNPs) that achieved genome-wide significance for association with AD risk: rs7609954 in the gene PTPRG (P-value = 3.98·10−08), rs1347297 in the gene OSBPL6 (P-value = 4.53·10−08), and rs1513625 near PDCL3 (P-value = 4.28·10−08). In addition, rs72953347 in OSBPL6 (P-value = 6.36·10−07) and two SNPs in the gene CDKAL1 showed marginally significant association with LOAD (rs10456232, P-value: 4.76·10−07; rs62400067, P-value: 3.54·10−07). In summary, family-based GWAS meta-analysis of imputed SNPs revealed novel genomic variants in (or near) PTPRG, OSBPL6, and PDCL3 that influence risk for AD with genome-wide significance. PMID:26830138

  17. Establishing a threshold for the number of missing days using 7 d pedometer data.

    Science.gov (United States)

    Kang, Minsoo; Hart, Peter D; Kim, Youngdeok

    2012-11-01

    The purpose of this study was to examine the threshold of the number of missing days of recovery using the individual information (II)-centered approach. Data for this study came from 86 participants, aged from 17 to 79 years old, who had 7 consecutive days of complete pedometer (Yamax SW 200) wear. Missing datasets (1 d through 5 d missing) were created by a SAS random process 10,000 times each. All missing values were replaced using the II-centered approach. A 7 d average was calculated for each dataset, including the complete dataset. Repeated measure ANOVA was used to determine the differences between 1 d through 5 d missing datasets and the complete dataset. Mean absolute percentage error (MAPE) was also computed. Mean (SD) daily step count for the complete 7 d dataset was 7979 (3084). Mean (SD) values for the 1 d through 5 d missing datasets were 8072 (3218), 8066 (3109), 7968 (3273), 7741 (3050) and 8314 (3529), respectively (p > 0.05). The lower MAPEs were estimated for 1 d missing (5.2%, 95% confidence interval (CI) 4.4-6.0) and 2 d missing (8.4%, 95% CI 7.0-9.8), while all others were greater than 10%. The results of this study show that the 1 d through 5 d missing datasets, with replaced values, were not significantly different from the complete dataset. Based on the MAPE results, it is not recommended to replace more than two days of missing step counts.

  18. Estimating historical landfill quantities to predict methane emissions

    NARCIS (Netherlands)

    Lyons, S.; Murphy, L.; Tol, R.S.J.

    2010-01-01

    There are no observations for methane emissions from landfill waste in Ireland. Methane emissions are imputed from waste data. There are intermittent data on waste sent to landfill. We compare two alternative ways to impute the missing waste " data" and evaluate the impact on methane emissions. We

  19. The missing link between values and behaviour

    DEFF Research Database (Denmark)

    Brunsø, Karen

    For a long time human values have been perceived as abstrat cognitions representing desired goals or end-states which motivate humnan behaviour. A number of studies have tried to explore the link between values and behaviour, but often different constructs are included as intermediate links between...... values and specific behaviour, since values may be too abstract to influence behaviour directly. We propose the concept of lifestyle as a mediator between values and behaviour, and present our approach to lifestyle based on principles from cognitive psychology, where we distinguish between values...... and lifestyle and behaviour. Based on this appraoch we collected data covering values, lifestyle and behaviour, and estimated the cogntiive hierarchy from values to lifestyle to behaviour by structural equation models....

  20. Age at menopause: imputing age at menopause for women with a hysterectomy with application to risk of postmenopausal breast cancer

    Science.gov (United States)

    Rosner, Bernard; Colditz, Graham A.

    2011-01-01

    Purpose Age at menopause, a major marker in the reproductive life, may bias results for evaluation of breast cancer risk after menopause. Methods We follow 38,948 premenopausal women in 1980 and identify 2,586 who reported hysterectomy without bilateral oophorectomy, and 31,626 who reported natural menopause during 22 years of follow-up. We evaluate risk factors for natural menopause, impute age at natural menopause for women reporting hysterectomy without bilateral oophorectomy and estimate the hazard of reaching natural menopause in the next 2 years. We apply this imputed age at menopause to both increase sample size and to evaluate the relation between postmenopausal exposures and risk of breast cancer. Results Age, cigarette smoking, age at menarche, pregnancy history, body mass index, history of benign breast disease, and history of breast cancer were each significantly related to age at natural menopause; duration of oral contraceptive use and family history of breast cancer were not. The imputation increased sample size substantially and although some risk factors after menopause were weaker in the expanded model (height, and alcohol use), use of hormone therapy is less biased. Conclusions Imputing age at menopause increases sample size, broadens generalizability making it applicable to women with hysterectomy, and reduces bias. PMID:21441037

  1. Bayesian Sensitivity Analysis of Statistical Models with Missing Data.

    Science.gov (United States)

    Zhu, Hongtu; Ibrahim, Joseph G; Tang, Niansheng

    2014-04-01

    Methods for handling missing data depend strongly on the mechanism that generated the missing values, such as missing completely at random (MCAR) or missing at random (MAR), as well as other distributional and modeling assumptions at various stages. It is well known that the resulting estimates and tests may be sensitive to these assumptions as well as to outlying observations. In this paper, we introduce various perturbations to modeling assumptions and individual observations, and then develop a formal sensitivity analysis to assess these perturbations in the Bayesian analysis of statistical models with missing data. We develop a geometric framework, called the Bayesian perturbation manifold, to characterize the intrinsic structure of these perturbations. We propose several intrinsic influence measures to perform sensitivity analysis and quantify the effect of various perturbations to statistical models. We use the proposed sensitivity analysis procedure to systematically investigate the tenability of the non-ignorable missing at random (NMAR) assumption. Simulation studies are conducted to evaluate our methods, and a dataset is analyzed to illustrate the use of our diagnostic measures.

  2. Analyzing time-ordered event data with missed observations.

    Science.gov (United States)

    Dokter, Adriaan M; van Loon, E Emiel; Fokkema, Wimke; Lameris, Thomas K; Nolet, Bart A; van der Jeugd, Henk P

    2017-09-01

    A common problem with observational datasets is that not all events of interest may be detected. For example, observing animals in the wild can difficult when animals move, hide, or cannot be closely approached. We consider time series of events recorded in conditions where events are occasionally missed by observers or observational devices. These time series are not restricted to behavioral protocols, but can be any cyclic or recurring process where discrete outcomes are observed. Undetected events cause biased inferences on the process of interest, and statistical analyses are needed that can identify and correct the compromised detection processes. Missed observations in time series lead to observed time intervals between events at multiples of the true inter-event time, which conveys information on their detection probability. We derive the theoretical probability density function for observed intervals between events that includes a probability of missed detection. Methodology and software tools are provided for analysis of event data with potential observation bias and its removal. The methodology was applied to simulation data and a case study of defecation rate estimation in geese, which is commonly used to estimate their digestive throughput and energetic uptake, or to calculate goose usage of a feeding site from dropping density. Simulations indicate that at a moderate chance to miss arrival events ( p  = 0.3), uncorrected arrival intervals were biased upward by up to a factor 3, while parameter values corrected for missed observations were within 1% of their true simulated value. A field case study shows that not accounting for missed observations leads to substantial underestimates of the true defecation rate in geese, and spurious rate differences between sites, which are introduced by differences in observational conditions. These results show that the derived methodology can be used to effectively remove observational biases in time-ordered event

  3. Missing money and missing markets: Reliability, capacity auctions and interconnectors

    International Nuclear Information System (INIS)

    Newbery, David

    2016-01-01

    In the energy trilemma of reliability, sustainability and affordability, politicians treat reliability as over-riding. The EU assumes the energy-only Target Electricity Model will deliver reliability but the UK argues that a capacity remuneration mechanism is needed. This paper argues that capacity auctions tend to over-procure capacity, exacerbating the missing money problem they were designed to address. The bias is further exacerbated by failing to address some of the missing market problems also neglected in the debate. It examines the case for, criticisms of, and outcome of the first GB capacity auction and problems of trading between different capacity markets. - Highlights: •Energy-only markets can work if they avoid missing money and missing market problems. •Policy makers over-estimate the cost of so-called “loss of load events”. •Policy makers tend to over-procure capacity, exacerbating the missing money problem. •Rectifying missing market problems simplifies trade between different capacity markets. •Addressing missing market problems makes under-procurement cheaper than over-procurement.

  4. Imputation of genotypes from low density (50,000 markers) to high density (700,000 markers) of cows from research herds in Europe, North America, and Australasia using 2 reference populations

    DEFF Research Database (Denmark)

    Pryce, J E; Johnston, J; Hayes, B J

    2014-01-01

    detection in genome-wide association studies and the accuracy of genomic selection may increase when the low-density genotypes are imputed to higher density. Genotype data were available from 10 research herds: 5 from Europe [Denmark, Germany, Ireland, the Netherlands, and the United Kingdom (UK)], 2 from...... reference populations. Although it was not possible to use a combined reference population, which would probably result in the highest accuracies of imputation, differences arising from using 2 high-density reference populations on imputing 50,000-marker genotypes of 583 animals (from the UK) were...... information exploited. The UK animals were also included in the North American data set (n = 1,579) that was imputed to high density using a reference population of 2,018 bulls. After editing, 591,213 genotypes on 5,999 animals from 10 research herds remained. The correlation between imputed allele...

  5. The role of proxy information in missing data analysis.

    Science.gov (United States)

    Huang, Rong; Liang, Yuanyuan; Carrière, K C

    2005-10-01

    This article investigates the role of proxy data in dealing with the common problem of missing data in clinical trials using repeated measures designs. In an effort to avoid the missing data situation, some proxy information can be gathered. The question is how to treat proxy information, that is, is it always better to utilize proxy information when there are missing data? A model for repeated measures data with missing values is considered and a strategy for utilizing proxy information is developed. Then, simulations are used to compare the power of a test using proxy to simply utilizing all available data. It is concluded that using proxy information can be a useful alternative when such information is available. The implications for various clinical designs are also considered and a data collection strategy for efficiently estimating parameters is suggested.

  6. Missed losses loom larger than missed gains: Electrodermal reactivity to decision choices and outcomes in a gambling task.

    Science.gov (United States)

    Wu, Yin; Van Dijk, Eric; Aitken, Mike; Clark, Luke

    2016-04-01

    Loss aversion is a defining characteristic of prospect theory, whereby responses are stronger to losses than to equivalently sized gains (Kahneman & Tversky Econometrica, 47, 263-291, 1979). By monitoring electrodermal activity (EDA) during a gambling task, in this study we examined physiological activity during risky decisions, as well as to both obtained (e.g., gains and losses) and counterfactual (e.g., narrowly missed gains and losses) outcomes. During the bet selection phase, EDA increased linearly with bet size, highlighting the role of somatic signals in decision-making under uncertainty in a task without any learning requirement. Outcome-related EDA scaled with the magnitudes of monetary wins and losses, and losses had a stronger impact on EDA than did equivalently sized wins. Narrowly missed wins (i.e., near-wins) and narrowly missed losses (i.e., near-losses) also evoked EDA responses, and the change of EDA as a function of the size of the missed outcome was modestly greater for near-losses than for near-wins, suggesting that near-losses have more impact on subjective value than do near-wins. Across individuals, the slope for choice-related EDA (as a function of bet size) correlated with the slope for outcome-related EDA as a function of both the obtained and counterfactual outcome magnitudes, and these correlations were stronger for loss and near-loss conditions than for win and near-win conditions. Taken together, these asymmetrical EDA patterns to objective wins and losses, as well as to near-wins and near-losses, provide a psychophysiological instantiation of the value function curve in prospect theory, which is steeper in the negative than in the positive domain.

  7. Dynamic reliability assessment and prediction for repairable systems with interval-censored data

    International Nuclear Information System (INIS)

    Peng, Yizhen; Wang, Yu; Zi, YanYang; Tsui, Kwok-Leung; Zhang, Chuhua

    2017-01-01

    The ‘Test, Analyze and Fix’ process is widely applied to improve the reliability of a repairable system. In this process, dynamic reliability assessment for the system has been paid a great deal of attention. Due to instrument malfunctions, staff omissions and imperfect inspection strategies, field reliability data are often subject to interval censoring, making dynamic reliability assessment become a difficult task. Most traditional methods assume this kind of data as multiple normal distributed variables or the missing mechanism as missing at random, which may cause a large bias in parameter estimation. This paper proposes a novel method to evaluate and predict the dynamic reliability of a repairable system subject to interval-censored problem. First, a multiple imputation strategy based on the assumption that the reliability growth trend follows a nonhomogeneous Poisson process is developed to derive the distributions of missing data. Second, a new order statistic model that can transfer the dependent variables into independent variables is developed to simplify the imputation procedure. The unknown parameters of the model are iteratively inferred by the Monte Carlo expectation maximization (MCEM) algorithm. Finally, to verify the effectiveness of the proposed method, a simulation and a real case study for gas pipeline compressor system are implemented. - Highlights: • A new multiple imputation strategy was developed to derive the PDF of missing data. • A new order statistic model was developed to simplify the imputation procedure. • The parameters of the order statistic model were iteratively inferred by MCEM. • A real cases study was conducted to verify the effectiveness of the proposed method.

  8. Inference for multivariate regression model based on multiply imputed synthetic data generated via posterior predictive sampling

    Science.gov (United States)

    Moura, Ricardo; Sinha, Bimal; Coelho, Carlos A.

    2017-06-01

    The recent popularity of the use of synthetic data as a Statistical Disclosure Control technique has enabled the development of several methods of generating and analyzing such data, but almost always relying in asymptotic distributions and in consequence being not adequate for small sample datasets. Thus, a likelihood-based exact inference procedure is derived for the matrix of regression coefficients of the multivariate regression model, for multiply imputed synthetic data generated via Posterior Predictive Sampling. Since it is based in exact distributions this procedure may even be used in small sample datasets. Simulation studies compare the results obtained from the proposed exact inferential procedure with the results obtained from an adaptation of Reiters combination rule to multiply imputed synthetic datasets and an application to the 2000 Current Population Survey is discussed.

  9. Missing Data Bias on a Selective Hedging Strategy

    Directory of Open Access Journals (Sweden)

    Kiss Gábor Dávid

    2017-03-01

    Full Text Available Foreign exchange rates affect corporate profitability both on the macro and cash-flow level. The current study analyses the bias of missing data on a selective hedging strategy, where currency options are applied in case of Value at Risk (1% signs. However, there can be special occasions when one or some data is missing due to lack of a trading activity. This paper focuses on the impact of different missing data handling methods on GARCH and Value at Risk model parameters, because of selective hedging and option pricing based on them. The main added value of the current paper is the comparison of the impact of different methods, such as listwise deletion, mean substitution, and maximum likelihood based Expectation Maximization, on risk management because this subject has insufficient literature. The current study tested daily closing data of floating currencies from Kenya (KES, Ghana (GHS, South Africa (ZAR, Tanzania (TZS, Uganda (UGX, Gambia (GMD, Madagascar (MGA and Mozambique (MZN in USD denomination against EUR/USD rate between March 8, 2000 and March 6, 2015 acquired from the Bloomberg database. Our results suggested the biases of missingness on Value at Risk and volatility models, presenting significant differences among the number of extreme fluctuations or model parameters. A selective hedging strategy can have different expenditures due to the choice of method. This paper suggests the usage of mean substitution or listwise deletion for daily financial time series due to their tendency to have a close to zero first momentum

  10. Missing data and the accuracy of magnetic-observatory hour means

    Directory of Open Access Journals (Sweden)

    J. J. Love

    2009-09-01

    Full Text Available Analysis is made of the accuracy of magnetic-observatory hourly means constructed from definitive minute data having missing values (gaps. Bootstrap sampling from different data-gap distributions is used to estimate average errors on hourly means as a function of the number of missing data. Absolute and relative error results are calculated for horizontal-intensity, declination, and vertical-component data collected at high, medium, and low magnetic latitudes. For 90% complete coverage (10% missing data, average (RMS absolute errors on hourly means are generally less than errors permitted by Intermagnet for minute data. As a rule of thumb, the average relative error for hourly means with 10% missing minute data is approximately equal to 10% of the hourly standard deviation of the source minute data.

  11. Multiple imputation to account for measurement error in marginal structural models

    Science.gov (United States)

    Edwards, Jessie K.; Cole, Stephen R.; Westreich, Daniel; Crane, Heidi; Eron, Joseph J.; Mathews, W. Christopher; Moore, Richard; Boswell, Stephen L.; Lesko, Catherine R.; Mugavero, Michael J.

    2015-01-01

    Background Marginal structural models are an important tool for observational studies. These models typically assume that variables are measured without error. We describe a method to account for differential and non-differential measurement error in a marginal structural model. Methods We illustrate the method estimating the joint effects of antiretroviral therapy initiation and current smoking on all-cause mortality in a United States cohort of 12,290 patients with HIV followed for up to 5 years between 1998 and 2011. Smoking status was likely measured with error, but a subset of 3686 patients who reported smoking status on separate questionnaires composed an internal validation subgroup. We compared a standard joint marginal structural model fit using inverse probability weights to a model that also accounted for misclassification of smoking status using multiple imputation. Results In the standard analysis, current smoking was not associated with increased risk of mortality. After accounting for misclassification, current smoking without therapy was associated with increased mortality [hazard ratio (HR): 1.2 (95% CI: 0.6, 2.3)]. The HR for current smoking and therapy (0.4 (95% CI: 0.2, 0.7)) was similar to the HR for no smoking and therapy (0.4; 95% CI: 0.2, 0.6). Conclusions Multiple imputation can be used to account for measurement error in concert with methods for causal inference to strengthen results from observational studies. PMID:26214338

  12. Multiple Imputation to Account for Measurement Error in Marginal Structural Models.

    Science.gov (United States)

    Edwards, Jessie K; Cole, Stephen R; Westreich, Daniel; Crane, Heidi; Eron, Joseph J; Mathews, W Christopher; Moore, Richard; Boswell, Stephen L; Lesko, Catherine R; Mugavero, Michael J

    2015-09-01

    Marginal structural models are an important tool for observational studies. These models typically assume that variables are measured without error. We describe a method to account for differential and nondifferential measurement error in a marginal structural model. We illustrate the method estimating the joint effects of antiretroviral therapy initiation and current smoking on all-cause mortality in a United States cohort of 12,290 patients with HIV followed for up to 5 years between 1998 and 2011. Smoking status was likely measured with error, but a subset of 3,686 patients who reported smoking status on separate questionnaires composed an internal validation subgroup. We compared a standard joint marginal structural model fit using inverse probability weights to a model that also accounted for misclassification of smoking status using multiple imputation. In the standard analysis, current smoking was not associated with increased risk of mortality. After accounting for misclassification, current smoking without therapy was associated with increased mortality (hazard ratio [HR]: 1.2 [95% confidence interval [CI] = 0.6, 2.3]). The HR for current smoking and therapy [0.4 (95% CI = 0.2, 0.7)] was similar to the HR for no smoking and therapy (0.4; 95% CI = 0.2, 0.6). Multiple imputation can be used to account for measurement error in concert with methods for causal inference to strengthen results from observational studies.

  13. Predictive model for survival in patients with gastric cancer.

    Science.gov (United States)

    Goshayeshi, Ladan; Hoseini, Benyamin; Yousefli, Zahra; Khooie, Alireza; Etminani, Kobra; Esmaeilzadeh, Abbas; Golabpour, Amin

    2017-12-01

    Gastric cancer is one of the most prevalent cancers in the world. Characterized by poor prognosis, it is a frequent cause of cancer in Iran. The aim of the study was to design a predictive model of survival time for patients suffering from gastric cancer. This was a historical cohort conducted between 2011 and 2016. Study population were 277 patients suffering from gastric cancer. Data were gathered from the Iranian Cancer Registry and the laboratory of Emam Reza Hospital in Mashhad, Iran. Patients or their relatives underwent interviews where it was needed. Missing values were imputed by data mining techniques. Fifteen factors were analyzed. Survival was addressed as a dependent variable. Then, the predictive model was designed by combining both genetic algorithm and logistic regression. Matlab 2014 software was used to combine them. Of the 277 patients, only survival of 80 patients was available whose data were used for designing the predictive model. Mean ?SD of missing values for each patient was 4.43?.41 combined predictive model achieved 72.57% accuracy. Sex, birth year, age at diagnosis time, age at diagnosis time of patients' family, family history of gastric cancer, and family history of other gastrointestinal cancers were six parameters associated with patient survival. The study revealed that imputing missing values by data mining techniques have a good accuracy. And it also revealed six parameters extracted by genetic algorithm effect on the survival of patients with gastric cancer. Our combined predictive model, with a good accuracy, is appropriate to forecast the survival of patients suffering from Gastric cancer. So, we suggest policy makers and specialists to apply it for prediction of patients' survival.

  14. Nurses' personal and ward accountability and missed nursing care: A cross-sectional study.

    Science.gov (United States)

    Srulovici, Einav; Drach-Zahavy, Anat

    2017-10-01

    Missed nursing care is considered an act of omission with potentially detrimental consequences for patients, nurses, and organizations. Although the theoretical conceptualization of missed nursing care specifies nurses' values, attitudes, and perceptions of their work environment as its core antecedents, empirical studies have mainly focused on nurses' socio-demographic and professional attributes. Furthermore, assessment of missed nursing care has been mainly based on same-source methods. This study aimed to test the joint effects of personal and ward accountability on missed nursing care, by using both focal (the nurse whose missed nursing care is examined) and incoming (the nurse responsible for the same patients at the subsequent shift) nurses' assessments of missed nursing care. A cross-sectional design, where nurses were nested in wards. A total of 172 focal and 123 incoming nurses from 32 nursing wards in eight hospitals. Missed nursing care was assessed with the 22-item MISSCARE survey using two sources: focal and incoming nurses. Personal and ward accountability were assessed by the focal nurse with two 19-item scales. Nurses' socio-demographics and ward and shift characteristics were also collected. Mixed linear models were used as the analysis strategy. Focal and incoming nurses reported occasional missed nursing care of the focal nurse (Mean=1.87, SD=0.71 and Mean=2.09, SD=0.84, respectively; r=0.55, ppersonal socio-demographic characteristics, higher personal accountability was significantly associated with decreased missed care (β=-0.29, p0.05). The interaction effect was significant (β=-0.31, ppersonal accountability and missed nursing care. Similar patterns were obtained for the incoming nurses' assessment of focal nurse's missed care. Use of focal and incoming nurses' missed nursing care assessments limited the common source bias and strengthened our findings. Personal and ward accountability are significant values, which are associated with

  15. Estimating Stand Height and Tree Density in Pinus taeda plantations using in-situ data, airborne LiDAR and k-Nearest Neighbor Imputation.

    Science.gov (United States)

    Silva, Carlos Alberto; Klauberg, Carine; Hudak, Andrew T; Vierling, Lee A; Liesenberg, Veraldo; Bernett, Luiz G; Scheraiber, Clewerson F; Schoeninger, Emerson R

    2018-01-01

    Accurate forest inventory is of great economic importance to optimize the entire supply chain management in pulp and paper companies. The aim of this study was to estimate stand dominate and mean heights (HD and HM) and tree density (TD) of Pinus taeda plantations located in South Brazil using in-situ measurements, airborne Light Detection and Ranging (LiDAR) data and the non- k-nearest neighbor (k-NN) imputation. Forest inventory attributes and LiDAR derived metrics were calculated at 53 regular sample plots and we used imputation models to retrieve the forest attributes at plot and landscape-levels. The best LiDAR-derived metrics to predict HD, HM and TD were H99TH, HSD, SKE and HMIN. The Imputation model using the selected metrics was more effective for retrieving height than tree density. The model coefficients of determination (adj.R2) and a root mean squared difference (RMSD) for HD, HM and TD were 0.90, 0.94, 0.38m and 6.99, 5.70, 12.92%, respectively. Our results show that LiDAR and k-NN imputation can be used to predict stand heights with high accuracy in Pinus taeda. However, furthers studies need to be realized to improve the accuracy prediction of TD and to evaluate and compare the cost of acquisition and processing of LiDAR data against the conventional inventory procedures.

  16. On testing the missing at random assumption

    DEFF Research Database (Denmark)

    Jaeger, Manfred

    2006-01-01

    Most approaches to learning from incomplete data are based on the assumption that unobserved values are missing at random (mar). While the mar assumption, as such, is not testable, it can become testable in the context of other distributional assumptions, e.g. the naive Bayes assumption...

  17. Using linked educational attainment data to reduce bias due to missing outcome data in estimates of the association between the duration of breastfeeding and IQ at 15 years.

    Science.gov (United States)

    Cornish, Rosie P; Tilling, Kate; Boyd, Andy; Davies, Amy; Macleod, John

    2015-06-01

    Most epidemiological studies have missing information, leading to reduced power and potential bias. Estimates of exposure-outcome associations will generally be biased if the outcome variable is missing not at random (MNAR). Linkage to administrative data containing a proxy for the missing study outcome allows assessment of whether this outcome is MNAR and the evaluation of bias. We examined this in relation to the association between infant breastfeeding and IQ at 15 years, where a proxy for IQ was available through linkage to school attainment data. Subjects were those who enrolled in the Avon Longitudinal Study of Parents and Children in 1990-91 (n = 13 795), of whom 5023 had IQ measured at age 15. For those with missing IQ, 7030 (79%) had information on educational attainment at age 16 obtained through linkage to the National Pupil Database. The association between duration of breastfeeding and IQ was estimated using a complete case analysis, multiple imputation and inverse probability-of-missingness weighting; these estimates were then compared with those derived from analyses informed by the linkage. IQ at 15 was MNAR-individuals with higher attainment were less likely to have missing IQ data, even after adjusting for socio-demographic factors. All the approaches underestimated the association between breastfeeding and IQ compared with analyses informed by linkage. Linkage to administrative data containing a proxy for the outcome variable allows the MNAR assumption to be tested and more efficient analyses to be performed. Under certain circumstances, this may produce unbiased results. © The Author 2015. Published by Oxford University Press on behalf of the International Epidemiological Association.

  18. Double sampling with multiple imputation to answer large sample meta-research questions: Introduction and illustration by evaluating adherence to two simple CONSORT guidelines

    Directory of Open Access Journals (Sweden)

    Patrice L. Capers

    2015-03-01

    Full Text Available BACKGROUND: Meta-research can involve manual retrieval and evaluation of research, which is resource intensive. Creation of high throughput methods (e.g., search heuristics, crowdsourcing has improved feasibility of large meta-research questions, but possibly at the cost of accuracy. OBJECTIVE: To evaluate the use of double sampling combined with multiple imputation (DS+MI to address meta-research questions, using as an example adherence of PubMed entries to two simple Consolidated Standards of Reporting Trials (CONSORT guidelines for titles and abstracts. METHODS: For the DS large sample, we retrieved all PubMed entries satisfying the filters: RCT; human; abstract available; and English language (n=322,107. For the DS subsample, we randomly sampled 500 entries from the large sample. The large sample was evaluated with a lower rigor, higher throughput (RLOTHI method using search heuristics, while the subsample was evaluated using a higher rigor, lower throughput (RHITLO human rating method. Multiple imputation of the missing-completely-at-random RHITLO data for the large sample was informed by: RHITLO data from the subsample; RLOTHI data from the large sample; whether a study was an RCT; and country and year of publication. RESULTS: The RHITLO and RLOTHI methods in the subsample largely agreed (phi coefficients: title=1.00, abstract=0.92. Compliance with abstract and title criteria has increased over time, with non-US countries improving more rapidly. DS+MI logistic regression estimates were more precise than subsample estimates (e.g., 95% CI for change in title and abstract compliance by Year: subsample RHITLO 1.050-1.174 vs. DS+MI 1.082-1.151. As evidence of improved accuracy, DS+MI coefficient estimates were closer to RHITLO than the large sample RLOTHI. CONCLUSIONS: Our results support our hypothesis that DS+MI would result in improved precision and accuracy. This method is flexible and may provide a practical way to examine large corpora of

  19. The Estimation of Gestational Age at Birth in Database Studies.

    Science.gov (United States)

    Eberg, Maria; Platt, Robert W; Filion, Kristian B

    2017-11-01

    Studies on the safety of prenatal medication use require valid estimation of the pregnancy duration. However, gestational age is often incompletely recorded in administrative and clinical databases. Our objective was to compare different approaches to estimating the pregnancy duration. Using data from the Clinical Practice Research Datalink and Hospital Episode Statistics, we examined the following four approaches to estimating missing gestational age: (1) generalized estimating equations for longitudinal data; (2) multiple imputation; (3) estimation based on fetal birth weight and sex; and (4) conventional approaches that assigned a fixed value (39 weeks for all or 39 weeks for full term and 35 weeks for preterm). The gestational age recorded in Hospital Episode Statistics was considered the gold standard. We conducted a simulation study comparing the described approaches in terms of estimated bias and mean square error. A total of 25,929 infants from 22,774 mothers were included in our "gold standard" cohort. The smallest average absolute bias was observed for the generalized estimating equation that included birth weight, while the largest absolute bias occurred when assigning 39-week gestation to all those with missing values. The smallest mean square errors were detected with generalized estimating equations while multiple imputation had the highest mean square errors. The use of generalized estimating equations resulted in the most accurate estimation of missing gestational age when birth weight information was available. In the absence of birth weight, assignment of fixed gestational age based on term/preterm status may be the optimal approach.

  20. Experimental Determination of Effectiveness of an Individual Information-Centered Approach in Recovering Step-Count Missing Data

    Science.gov (United States)

    Kang, Minsoo; Zhu, Weimo; Tudor-Locke, Catrine; Ainsworth, Barbara

    2005-01-01

    Missing values are a common phenomenon in physical activity research, which has a negative impact on the quality of the data collected. The purpose of this study was to determine empirically the effectiveness of an individual information-centered (II-centered) approach in recovering step-count missing values by comparing the performance of the…

  1. 40 CFR 98.55 - Procedures for estimating missing data.

    Science.gov (United States)

    2010-07-01

    ... accounting purposes (such as sales records). (b) For missing values related to the performance test, including emission factors, production rate, and N2O concentration, you must conduct a new performance test...

  2. 40 CFR 98.225 - Procedures for estimating missing data.

    Science.gov (United States)

    2010-07-01

    ... accounting purposes (such as sales records). (b) For missing values related to the performance test, including emission factors, production rate, and N2O concentration, you must conduct a new performance test...

  3. Use of a Pre-Insertion Resistor to Minimize Zero-Missing Phenomenon and Switching Overvoltages

    DEFF Research Database (Denmark)

    Bak, Claus Leth; da Silva, Filipe Miguel Faria; Gudmundsdottir, Unnur Stella

    2009-01-01

    With the increasing use of High-Voltage Cables, which have different electric characteristics from Overhead Lines, phenomenon like current zero-missing start to appear more often on the transmission systems. Methods to prevent zero-missing phenomenon are still being studied and compared to see wh...... an optimal value of the resistance of the pre-insertion resistor that results in minimizing both the zero-missing phenomenon and switching overvoltages simultaneously....

  4. Genome of the Netherlands population-specific imputations identify an ABCA6 variant associated with cholesterol levels

    NARCIS (Netherlands)

    van Leeuwen, E.M.; Karssen, L.C.; Deelen, J.; Isaacs, A.; Medina-Gomez, C.; Mbarek, H.; Kanterakis, A.; Trompet, S.; Postmus, I.; Verweij, N.; van Enckevort, D.; Huffman, J.E.; White, C.C.; Feitosa, M.F.; Bartz, T.M.; Manichaikul, A.; Joshi, P.K.; Peloso, G.M.; Deelen, P.; Dijk, F.; Willemsen, G.; de Geus, E.J.C.; Milaneschi, Y.; Penninx, B.W.J.H.; Francioli, L.C.; Menelaou, A.; Pulit, S.L.; Rivadeneira, F.; Hofman, A.; Oostra, B.A.; Franco, O.H.; Mateo Leach, I.; Beekman, M.; de Craen, A.J.; Uh, H.W.; Trochet, H.; Hocking, L.J.; Porteous, D.J.; Sattar, N.; Packard, C.J.; Buckley, B.M.; Brody, J.A.; Bis, J.C.; Rotter, J.I.; Mychaleckyj, J.C.; Campbell, H.; Duan, Q.; Lange, L.A.; Wilson, J.F.; Hayward, C.; Polasek, O.; Vitart, V.; Rudan, I.; Wright, A.F.; Rich, S.S.; Psaty, B.M.; Borecki, I.B.; Kearney, P.M.; Stott, D.J.; Cupples, L.A.; Jukema, J.W.; van der Harst, P.; Sijbrands, E.J.; Hottenga, J.J.; Uitterlinden, A.G.; Swertz, M.A.; van Ommen, G.J.B; Bakker, P.I.W.; Slagboom, P.E.; Boomsma, D.I.; Wijmenga, C.; van Duijn, C.M.

    2015-01-01

    Variants associated with blood lipid levels may be population-specific. To identify low-frequency variants associated with this phenotype, population-specific reference panels may be used. Here we impute nine large Dutch biobanks (∼35,000 samples) with the population-specific reference panel created

  5. Estimating Stand Height and Tree Density in Pinus taeda plantations using in-situ data, airborne LiDAR and k-Nearest Neighbor Imputation

    Directory of Open Access Journals (Sweden)

    CARLOS ALBERTO SILVA

    Full Text Available ABSTRACT Accurate forest inventory is of great economic importance to optimize the entire supply chain management in pulp and paper companies. The aim of this study was to estimate stand dominate and mean heights (HD and HM and tree density (TD of Pinus taeda plantations located in South Brazil using in-situ measurements, airborne Light Detection and Ranging (LiDAR data and the non- k-nearest neighbor (k-NN imputation. Forest inventory attributes and LiDAR derived metrics were calculated at 53 regular sample plots and we used imputation models to retrieve the forest attributes at plot and landscape-levels. The best LiDAR-derived metrics to predict HD, HM and TD were H99TH, HSD, SKE and HMIN. The Imputation model using the selected metrics was more effective for retrieving height than tree density. The model coefficients of determination (adj.R2 and a root mean squared difference (RMSD for HD, HM and TD were 0.90, 0.94, 0.38m and 6.99, 5.70, 12.92%, respectively. Our results show that LiDAR and k-NN imputation can be used to predict stand heights with high accuracy in Pinus taeda. However, furthers studies need to be realized to improve the accuracy prediction of TD and to evaluate and compare the cost of acquisition and processing of LiDAR data against the conventional inventory procedures.

  6. Genome of the Netherlands population-specific imputations identify an ABCA6 variant associated with cholesterol levels

    Science.gov (United States)

    van Leeuwen, Elisabeth M.; Karssen, Lennart C.; Deelen, Joris; Isaacs, Aaron; Medina-Gomez, Carolina; Mbarek, Hamdi; Kanterakis, Alexandros; Trompet, Stella; Postmus, Iris; Verweij, Niek; van Enckevort, David J.; Huffman, Jennifer E.; White, Charles C.; Feitosa, Mary F.; Bartz, Traci M.; Manichaikul, Ani; Joshi, Peter K.; Peloso, Gina M.; Deelen, Patrick; van Dijk, Freerk; Willemsen, Gonneke; de Geus, Eco J.; Milaneschi, Yuri; Penninx, Brenda W.J.H.; Francioli, Laurent C.; Menelaou, Androniki; Pulit, Sara L.; Rivadeneira, Fernando; Hofman, Albert; Oostra, Ben A.; Franco, Oscar H.; Leach, Irene Mateo; Beekman, Marian; de Craen, Anton J.M.; Uh, Hae-Won; Trochet, Holly; Hocking, Lynne J.; Porteous, David J.; Sattar, Naveed; Packard, Chris J.; Buckley, Brendan M.; Brody, Jennifer A.; Bis, Joshua C.; Rotter, Jerome I.; Mychaleckyj, Josyf C.; Campbell, Harry; Duan, Qing; Lange, Leslie A.; Wilson, James F.; Hayward, Caroline; Polasek, Ozren; Vitart, Veronique; Rudan, Igor; Wright, Alan F.; Rich, Stephen S.; Psaty, Bruce M.; Borecki, Ingrid B.; Kearney, Patricia M.; Stott, David J.; Adrienne Cupples, L.; Neerincx, Pieter B.T.; Elbers, Clara C.; Francesco Palamara, Pier; Pe'er, Itsik; Abdellaoui, Abdel; Kloosterman, Wigard P.; van Oven, Mannis; Vermaat, Martijn; Li, Mingkun; Laros, Jeroen F.J.; Stoneking, Mark; de Knijff, Peter; Kayser, Manfred; Veldink, Jan H.; van den Berg, Leonard H.; Byelas, Heorhiy; den Dunnen, Johan T.; Dijkstra, Martijn; Amin, Najaf; Joeri van der Velde, K.; van Setten, Jessica; Kattenberg, Mathijs; van Schaik, Barbera D.C.; Bot, Jan; Nijman, Isaäc J.; Mei, Hailiang; Koval, Vyacheslav; Ye, Kai; Lameijer, Eric-Wubbo; Moed, Matthijs H.; Hehir-Kwa, Jayne Y.; Handsaker, Robert E.; Sunyaev, Shamil R.; Sohail, Mashaal; Hormozdiari, Fereydoun; Marschall, Tobias; Schönhuth, Alexander; Guryev, Victor; Suchiman, H. Eka D.; Wolffenbuttel, Bruce H.; Platteel, Mathieu; Pitts, Steven J.; Potluri, Shobha; Cox, David R.; Li, Qibin; Li, Yingrui; Du, Yuanping; Chen, Ruoyan; Cao, Hongzhi; Li, Ning; Cao, Sujie; Wang, Jun; Bovenberg, Jasper A.; Jukema, J. Wouter; van der Harst, Pim; Sijbrands, Eric J.; Hottenga, Jouke-Jan; Uitterlinden, Andre G.; Swertz, Morris A.; van Ommen, Gert-Jan B.; de Bakker, Paul I.W.; Eline Slagboom, P.; Boomsma, Dorret I.; Wijmenga, Cisca; van Duijn, Cornelia M.

    2015-01-01

    Variants associated with blood lipid levels may be population-specific. To identify low-frequency variants associated with this phenotype, population-specific reference panels may be used. Here we impute nine large Dutch biobanks (~35,000 samples) with the population-specific reference panel created by the Genome of the Netherlands Project and perform association testing with blood lipid levels. We report the discovery of five novel associations at four loci (P value <6.61 × 10−4), including a rare missense variant in ABCA6 (rs77542162, p.Cys1359Arg, frequency 0.034), which is predicted to be deleterious. The frequency of this ABCA6 variant is 3.65-fold increased in the Dutch and its effect (βLDL-C=0.135, βTC=0.140) is estimated to be very similar to those observed for single variants in well-known lipid genes, such as LDLR. PMID:25751400

  7. Improved imputation of low-frequency and rare variants using the UK10K haplotype reference panel

    NARCIS (Netherlands)

    J. Huang (Jie); B. Howie (Bryan); S. McCarthy (Shane); Y. Memari (Yasin); K. Walter (Klaudia); J.L. Min (Josine L.); P. Danecek (Petr); G. Malerba (Giovanni); E. Trabetti (Elisabetta); H.-F. Zheng (Hou-Feng); G. Gambaro (Giovanni); J.B. Richards (Brent); R. Durbin (Richard); N.J. Timpson (Nicholas); J. Marchini (Jonathan); N. Soranzo (Nicole); S.H. Al Turki (Saeed); A. Amuzu (Antoinette); C. Anderson (Carl); R. Anney (Richard); D. Antony (Dinu); M.S. Artigas; M. Ayub (Muhammad); S. Bala (Senduran); J.C. Barrett (Jeffrey); I.E. Barroso (Inês); P.L. Beales (Philip); M. Benn (Marianne); J. Bentham (Jamie); S. Bhattacharya (Shoumo); E. Birney (Ewan); D.H.R. Blackwood (Douglas); M. Bobrow (Martin); E. Bochukova (Elena); P.F. Bolton (Patrick F.); R. Bounds (Rebecca); C. Boustred (Chris); G. Breen (Gerome); M. Calissano (Mattia); K. Carss (Keren); J.P. Casas (Juan Pablo); J.C. Chambers (John C.); R. Charlton (Ruth); K. Chatterjee (Krishna); L. Chen (Lu); A. Ciampi (Antonio); S. Cirak (Sebahattin); P. Clapham (Peter); G. Clement (Gail); G. Coates (Guy); M. Cocca (Massimiliano); D.A. Collier (David); C. Cosgrove (Catherine); T. Cox (Tony); N.J. Craddock (Nick); L. Crooks (Lucy); S. Curran (Sarah); D. Curtis (David); A. Daly (Allan); I.N.M. Day (Ian N.M.); A.G. Day-Williams (Aaron); G.V. Dedoussis (George); T. Down (Thomas); Y. Du (Yuanping); C.M. van Duijn (Cornelia); I. Dunham (Ian); T. Edkins (Ted); R. Ekong (Rosemary); P. Ellis (Peter); D.M. Evans (David); I.S. Farooqi (I. Sadaf); D.R. Fitzpatrick (David R.); P. Flicek (Paul); J. Floyd (James); A.R. Foley (A. Reghan); C.S. Franklin (Christopher S.); M. Futema (Marta); L. Gallagher (Louise); P. Gasparini (Paolo); T.R. Gaunt (Tom); M. Geihs (Matthias); D. Geschwind (Daniel); C.M.T. Greenwood (Celia); H. Griffin (Heather); D. Grozeva (Detelina); X. Guo (Xiaosen); X. Guo (Xueqin); H. Gurling (Hugh); D. Hart (Deborah); A.E. Hendricks (Audrey E.); P.A. Holmans (Peter A.); L. Huang (Liren); T. Hubbard (Tim); S.E. Humphries (Steve E.); M.E. Hurles (Matthew); P.G. Hysi (Pirro); V. Iotchkova (Valentina); A. Isaacs (Aaron); D.K. Jackson (David K.); Y. Jamshidi (Yalda); J. Johnson (Jon); C. Joyce (Chris); K.J. Karczewski (Konrad); J. Kaye (Jane); T. Keane (Thomas); J.P. Kemp (John); K. Kennedy (Karen); A. Kent (Alastair); J. Keogh (Julia); F. Khawaja (Farrah); M.E. Kleber (Marcus); M. Van Kogelenberg (Margriet); A. Kolb-Kokocinski (Anja); J.S. Kooner (Jaspal S.); G. Lachance (Genevieve); C. Langenberg (Claudia); C. Langford (Cordelia); D. Lawson (Daniel); I. Lee (Irene); E.M. van Leeuwen (Elisa); M. Lek (Monkol); R. Li (Rui); Y. Li (Yingrui); J. Liang (Jieqin); H. Lin (Hong); R. Liu (Ryan); J. Lönnqvist (Jouko); L.R. Lopes (Luis R.); M.C. Lopes (Margarida); J. Luan; D.G. MacArthur (Daniel G.); M. Mangino (Massimo); G. Marenne (Gaëlle); W. März (Winfried); J. Maslen (John); A. Matchan (Angela); I. Mathieson (Iain); P. McGuffin (Peter); A.M. McIntosh (Andrew); A.G. McKechanie (Andrew G.); A. McQuillin (Andrew); S. Metrustry (Sarah); N. Migone (Nicola); H.M. Mitchison (Hannah M.); A. Moayyeri (Alireza); J. Morris (James); R. Morris (Richard); D. Muddyman (Dawn); F. Muntoni; B.G. Nordestgaard (Børge G.); K. Northstone (Kate); M.C. O'donovan (Michael); S. O'Rahilly (Stephen); A. Onoufriadis (Alexandros); K. Oualkacha (Karim); M.J. Owen (Michael J.); A. Palotie (Aarno); K. Panoutsopoulou (Kalliope); V. Parker (Victoria); J.R. Parr (Jeremy R.); L. Paternoster (Lavinia); T. Paunio (Tiina); F. Payne (Felicity); S.J. Payne (Stewart J.); J.R.B. Perry (John); O.P.H. Pietiläinen (Olli); V. Plagnol (Vincent); R.C. Pollitt (Rebecca C.); S. Povey (Sue); M.A. Quail (Michael A.); L. Quaye (Lydia); L. Raymond (Lucy); K. Rehnström (Karola); C.K. Ridout (Cheryl K.); S.M. Ring (Susan); G.R.S. Ritchie (Graham R.S.); N. Roberts (Nicola); R.L. Robinson (Rachel L.); D.B. Savage (David); P.J. Scambler (Peter); S. Schiffels (Stephan); M. Schmidts (Miriam); N. Schoenmakers (Nadia); R.H. Scott (Richard H.); R.A. Scott (Robert); R.K. Semple (Robert K.); E. Serra (Eva); S.I. Sharp (Sally I.); A.C. Shaw (Adam C.); H.A. Shihab (Hashem A.); S.-Y. Shin (So-Youn); D. Skuse (David); K.S. Small (Kerrin); C. Smee (Carol); G.D. Smith; L. Southam (Lorraine); O. Spasic-Boskovic (Olivera); T.D. Spector (Timothy); D. St. Clair (David); B. St Pourcain (Beate); J. Stalker (Jim); E. Stevens (Elizabeth); J. Sun (Jianping); G. Surdulescu (Gabriela); J. Suvisaari (Jaana); P. Syrris (Petros); I. Tachmazidou (Ioanna); R. Taylor (Rohan); J. Tian (Jing); M.D. Tobin (Martin); D. Toniolo (Daniela); M. Traglia (Michela); A. Tybjaerg-Hansen; A.M. Valdes; A.M. Vandersteen (Anthony M.); A. Varbo (Anette); P. Vijayarangakannan (Parthiban); P.M. Visscher (Peter); L.V. Wain (Louise); J.T. Walters (James); G. Wang (Guangbiao); J. Wang (Jun); Y. Wang (Yu); K. Ward (Kirsten); E. Wheeler (Eleanor); P.H. Whincup (Peter); T. Whyte (Tamieka); H.J. Williams (Hywel J.); K.A. Williamson (Kathleen); C. Wilson (Crispian); S.G. Wilson (Scott); K. Wong (Kim); C. Xu (Changjiang); J. Yang (Jian); G. Zaza (Gianluigi); E. Zeggini (Eleftheria); F. Zhang (Feng); P. Zhang (Pingbo); W. Zhang (Weihua)

    2015-01-01

    textabstractImputing genotypes from reference panels created by whole-genome sequencing (WGS) provides a cost-effective strategy for augmenting the single-nucleotide polymorphism (SNP) content of genome-wide arrays. The UK10K Cohorts project has generated a data set of 3,781 whole genomes sequenced

  8. Improved imputation of low-frequency and rare variants using the UK10K haplotype reference panel

    DEFF Research Database (Denmark)

    Huang, Jie; Howie, Bryan; Mccarthy, Shane

    2015-01-01

    Imputing genotypes from reference panels created by whole-genome sequencing (WGS) provides a cost-effective strategy for augmenting the single-nucleotide polymorphism (SNP) content of genome-wide arrays. The UK10K Cohorts project has generated a data set of 3,781 whole genomes sequenced at low de...

  9. Evaluating Functional Diversity: Missing Trait Data and the Importance of Species Abundance Structure and Data Transformation

    Science.gov (United States)

    Bryndová, Michala; Kasari, Liis; Norberg, Anna; Weiss, Matthias; Bishop, Tom R.; Luke, Sarah H.; Sam, Katerina; Le Bagousse-Pinguet, Yoann; Lepš, Jan; Götzenberger, Lars; de Bello, Francesco

    2016-01-01

    Functional diversity (FD) is an important component of biodiversity that quantifies the difference in functional traits between organisms. However, FD studies are often limited by the availability of trait data and FD indices are sensitive to data gaps. The distribution of species abundance and trait data, and its transformation, may further affect the accuracy of indices when data is incomplete. Using an existing approach, we simulated the effects of missing trait data by gradually removing data from a plant, an ant and a bird community dataset (12, 59, and 8 plots containing 62, 297 and 238 species respectively). We ranked plots by FD values calculated from full datasets and then from our increasingly incomplete datasets and compared the ranking between the original and virtually reduced datasets to assess the accuracy of FD indices when used on datasets with increasingly missing data. Finally, we tested the accuracy of FD indices with and without data transformation, and the effect of missing trait data per plot or per the whole pool of species. FD indices became less accurate as the amount of missing data increased, with the loss of accuracy depending on the index. But, where transformation improved the normality of the trait data, FD values from incomplete datasets were more accurate than before transformation. The distribution of data and its transformation are therefore as important as data completeness and can even mitigate the effect of missing data. Since the effect of missing trait values pool-wise or plot-wise depends on the data distribution, the method should be decided case by case. Data distribution and data transformation should be given more careful consideration when designing, analysing and interpreting FD studies, especially where trait data are missing. To this end, we provide the R package “traitor” to facilitate assessments of missing trait data. PMID:26881747

  10. Imputing historical statistics, soils information, and other land-use data to crop area

    Science.gov (United States)

    Perry, C. R., Jr.; Willis, R. W.; Lautenschlager, L.

    1982-01-01

    In foreign crop condition monitoring, satellite acquired imagery is routinely used. To facilitate interpretation of this imagery, it is advantageous to have estimates of the crop types and their extent for small area units, i.e., grid cells on a map represent, at 60 deg latitude, an area nominally 25 by 25 nautical miles in size. The feasibility of imputing historical crop statistics, soils information, and other ancillary data to crop area for a province in Argentina is studied.

  11. Absence of missing mass in clusters

    Energy Technology Data Exchange (ETDEWEB)

    Fesenko, B.I.

    1981-01-01

    The evaluation of the radial-velocity dispersion in clusters of galaxies is considered by using order statistics to reduce the distortions introduced by foreground and background galaxies and by large errors in velocity measurement. For four nearby clusters of galaxies, including the Coma cluster, velocity dispersions are obtained which are approximately four times lower than previously reported values. It is found that more remote galaxies exhibit the same tendency to a reduction in missing mass. A detailed examination of the statistical properties of galaxies in the Virgo cluster reveals that the cluster might not actually exist, but may be just a large excess in the number density of bright galaxies, possibly the result of an increase in the visibility of objects in the appropriate directions. It is concluded that the presence of a large amount of missing mass in clusters of galaxies is yet to be proven.

  12. Statistical Analysis of a Class: Monte Carlo and Multiple Imputation Spreadsheet Methods for Estimation and Extrapolation

    Science.gov (United States)

    Fish, Laurel J.; Halcoussis, Dennis; Phillips, G. Michael

    2017-01-01

    The Monte Carlo method and related multiple imputation methods are traditionally used in math, physics and science to estimate and analyze data and are now becoming standard tools in analyzing business and financial problems. However, few sources explain the application of the Monte Carlo method for individuals and business professionals who are…

  13. Imputation of single nucleotide polymorhpism genotypes of Hereford cattle: reference panel size, family relationship and population structure

    Science.gov (United States)

    The objective of this study is to investigate single nucleotide polymorphism (SNP) genotypes imputation of Hereford cattle. Purebred Herefords were from two sources, Line 1 Hereford (N=240) and representatives of Industry Herefords (N=311). Using different reference panels of 62 and 494 males with 1...

  14. New or ν missing energy

    DEFF Research Database (Denmark)

    Franzosi, Diogo Buarque; Frandsen, Mads T.; Shoemaker, Ian M.

    2016-01-01

    flavor structures. Monojet data alone can be used to infer the mass of the "missing particle" from the shape of the missing energy distribution. In particular, 13 TeV LHC data will have sensitivity to DM masses greater than $\\sim$ 1 TeV. In addition to the monojet channel, NSI can be probed in multi......Missing energy signals such as monojets are a possible signature of Dark Matter (DM) at colliders. However, neutrino interactions beyond the Standard Model may also produce missing energy signals. In order to conclude that new "missing particles" are observed the hypothesis of BSM neutrino......-lepton searches which we find to yield stronger limits at heavy mediator masses. The sensitivity offered by these multi-lepton channels provide a method to reject or confirm the DM hypothesis in missing energy searches....

  15. Extraction of the missing low-Q data from small-angle scattering data

    International Nuclear Information System (INIS)

    Doherty, G.K.; Poland, G.A.

    1996-01-01

    A new method is presented that recovers the scatter intensity curve at low Q values from small-angle neutron scattering data. The method uses only the measured data, requiring no extrapolation of the scatter curve nor any a priori knowledge of the maximum chord length, radius of gyration or molecular weight of the particle under investigation. It is assumed that the incoherent level would have been extracted from the data in the normal course of events but any errors do not affect the method presented. The distance distribution function of any particle has a restricted extension in real space and any nonzero value beyond the maximum size of the particle is due to the effects of missing data segments and noise on the measured data. The effects due to the missing and noisy data are isolated from the distance distribution of the particle and a suitably scaled template particle is used to fill in the missing distribution data segment. The inverse transform of the new distribution function returns the missing low-Q scatter data and to some extent cancels out the noise. While the method is generally explored using noise-free analytically derived particles, its application to real experimental data is demonstrated. (orig.)

  16. Cohort-specific imputation of gene expression improves prediction of warfarin dose for African Americans.

    Science.gov (United States)

    Gottlieb, Assaf; Daneshjou, Roxana; DeGorter, Marianne; Bourgeois, Stephane; Svensson, Peter J; Wadelius, Mia; Deloukas, Panos; Montgomery, Stephen B; Altman, Russ B

    2017-11-24

    Genome-wide association studies are useful for discovering genotype-phenotype associations but are limited because they require large cohorts to identify a signal, which can be population-specific. Mapping genetic variation to genes improves power and allows the effects of both protein-coding variation as well as variation in expression to be combined into "gene level" effects. Previous work has shown that warfarin dose can be predicted using information from genetic variation that affects protein-coding regions. Here, we introduce a method that improves dose prediction by integrating tissue-specific gene expression. In particular, we use drug pathways and expression quantitative trait loci knowledge to impute gene expression-on the assumption that differential expression of key pathway genes may impact dose requirement. We focus on 116 genes from the pharmacokinetic and pharmacodynamic pathways of warfarin within training and validation sets comprising both European and African-descent individuals. We build gene-tissue signatures associated with warfarin dose in a cohort-specific manner and identify a signature of 11 gene-tissue pairs that significantly augments the International Warfarin Pharmacogenetics Consortium dosage-prediction algorithm in both populations. Our results demonstrate that imputed expression can improve dose prediction and bridge population-specific compositions. MATLAB code is available at https://github.com/assafgo/warfarin-cohort.

  17. 31 CFR 19.630 - May the Department of the Treasury impute conduct of one person to another?

    Science.gov (United States)

    2010-07-01

    ... 31 Money and Finance: Treasury 1 2010-07-01 2010-07-01 false May the Department of the Treasury impute conduct of one person to another? 19.630 Section 19.630 Money and Finance: Treasury Office of the Secretary of the Treasury GOVERNMENTWIDE DEBARMENT AND SUSPENSION (NONPROCUREMENT) General Principles...

  18. The Use of Imputed Sibling Genotypes in Sibship-Based Association Analysis: On Modeling Alternatives, Power and Model Misspecification

    NARCIS (Netherlands)

    Minica, C.C.; Dolan, C.V.; Willemsen, G.; Vink, J.M.; Boomsma, D.I.

    2013-01-01

    When phenotypic, but no genotypic data are available for relatives of participants in genetic association studies, previous research has shown that family-based imputed genotypes can boost the statistical power when included in such studies. Here, using simulations, we compared the performance of

  19. Semiautomatic imputation of activity travel diaries : use of global positioning system traces, prompted recall, and context-sensitive learning algorithms

    NARCIS (Netherlands)

    Moiseeva, A.; Jessurun, A.J.; Timmermans, H.J.P.; Stopher, P.

    2016-01-01

    Anastasia Moiseeva, Joran Jessurun and Harry Timmermans (2010), ‘Semiautomatic Imputation of Activity Travel Diaries: Use of Global Positioning System Traces, Prompted Recall, and Context-Sensitive Learning Algorithms’, Transportation Research Record: Journal of the Transportation Research Board,

  20. Estimating cavity tree and snag abundance using negative binomial regression models and nearest neighbor imputation methods

    Science.gov (United States)

    Bianca N.I. Eskelson; Hailemariam Temesgen; Tara M. Barrett

    2009-01-01

    Cavity tree and snag abundance data are highly variable and contain many zero observations. We predict cavity tree and snag abundance from variables that are readily available from forest cover maps or remotely sensed data using negative binomial (NB), zero-inflated NB, and zero-altered NB (ZANB) regression models as well as nearest neighbor (NN) imputation methods....

  1. Non-imputability, criminal dangerousness and curative safety measures: myths and realities

    Directory of Open Access Journals (Sweden)

    Frank Harbottle Quirós

    2017-04-01

    Full Text Available The curative safety measures are imposed in a criminal proceeding to the non-imputable people provided that through a prognosis it is concluded in an affirmative way about its criminal dangerousness. Although this statement seems very elementary, in judicial practice several myths remain in relation to these legal institutes whose versions may vary, to a greater or lesser extent, between the different countries of the world. In this context, the present article formulates ten myths based on the experience of Costa Rica and provides an explanation that seeks to weaken or knock them down, inviting the reader to reflect on them.

  2. 29 CFR 1471.630 - May the Federal Mediation and Conciliation Service impute conduct of one person to another?

    Science.gov (United States)

    2010-07-01

    ... 29 Labor 4 2010-07-01 2010-07-01 false May the Federal Mediation and Conciliation Service impute...) FEDERAL MEDIATION AND CONCILIATION SERVICE GOVERNMENTWIDE DEBARMENT AND SUSPENSION (NONPROCUREMENT) General Principles Relating to Suspension and Debarment Actions § 1471.630 May the Federal Mediation and...

  3. Analyzing the effect of selected control policy measures and sociodemographic factors on alcoholic beverage consumption in Europe within the AMPHORA project: statistical methods.

    Science.gov (United States)

    Baccini, Michela; Carreras, Giulia

    2014-10-01

    This paper describes the methods used to investigate variations in total alcoholic beverage consumption as related to selected control intervention policies and other socioeconomic factors (unplanned factors) within 12 European countries involved in the AMPHORA project. The analysis presented several critical points: presence of missing values, strong correlation among the unplanned factors, long-term waves or trends in both the time series of alcohol consumption and the time series of the main explanatory variables. These difficulties were addressed by implementing a multiple imputation procedure for filling in missing values, then specifying for each country a multiple regression model which accounted for time trend, policy measures and a limited set of unplanned factors, selected in advance on the basis of sociological and statistical considerations are addressed. This approach allowed estimating the "net" effect of the selected control policies on alcohol consumption, but not the association between each unplanned factor and the outcome.

  4. Restoring method for missing data of spatial structural stress monitoring based on correlation

    Science.gov (United States)

    Zhang, Zeyu; Luo, Yaozhi

    2017-07-01

    Long-term monitoring of spatial structures is of great importance for the full understanding of their performance and safety. The missing part of the monitoring data link will affect the data analysis and safety assessment of the structure. Based on the long-term monitoring data of the steel structure of the Hangzhou Olympic Center Stadium, the correlation between the stress change of the measuring points is studied, and an interpolation method of the missing stress data is proposed. Stress data of correlated measuring points are selected in the 3 months of the season when missing data is required for fitting correlation. Data of daytime and nighttime are fitted separately for interpolation. For a simple linear regression when single point's correlation coefficient is 0.9 or more, the average error of interpolation is about 5%. For multiple linear regression, the interpolation accuracy is not significantly increased after the number of correlated points is more than 6. Stress baseline value of construction step should be calculated before interpolating missing data in the construction stage, and the average error is within 10%. The interpolation error of continuous missing data is slightly larger than that of the discrete missing data. The data missing rate of this method should better not exceed 30%. Finally, a measuring point's missing monitoring data is restored to verify the validity of the method.

  5. Semiparametric Theory and Missing Data

    CERN Document Server

    Tsiatis, Anastasios A

    2006-01-01

    Missing data arise in almost all scientific disciplines. In many cases, missing data in an analysis is treated in a casual and ad-hoc manner, leading to invalid inferences and erroneous conclusions. This book summarizes knowledge regarding the theory of estimation for semiparametric models with missing data.

  6. A Time-Series Water Level Forecasting Model Based on Imputation and Variable Selection Method

    OpenAIRE

    Jun-He Yang; Ching-Hsue Cheng; Chia-Pan Chan

    2017-01-01

    Reservoirs are important for households and impact the national economy. This paper proposed a time-series forecasting model based on estimating a missing value followed by variable selection to forecast the reservoir's water level. This study collected data from the Taiwan Shimen Reservoir as well as daily atmospheric data from 2008 to 2015. The two datasets are concatenated into an integrated dataset based on ordering of the data as a research dataset. The proposed time-series forecasting m...

  7. Universal Linear Fit Identification: A Method Independent of Data, Outliers and Noise Distribution Model and Free of Missing or Removed Data Imputation.

    Science.gov (United States)

    Adikaram, K K L B; Hussein, M A; Effenberger, M; Becker, T

    2015-01-01

    Data processing requires a robust linear fit identification method. In this paper, we introduce a non-parametric robust linear fit identification method for time series. The method uses an indicator 2/n to identify linear fit, where n is number of terms in a series. The ratio Rmax of amax - amin and Sn - amin*n and that of Rmin of amax - amin and amax*n - Sn are always equal to 2/n, where amax is the maximum element, amin is the minimum element and Sn is the sum of all elements. If any series expected to follow y = c consists of data that do not agree with y = c form, Rmax > 2/n and Rmin > 2/n imply that the maximum and minimum elements, respectively, do not agree with linear fit. We define threshold values for outliers and noise detection as 2/n * (1 + k1) and 2/n * (1 + k2), respectively, where k1 > k2 and 0 ≤ k1 ≤ n/2 - 1. Given this relation and transformation technique, which transforms data into the form y = c, we show that removing all data that do not agree with linear fit is possible. Furthermore, the method is independent of the number of data points, missing data, removed data points and nature of distribution (Gaussian or non-Gaussian) of outliers, noise and clean data. These are major advantages over the existing linear fit methods. Since having a perfect linear relation between two variables in the real world is impossible, we used artificial data sets with extreme conditions to verify the method. The method detects the correct linear fit when the percentage of data agreeing with linear fit is less than 50%, and the deviation of data that do not agree with linear fit is very small, of the order of ±10-4%. The method results in incorrect detections only when numerical accuracy is insufficient in the calculation process.

  8. Universal Linear Fit Identification: A Method Independent of Data, Outliers and Noise Distribution Model and Free of Missing or Removed Data Imputation.

    Directory of Open Access Journals (Sweden)

    K K L B Adikaram

    Full Text Available Data processing requires a robust linear fit identification method. In this paper, we introduce a non-parametric robust linear fit identification method for time series. The method uses an indicator 2/n to identify linear fit, where n is number of terms in a series. The ratio Rmax of amax - amin and Sn - amin*n and that of Rmin of amax - amin and amax*n - Sn are always equal to 2/n, where amax is the maximum element, amin is the minimum element and Sn is the sum of all elements. If any series expected to follow y = c consists of data that do not agree with y = c form, Rmax > 2/n and Rmin > 2/n imply that the maximum and minimum elements, respectively, do not agree with linear fit. We define threshold values for outliers and noise detection as 2/n * (1 + k1 and 2/n * (1 + k2, respectively, where k1 > k2 and 0 ≤ k1 ≤ n/2 - 1. Given this relation and transformation technique, which transforms data into the form y = c, we show that removing all data that do not agree with linear fit is possible. Furthermore, the method is independent of the number of data points, missing data, removed data points and nature of distribution (Gaussian or non-Gaussian of outliers, noise and clean data. These are major advantages over the existing linear fit methods. Since having a perfect linear relation between two variables in the real world is impossible, we used artificial data sets with extreme conditions to verify the method. The method detects the correct linear fit when the percentage of data agreeing with linear fit is less than 50%, and the deviation of data that do not agree with linear fit is very small, of the order of ±10-4%. The method results in incorrect detections only when numerical accuracy is insufficient in the calculation process.

  9. Dealing with missing data in remote sensing images within land and crop classification

    Science.gov (United States)

    Skakun, Sergii; Kussul, Nataliia; Basarab, Ruslan

    Optical remote sensing images from space provide valuable data for environmental monitoring, disaster management [1], agriculture mapping [2], so forth. In many cases, a time-series of satellite images is used to discriminate or estimate particular land parameters. One of the factors that influence the efficiency of satellite imagery is the presence of clouds. This leads to the occurrence of missing data that need to be addressed. Numerous approaches have been proposed to fill in missing data (or gaps) and can be categorized into inpainting-based, multispectral-based, and multitemporal-based. In [3], ancillary MODIS data are utilized for filling gaps and predicting Landsat data. In this paper we propose to use self-organizing Kohonen maps (SOMs) for missing data restoration in time-series of satellite imagery. Such approach was previously used for MODIS data [4], but applying this approach for finer spatial resolution data such as Sentinel-2 and Landsat-8 represents a challenge. Moreover, data for training the SOMs are selected manually in [4] that complicates the use of the method in an automatic mode. SOM is a type of artificial neural network that is trained using unsupervised learning to produce a discretised representation of the input space of the training samples, called a map. The map seeks to preserve the topological properties of the input space. The reconstruction of satellite images is performed for each spectral band separately, i.e. a separate SOM is trained for each spectral band. Pixels that have no missing values in the time-series are selected for training. Selecting the number of training pixels represent a trade-off, in particular increasing the number of training samples will lead to the increased time of SOM training while increasing the quality of restoration. Also, training data sets should be selected automatically. As such, we propose to select training samples on a regular grid of pixels. Therefore, the SOM seeks to project a large number

  10. Cohort-specific imputation of gene expression improves prediction of warfarin dose for African Americans

    Directory of Open Access Journals (Sweden)

    Assaf Gottlieb

    2017-11-01

    Full Text Available Abstract Background Genome-wide association studies are useful for discovering genotype–phenotype associations but are limited because they require large cohorts to identify a signal, which can be population-specific. Mapping genetic variation to genes improves power and allows the effects of both protein-coding variation as well as variation in expression to be combined into “gene level” effects. Methods Previous work has shown that warfarin dose can be predicted using information from genetic variation that affects protein-coding regions. Here, we introduce a method that improves dose prediction by integrating tissue-specific gene expression. In particular, we use drug pathways and expression quantitative trait loci knowledge to impute gene expression—on the assumption that differential expression of key pathway genes may impact dose requirement. We focus on 116 genes from the pharmacokinetic and pharmacodynamic pathways of warfarin within training and validation sets comprising both European and African-descent individuals. Results We build gene-tissue signatures associated with warfarin dose in a cohort-specific manner and identify a signature of 11 gene-tissue pairs that significantly augments the International Warfarin Pharmacogenetics Consortium dosage-prediction algorithm in both populations. Conclusions Our results demonstrate that imputed expression can improve dose prediction and bridge population-specific compositions. MATLAB code is available at https://github.com/assafgo/warfarin-cohort

  11. Impute DC link (IDCL) cell based power converters and control thereof

    Science.gov (United States)

    Divan, Deepakraj M.; Prasai, Anish; Hernendez, Jorge; Moghe, Rohit; Iyer, Amrit; Kandula, Rajendra Prasad

    2016-04-26

    Power flow controllers based on Imputed DC Link (IDCL) cells are provided. The IDCL cell is a self-contained power electronic building block (PEBB). The IDCL cell may be stacked in series and parallel to achieve power flow control at higher voltage and current levels. Each IDCL cell may comprise a gate drive, a voltage sharing module, and a thermal management component in order to facilitate easy integration of the cell into a variety of applications. By providing direct AC conversion, the IDCL cell based AC/AC converters reduce device count, eliminate the use of electrolytic capacitors that have life and reliability issues, and improve system efficiency compared with similarly rated back-to-back inverter system.

  12. Identification of significant features by the Global Mean Rank test.

    Science.gov (United States)

    Klammer, Martin; Dybowski, J Nikolaj; Hoffmann, Daniel; Schaab, Christoph

    2014-01-01

    With the introduction of omics-technologies such as transcriptomics and proteomics, numerous methods for the reliable identification of significantly regulated features (genes, proteins, etc.) have been developed. Experimental practice requires these tests to successfully deal with conditions such as small numbers of replicates, missing values, non-normally distributed expression levels, and non-identical distributions of features. With the MeanRank test we aimed at developing a test that performs robustly under these conditions, while favorably scaling with the number of replicates. The test proposed here is a global one-sample location test, which is based on the mean ranks across replicates, and internally estimates and controls the false discovery rate. Furthermore, missing data is accounted for without the need of imputation. In extensive simulations comparing MeanRank to other frequently used methods, we found that it performs well with small and large numbers of replicates, feature dependent variance between replicates, and variable regulation across features on simulation data and a recent two-color microarray spike-in dataset. The tests were then used to identify significant changes in the phosphoproteomes of cancer cells induced by the kinase inhibitors erlotinib and 3-MB-PP1 in two independently published mass spectrometry-based studies. MeanRank outperformed the other global rank-based methods applied in this study. Compared to the popular Significance Analysis of Microarrays and Linear Models for Microarray methods, MeanRank performed similar or better. Furthermore, MeanRank exhibits more consistent behavior regarding the degree of regulation and is robust against the choice of preprocessing methods. MeanRank does not require any imputation of missing values, is easy to understand, and yields results that are easy to interpret. The software implementing the algorithm is freely available for academic and commercial use.

  13. Prediction of Missing Streamflow Data using Principle of Information Entropy

    Directory of Open Access Journals (Sweden)

    Santosa, B.

    2014-01-01

    Full Text Available Incomplete (missing of streamflow data often occurs. This can be caused by a not continous data recording or poor storage. In this study, missing consecutive streamflow data are predicted using the principle of information entropy. Predictions are performed ​​using the complete monthly streamflow information from the nearby river. Data on average monthly streamflow used as a simulation sample are taken from observation stations Katulampa, Batubeulah, and Genteng, which are the Ciliwung Cisadane river areas upstream. The simulated prediction of missing streamflow data in 2002 and 2003 at Katulampa Station are based on information from Genteng Station, and Batubeulah Station. The mean absolute error (MAE average obtained was 0,20 and 0,21 in 2002 and the MAE average in 2003 was 0,12 and 0,16. Based on the value of the error and pattern of filled gaps, this method has the potential to be developed further.

  14. Single-nucleotide polymorphism discovery by high-throughput sequencing in sorghum

    Directory of Open Access Journals (Sweden)

    White Frank F

    2011-07-01

    Full Text Available Abstract Background Eight diverse sorghum (Sorghum bicolor L. Moench accessions were subjected to short-read genome sequencing to characterize the distribution of single-nucleotide polymorphisms (SNPs. Two strategies were used for DNA library preparation. Missing SNP genotype data were imputed by local haplotype comparison. The effect of library type and genomic diversity on SNP discovery and imputation are evaluated. Results Alignment of eight genome equivalents (6 Gb to the public reference genome revealed 283,000 SNPs at ≥82% confirmation probability. Sequencing from libraries constructed to limit sequencing to start at defined restriction sites led to genotyping 10-fold more SNPs in all 8 accessions, and correctly imputing 11% more missing data, than from semirandom libraries. The SNP yield advantage of the reduced-representation method was less than expected, since up to one fifth of reads started at noncanonical restriction sites and up to one third of restriction sites predicted in silico to yield unique alignments were not sampled at near-saturation. For imputation accuracy, the availability of a genomically similar accession in the germplasm panel was more important than panel size or sequencing coverage. Conclusions A sequence quantity of 3 million 50-base reads per accession using a BsrFI library would conservatively provide satisfactory genotyping of 96,000 sorghum SNPs. For most reliable SNP-genotype imputation in shallowly sequenced genomes, germplasm panels should consist of pairs or groups of genomically similar entries. These results may help in designing strategies for economical genotyping-by-sequencing of large numbers of plant accessions.

  15. Miss Lora juveelikauplus = Miss Lora jewellery store

    Index Scriptorium Estoniae

    2009-01-01

    Narvas Fama kaubanduskeskuses (Tallinna mnt. 19c) asuva juveelikaupluse Miss Lora sisekujundusest. Sisearhitektid Annes Arro ja Hanna Karits. Poe sisu - vitriinkapid, vaip, valgustid - on valmistatud eritellimusel. Sisearhitektide tähtsamate tööde loetelu

  16. Penalized regression procedures for variable selection in the potential outcomes framework.

    Science.gov (United States)

    Ghosh, Debashis; Zhu, Yeying; Coffman, Donna L

    2015-05-10

    A recent topic of much interest in causal inference is model selection. In this article, we describe a framework in which to consider penalized regression approaches to variable selection for causal effects. The framework leads to a simple 'impute, then select' class of procedures that is agnostic to the type of imputation algorithm as well as penalized regression used. It also clarifies how model selection involves a multivariate regression model for causal inference problems and that these methods can be applied for identifying subgroups in which treatment effects are homogeneous. Analogies and links with the literature on machine learning methods, missing data, and imputation are drawn. A difference least absolute shrinkage and selection operator algorithm is defined, along with its multiple imputation analogs. The procedures are illustrated using a well-known right-heart catheterization dataset. Copyright © 2015 John Wiley & Sons, Ltd.

  17. GRIMP: A web- and grid-based tool for high-speed analysis of large-scale genome-wide association using imputed data.

    NARCIS (Netherlands)

    K. Estrada Gil (Karol); A. Abuseiris (Anis); F.G. Grosveld (Frank); A.G. Uitterlinden (André); T.A. Knoch (Tobias); F. Rivadeneira Ramirez (Fernando)

    2009-01-01

    textabstractThe current fast growth of genome-wide association studies (GWAS) combined with now common computationally expensive imputation requires the online access of large user groups to high-performance computing resources capable of analyzing rapidly and efficiently millions of genetic

  18. Who cares and how much? The imputed economic contribution to the Canadian healthcare system of middle-aged and older unpaid caregivers providing care to the elderly.

    Science.gov (United States)

    Hollander, Marcus J; Liu, Guiping; Chappell, Neena L

    2009-01-01

    Canadians provide significant amounts of unpaid care to elderly family members and friends with long-term health problems. While some information is available on the nature of the tasks unpaid caregivers perform, and the amounts of time they spend on these tasks, the contribution of unpaid caregivers is often hidden. (It is recognized that some caregiving may be for short periods of time or may entail matters better described as "help" or "assistance," such as providing transportation. However, we use caregiving to cover the full range of unpaid care provided from some basic help to personal care.) Aggregate estimates of the market costs to replace the unpaid care provided are important to governments for policy development as they provide a means to situate the contributions of unpaid caregivers within Canada's healthcare system. The purpose of this study was to obtain an assessment of the imputed costs of replacing the unpaid care provided by Canadians to the elderly. (Imputed costs is used to refer to costs that would be incurred if the care provided by an unpaid caregiver was, instead, provided by a paid caregiver, on a direct hour-for-hour substitution basis.) The economic value of unpaid care as understood in this study is defined as the cost to replace the services provided by unpaid caregivers at rates for paid care providers.

  19. Pragmatic criteria of the definition of neonatal near miss: a comparative study.

    Science.gov (United States)

    Kale, Pauline Lorena; Jorge, Maria Helena Prado de Mello; Laurenti, Ruy; Fonseca, Sandra Costa; Silva, Kátia Silveira da

    2017-12-04

    The objective of this study was to test the validity of the pragmatic criteria of the definitions of neonatal near miss, extending them throughout the infant period, and to estimate the indicators of perinatal care in public maternity hospitals. A cohort of live births from six maternity hospitals in the municipalities of São Paulo, Niterói, and Rio de Janeiro, Brazil, was carried out in 2011. We carried out interviews and checked prenatal cards and medical records. We compared the pragmatic criteria (birth weight, gestational age, and 5' Apgar score) of the definitions of near miss of Pileggi et al., Pileggi-Castro et al., Souza et al., and Silva et al. We calculated sensitivity, specificity (gold standard: infant mortality), percentage of deaths among newborns with life-threatening conditions, and rates of near miss, mortality, and severe outcomes per 1,000 live births. A total 7,315 newborns were analyzed (completeness of information > 99%). The sensitivity of the definition of Pileggi-Castro et al. was higher, resulting in a higher number of cases of near miss, Souza et al. presented lower value, and Pileggi et al. and de Silva et al. presented intermediate values. There is an increase in sensitivity when the period goes from 0-6 to 0-27 days, and there is a decrease when it goes to 0-364 days. Specificities were high (≥ 97%) and above sensitivities (54% to 77%). One maternity hospital in São Paulo and one in Niterói presented, respectively, the lowest and highest rates of infant mortality, near miss, and frequency of births with life-threatening conditions, regardless of the definition. The definitions of near miss based exclusively on pragmatic criteria are valid and can be used for monitoring purposes. Based on the perinatal literature, the cutoff points adopted by Silva et al. were more appropriate. Periodic studies could apply a more complete definition, incorporating clinical, laboratory, and management criteria, including congenital anomalies

  20. Mapping wildland fuels and forest structure for land management: a comparison of nearest neighbor imputation and other methods

    Science.gov (United States)

    Kenneth B. Pierce; Janet L. Ohmann; Michael C. Wimberly; Matthew J. Gregory; Jeremy S. Fried

    2009-01-01

    Land managers need consistent information about the geographic distribution of wildland fuels and forest structure over large areas to evaluate fire risk and plan fuel treatments. We compared spatial predictions for 12 fuel and forest structure variables across three regions in the western United States using gradient nearest neighbor (GNN) imputation, linear models (...

  1. Missing outcome data in a naturalistic psychodynamic group therapy study

    DEFF Research Database (Denmark)

    Jensen, Hans Henrik; Mortensen, Erik Lykke; Lotz, Martin

    2017-01-01

    might have had a reliable improvement in GSI. The SPSS standard statistical imputations procedure estimated that 48.6% of the patients reliably improved in GSI, and 50.2% when therapist evaluations were not included. It is concluded that therapist evaluations are essential in order to avoid bias...

  2. Impact of measles supplementary immunization activities on reaching children missed by routine programs.

    Science.gov (United States)

    Portnoy, Allison; Jit, Mark; Helleringer, Stéphane; Verguet, Stéphane

    2018-01-02

    Measles supplementary immunization activities (SIAs) are vaccination campaigns that supplement routine vaccination programs with a recommended second dose opportunity to children of different ages regardless of their previous history of measles vaccination. They are conducted every 2-4 years and over a few weeks in many low- and middle-income countries. While SIAs have high vaccination coverage, it is unclear whether they reach the children who miss their routine measles vaccine dose. Determining who is reached by SIAs is vital to understanding their effectiveness, as well as measure progress towards measles control. We examined SIAs in low- and middle-income countries from 2000 to 2014 using data from the Demographic and Health Surveys. Conditional on a child's routine measles vaccination status, we examined whether children participated in the most recent measles SIA. The average proportion of zero-dose children (no previous routine measles vaccination defined as no vaccination date before the SIA) reached by SIAs across 14 countries was 66%, ranging from 28% in São Tomé and Príncipe to 91% in Nigeria. However, when also including all children with routine measles vaccination data, this proportion decreased to 12% and to 58% when imputing data for children with vaccination reported by the mother and vaccination marks on the vaccination card across countries. Overall, the proportions of zero-dose children reached by SIAs declined with increasing household wealth. Some countries appeared to reach a higher proportion of zero-dose children using SIAs than others, with proportions reached varying according to the definition of measles vaccination (e.g., vaccination dates on the vaccination card, vaccination marks on the vaccination card, and/or self-reported data). This suggests that some countries could improve their targeting of SIAs to children who miss other measles vaccine opportunities. Across all countries, SIAs played an important role in reaching

  3. Inference of haplotypic phase and missing genotypes in polyploid organisms and variable copy number genomic regions

    Directory of Open Access Journals (Sweden)

    Balding David J

    2008-12-01

    Full Text Available Abstract Background The power of haplotype-based methods for association studies, identification of regions under selection, and ancestral inference, is well-established for diploid organisms. For polyploids, however, the difficulty of determining phase has limited such approaches. Polyploidy is common in plants and is also observed in animals. Partial polyploidy is sometimes observed in humans (e.g. trisomy 21; Down's syndrome, and it arises more frequently in some human tissues. Local changes in ploidy, known as copy number variations (CNV, arise throughout the genome. Here we present a method, implemented in the software polyHap, for the inference of haplotype phase and missing observations from polyploid genotypes. PolyHap allows each individual to have a different ploidy, but ploidy cannot vary over the genomic region analysed. It employs a hidden Markov model (HMM and a sampling algorithm to infer haplotypes jointly in multiple individuals and to obtain a measure of uncertainty in its inferences. Results In the simulation study, we combine real haplotype data to create artificial diploid, triploid, and tetraploid genotypes, and use these to demonstrate that polyHap performs well, in terms of both switch error rate in recovering phase and imputation error rate for missing genotypes. To our knowledge, there is no comparable software for phasing a large, densely genotyped region of chromosome from triploids and tetraploids, while for diploids we found polyHap to be more accurate than fastPhase. We also compare the results of polyHap to SATlotyper on an experimentally haplotyped tetraploid dataset of 12 SNPs, and show that polyHap is more accurate. Conclusion With the availability of large SNP data in polyploids and CNV regions, we believe that polyHap, our proposed method for inferring haplotypic phase from genotype data, will be useful in enabling researchers analysing such data to exploit the power of haplotype-based analyses.

  4. Accuracy of genomic selection for alfalfa biomass yield in different reference populations.

    Science.gov (United States)

    Annicchiarico, Paolo; Nazzicari, Nelson; Li, Xuehui; Wei, Yanling; Pecetti, Luciano; Brummer, E Charles

    2015-12-01

    Genomic selection based on genotyping-by-sequencing (GBS) data could accelerate alfalfa yield gains, if it displayed moderate ability to predict parent breeding values. Its interest would be enhanced by predicting ability also for germplasm/reference populations other than those for which it was defined. Predicting accuracy may be influenced by statistical models, SNP calling procedures and missing data imputation strategies. Landrace and variety material from two genetically-contrasting reference populations, i.e., 124 elite genotypes adapted to the Po Valley (sub-continental climate; PV population) and 154 genotypes adapted to Mediterranean-climate environments (Me population), were genotyped by GBS and phenotyped in separate environments for dry matter yield of their dense-planted half-sib progenies. Both populations showed no sub-population genetic structure. Predictive accuracy was higher by joint rather than separate SNP calling for the two data sets, and using random forest imputation of missing data. Highest accuracy was obtained using Support Vector Regression (SVR) for PV, and Ridge Regression BLUP and SVR for Me germplasm. Bayesian methods (Bayes A, Bayes B and Bayesian Lasso) tended to be less accurate. Random Forest Regression was the least accurate model. Accuracy attained about 0.35 for Me in the range of 0.30-0.50 missing data, and 0.32 for PV at 0.50 missing data, using at least 10,000 SNP markers. Cross-population predictions based on a smaller subset of common SNPs implied a relative loss of accuracy of about 25% for Me and 30% for PV. Genome-wide association analyses based on large subsets of M. truncatula-aligned markers revealed many SNPs with modest association with yield, and some genome areas hosting putative QTLs. A comparison of genomic vs. conventional selection for parent breeding value assuming 1-year vs. 5-year selection cycles, respectively, indicated over three-fold greater predicted yield gain per unit time for genomic selection

  5. Genomic evaluations with many more genotypes

    Directory of Open Access Journals (Sweden)

    Wiggans George R

    2011-03-01

    Full Text Available Abstract Background Genomic evaluations in Holstein dairy cattle have quickly become more reliable over the last two years in many countries as more animals have been genotyped for 50,000 markers. Evaluations can also include animals genotyped with more or fewer markers using new tools such as the 777,000 or 2,900 marker chips recently introduced for cattle. Gains from more markers can be predicted using simulation, whereas strategies to use fewer markers have been compared using subsets of actual genotypes. The overall cost of selection is reduced by genotyping most animals at less than the highest density and imputing their missing genotypes using haplotypes. Algorithms to combine different densities need to be efficient because numbers of genotyped animals and markers may continue to grow quickly. Methods Genotypes for 500,000 markers were simulated for the 33,414 Holsteins that had 50,000 marker genotypes in the North American database. Another 86,465 non-genotyped ancestors were included in the pedigree file, and linkage disequilibrium was generated directly in the base population. Mixed density datasets were created by keeping 50,000 (every tenth of the markers for most animals. Missing genotypes were imputed using a combination of population haplotyping and pedigree haplotyping. Reliabilities of genomic evaluations using linear and nonlinear methods were compared. Results Differing marker sets for a large population were combined with just a few hours of computation. About 95% of paternal alleles were determined correctly, and > 95% of missing genotypes were called correctly. Reliability of breeding values was already high (84.4% with 50,000 simulated markers. The gain in reliability from increasing the number of markers to 500,000 was only 1.6%, but more than half of that gain resulted from genotyping just 1,406 young bulls at higher density. Linear genomic evaluations had reliabilities 1.5% lower than the nonlinear evaluations with 50

  6. Forecasting air quality time series using deep learning.

    Science.gov (United States)

    Freeman, Brian S; Taylor, Graham; Gharabaghi, Bahram; Thé, Jesse

    2018-04-13

    This paper presents one of the first applications of deep learning (DL) techniques to predict air pollution time series. Air quality management relies extensively on time series data captured at air monitoring stations as the basis of identifying population exposure to airborne pollutants and determining compliance with local ambient air standards. In this paper, 8 hr averaged surface ozone (O 3 ) concentrations were predicted using deep learning consisting of a recurrent neural network (RNN) with long short-term memory (LSTM). Hourly air quality and meteorological data were used to train and forecast values up to 72 hours with low error rates. The LSTM was able to forecast the duration of continuous O 3 exceedances as well. Prior to training the network, the dataset was reviewed for missing data and outliers. Missing data were imputed using a novel technique that averaged gaps less than eight time steps with incremental steps based on first-order differences of neighboring time periods. Data were then used to train decision trees to evaluate input feature importance over different time prediction horizons. The number of features used to train the LSTM model was reduced from 25 features to 5 features, resulting in improved accuracy as measured by Mean Absolute Error (MAE). Parameter sensitivity analysis identified look-back nodes associated with the RNN proved to be a significant source of error if not aligned with the prediction horizon. Overall, MAE's less than 2 were calculated for predictions out to 72 hours. Novel deep learning techniques were used to train an 8-hour averaged ozone forecast model. Missing data and outliers within the captured data set were replaced using a new imputation method that generated calculated values closer to the expected value based on the time and season. Decision trees were used to identify input variables with the greatest importance. The methods presented in this paper allow air managers to forecast long range air pollution

  7. Universal Linear Fit Identification: A Method Independent of Data, Outliers and Noise Distribution Model and Free of Missing or Removed Data Imputation

    Science.gov (United States)

    Adikaram, K. K. L. B.; Becker, T.

    2015-01-01

    Data processing requires a robust linear fit identification method. In this paper, we introduce a non-parametric robust linear fit identification method for time series. The method uses an indicator 2/n to identify linear fit, where n is number of terms in a series. The ratio R max of a max − a min and S n − a min *n and that of R min of a max − a min and a max *n − S n are always equal to 2/n, where a max is the maximum element, a min is the minimum element and S n is the sum of all elements. If any series expected to follow y = c consists of data that do not agree with y = c form, R max > 2/n and R min > 2/n imply that the maximum and minimum elements, respectively, do not agree with linear fit. We define threshold values for outliers and noise detection as 2/n * (1 + k 1 ) and 2/n * (1 + k 2 ), respectively, where k 1 > k 2 and 0 ≤ k 1 ≤ n/2 − 1. Given this relation and transformation technique, which transforms data into the form y = c, we show that removing all data that do not agree with linear fit is possible. Furthermore, the method is independent of the number of data points, missing data, removed data points and nature of distribution (Gaussian or non-Gaussian) of outliers, noise and clean data. These are major advantages over the existing linear fit methods. Since having a perfect linear relation between two variables in the real world is impossible, we used artificial data sets with extreme conditions to verify the method. The method detects the correct linear fit when the percentage of data agreeing with linear fit is less than 50%, and the deviation of data that do not agree with linear fit is very small, of the order of ±10−4%. The method results in incorrect detections only when numerical accuracy is insufficient in the calculation process. PMID:26571035

  8. Time perspective and well-being: Swedish survey questionnaires and data.

    Science.gov (United States)

    Garcia, Danilo; Nima, Ali Al; Lindskär, Erik

    2016-12-01

    The data pertains 448 Swedes' responses to questionnaires on time perspective (Zimbardo Time Perspective Inventory), temporal life satisfaction (Temporal Satisfaction with Life Scale), affect (Positive Affect and Negative Affect Schedule), and psychological well-being (Ryff׳s Scales of Psychological Well-Being-short version). The data was collected among university students and individuals at a training facility (see U. Sailer, P. Rosenberg, A.A. Nima, A. Gamble, T. Gärling, T. Archer, D. Garcia, 2014; [1]). Since there were no differences in any of the other background variables, but exercise frequency, all subsequent analyses were conducted on the 448 participants as one single sample. In this article we include the Swedish versions of the questionnaires used to operationalize the time perspective and well-being variables. The data is available, SPSS file, as Supplementary material in this article. We used the Expectation-Maximization Algorithm to input missing values. Little׳s Chi-Square test for Missing Completely at Random showed a χ (2)=67.25 (df=53, p=.09) for men and χ (2)=77.65 (df=72, p=.31) for women. These values suggested that the Expectation-Maximization Algorithm was suitable to use on this data for missing data imputation.

  9. Big Data in Market Research: Why More Data Does Not Automatically Mean Better Information

    Directory of Open Access Journals (Sweden)

    Bosch Volker

    2016-11-01

    Full Text Available Big data will change market research at its core in the long term because consumption of products and media can be logged electronically more and more, making it measurable on a large scale. Unfortunately, big data datasets are rarely representative, even if they are huge. Smart algorithms are needed to achieve high precision and prediction quality for digital and non-representative approaches. Also, big data can only be processed with complex and therefore error-prone software, which leads to measurement errors that need to be corrected. Another challenge is posed by missing but critical variables. The amount of data can indeed be overwhelming, but it often lacks important information. The missing observations can only be filled in by using statistical data imputation. This requires an additional data source with the additional variables, for example a panel. Linear imputation is a statistical procedure that is anything but trivial. It is an instrument to “transport information,” and the higher the observed data correlates with the data to be imputed, the better it works. It makes structures visible even if the depth of the data is limited.

  10. Missed hormonal contraceptives: new recommendations.

    Science.gov (United States)

    Guilbert, Edith; Black, Amanda; Dunn, Sheila; Senikas, Vyta

    2008-11-01

    To provide evidence-based guidance for women and their health care providers on the management of missed or delayed hormonal contraceptive doses in order to prevent unintended pregnancy. Medline, PubMed, and the Cochrane Database were searched for articles published in English, from 1974 to 2007, about hormonal contraceptive methods that are available in Canada and that may be missed or delayed. Relevant publications and position papers from appropriate reproductive health and family planning organizations were also reviewed. The quality of evidence is rated using the criteria developed by the Canadian Task Force on Preventive Health Care. This committee opinion will help health care providers offer clear information to women who have not been adherent in using hormonal contraception with the purpose of preventing unintended pregnancy. The Society of Obstetricians and Gynaecologists of Canada. SUMMARY STATEMENTS: 1. Instructions for what women should do when they miss hormonal contraception have been complex and women do not understand them correctly. (I) 2. The highest risk of ovulation occurs when the hormone-free interval is prolonged for more than seven days, either by delaying the start of combined hormonal contraceptives or by missing active hormone doses during the first or third weeks of combined oral contraceptives. (II) Ovulation rarely occurs after seven consecutive days of combined oral contraceptive use. (II) RECOMMENDATIONS: 1. Health care providers should give clear, simple instructions, both written and oral, on missed hormonal contraceptive pills as part of contraceptive counselling. (III-A) 2. Health care providers should provide women with telephone/electronic resources for reference in the event of missed or delayed hormonal contraceptives. (III-A) 3. In order to avoid an increased risk of unintended pregnancy, the hormone-free interval should not exceed seven days in combined hormonal contraceptive users. (II-A) 4. Back-up contraception should

  11. The prevention and handling of the missing data

    OpenAIRE

    Kang, Hyun

    2013-01-01

    Even in a well-designed and controlled study, missing data occurs in almost all research. Missing data can reduce the statistical power of a study and can produce biased estimates, leading to invalid conclusions. This manuscript reviews the problems and types of missing data, along with the techniques for handling missing data. The mechanisms by which missing data occurs are illustrated, and the methods for handling the missing data are discussed. The paper concludes with recommendations for ...

  12. Dental health state utility values associated with tooth loss in two contrasting cultures.

    Science.gov (United States)

    Nassani, M Z; Locker, D; Elmesallati, A A; Devlin, H; Mohammadi, T M; Hajizamani, A; Kay, E J

    2009-08-01

    The study aimed to assess the value placed on oral health states by measuring the utility of mouths in which teeth had been lost and to explore variations in utility values within and between two contrasting cultures, UK and Iran. One hundred and fifty eight patients, 84 from UK and 74 from Iran, were recruited from clinics at University-based faculties of dentistry. All had experienced tooth loss and had restored or unrestored dental spaces. They were presented with 19 different scenarios of mouths with missing teeth. Fourteen involved the loss of one tooth and five involved shortened dental arches (SDAs) with varying numbers of missing posterior teeth. Each written description was accompanied by a verbal explanation and digital pictures of mouth models. Participants were asked to indicate on a standardized Visual Analogue Scale how they would value the health of their mouth if they had lost the tooth/teeth described and the resulting space was left unrestored. With a utility value of 0.0 representing the worst possible health state for a mouth and 1.0 representing the best, the mouth with the upper central incisor missing attracted the lowest utility value in both samples (UK = 0.16; Iran = 0.06), while the one with a missing upper second molar the highest utility values (0.42, 0.39 respectively). In both countries the utility value increased as the tooth in the scenario moved from the anterior towards the posterior aspect of the mouth. There were significant differences in utility values between UK and Iranian samples for four scenarios all involving the loss of anterior teeth. These differences remained after controlling for gender, age and the state of the dentition. With respect to the SDA scenarios, a mouth with a SDA with only the second molar teeth missing in all quadrants attracted the highest utility values, while a mouth with an extreme SDA with both missing molar and premolar teeth in all quadrants attracted the lowest utility values. The study

  13. Astronaut Preflight Cardiovascular Variables Associated with Vascular Compliance are Highly Correlated with Post-Flight Eye Outcome Measures in the Visual Impairment Intracranial Pressure (VIIP) Syndrome Following Long Duration Spaceflight

    Science.gov (United States)

    Otto, Christian; Ploutz-Snyder, R.

    2015-01-01

    The detection of the first VIIP case occurred in 2005, and adequate eye outcome measures were available for 31 (67.4%) of the 46 long duration US crewmembers who had flown on the ISS since its first crewed mission in 2000. Therefore, this analysis is limited to a subgroup (22 males and 9 females). A "cardiovascular profile" for each astronaut was compiled by examining twelve individual parameters; eleven of these were preflight variables: systolic blood pressure, pulse pressure, body mass index, percentage body fat, LDL, HDL, triglycerides, use of anti-lipid medication, fasting serum glucose, and maximal oxygen uptake in ml/kg. Each of these variables was averaged across three preflight annual physical exams. Astronaut age prior to the long duration mission, and inflight salt intake was also included in the analysis. The group of cardiovascular variables for each crew member was compared with seven VIIP eye outcome variables collected during the immediate post-flight period: anterior-posterior axial length of the globe measured by ultrasound and optical biometry; optic nerve sheath diameter, optic nerve diameter, and optic nerve to sheath ratio- each measured by ultrasound and magnetic resonance imaging (MRI), intraocular pressure (IOP), change in manifest refraction, mean retinal nerve fiber layer (RNFL) on optical coherence tomography (OCT), and RNFL of the inferior and superior retinal quadrants. Since most of the VIIP eye outcome measures were added sequentially beginning in 2005, as knowledge of the syndrome improved, data were unavailable for 22.0% of the outcome measurements. To address the missing data, we employed multivariate multiple imputation techniques with predictive mean matching methods to accumulate 200 separate imputed datasets for analysis. We were able to impute data for the 22.0% of missing VIIP eye outcomes. We then applied Rubin's rules for collapsing the statistical results across our 200 multiply imputed data sets to assess the canonical

  14. Tolerance to missing data using a likelihood ratio based classifier for computer-aided classification of breast cancer

    International Nuclear Information System (INIS)

    Bilska-Wolak, Anna O; Floyd, Carey E Jr

    2004-01-01

    While mammography is a highly sensitive method for detecting breast tumours, its ability to differentiate between malignant and benign lesions is low, which may result in as many as 70% of unnecessary biopsies. The purpose of this study was to develop a highly specific computer-aided diagnosis algorithm to improve classification of mammographic masses. A classifier based on the likelihood ratio was developed to accommodate cases with missing data. Data for development included 671 biopsy cases (245 malignant), with biopsy-proved outcome. Sixteen features based on the BI-RADS TM lexicon and patient history had been recorded for the cases, with 1.3 ± 1.1 missing feature values per case. Classifier evaluation methods included receiver operating characteristic and leave-one-out bootstrap sampling. The classifier achieved 32% specificity at 100% sensitivity on the 671 cases with 16 features that had missing values. Utilizing just the seven features present for all cases resulted in decreased performance at 100% sensitivity with average 19% specificity. No cases and no feature data were omitted during classifier development, showing that it is more beneficial to utilize cases with missing values than to discard incomplete cases that cannot be handled by many algorithms. Classification of mammographic masses was commendable at high sensitivity levels, indicating that benign cases could be potentially spared from biopsy

  15. Value and depreciation of mineral resources over the very long run: An empirical contrast of different methods

    OpenAIRE

    Rubio Varas, M. del Mar

    2005-01-01

    The paper contrasts empirically the results of alternative methods for estimating the value and the depreciation of mineral resources. The historical data of Mexico and Venezuela, covering the period 1920s-1980s, is used to contrast the results of several methods. These are the present value, the net price method, the user cost method and the imputed income method. The paper establishes that the net price and the user cost are not competing methods as such, but alternative adjustments to diff...

  16. 21 CFR 1404.630 - May the Office of National Drug Control Policy impute conduct of one person to another?

    Science.gov (United States)

    2010-04-01

    ... 21 Food and Drugs 9 2010-04-01 2010-04-01 false May the Office of National Drug Control Policy impute conduct of one person to another? 1404.630 Section 1404.630 Food and Drugs OFFICE OF NATIONAL DRUG CONTROL POLICY GOVERNMENTWIDE DEBARMENT AND SUSPENSION (NONPROCUREMENT) General Principles Relating to Suspension and Debarment Actions § 1404.630...

  17. The missed diagnosis

    International Nuclear Information System (INIS)

    Bundy, A.L.

    1988-01-01

    One of the questions that haunts the radiologist as he shuffles through piles of films is ''What am I missing?'' This same question takes on even more meaning when the radiologist is pressed for time, when he reluctantly checks the night work of the resident, when the patient left before more or better films could be obtained; or when the radiologist is involved in a subspecialty in which he is not properly trained. According to Dr. Berlin's survey, the missed diagnosis category accounted for the largest number of radiology malpractice cases. We all know that many diagnoses are more easily made using the ''retrospectoscope.'' But is the plaintiff attorney also adept at using this instrument? Just how knowledgeable must the radiologist be in the use of the ''prospectoscope''? A familiarity with cases that have already been tried should at least alert radiologists to the chances of their own involvement in litigation. While the missed diagnosis is by no means peculiar to the radiologist, it is one of the principal reasons that he may find himself in court

  18. "Nudge" and the epidemic of missed appointments.

    Science.gov (United States)

    Aggarwal, Ajay; Davies, Joanna; Sullivan, Richard

    2016-06-20

    Purpose - Missed appointments constitute a significant problem in the UK National Health Service (NHS) and this remains an area where improvements could yield substantial efficiency savings. The purpose of this paper is to suggest that nudge policies based on behavioural theories may help target interventions to improve patient motivation to attend appointments. Design/methodology/approach - The authors propose two policies to reduce missed appointments. The first attempts to empower patients through making the appointment system more individualised to them and utilising their intrinsic feelings of social responsibility. The second policy utilises a financial commitment given by the patient at the time of booking. The different mechanisms of influencing patient behaviour are based on two different views of what motivates individuals' actions. The first policy is based on individuals being "knights". They are altruistic and have well-intentioned values. The second policy option is constructed on the premise that an individual is governed by self-interest, and they are in fact "knaves". Findings - A policy, which avoids the use of financial penalties is likely to be more culturally acceptable within the NHS. It could also prevent the phenomenon of "crowding out" whereby the desire to act dutifully gets displaced by the motivation to avoid incurring a monetary fine. Originality/value - Testing both strategies would provide insight into patient attitudes towards health care and society. This would help optimise behavioural strategies which may influence not only appointment attendances but also have wider implications for encouraging rational health care consumption.

  19. Inconsistent definitions for intention-to-treat in relation to missing outcome data: systematic review of the methods literature.

    Science.gov (United States)

    Alshurafa, Mohamad; Briel, Matthias; Akl, Elie A; Haines, Ted; Moayyedi, Paul; Gentles, Stephen J; Rios, Lorena; Tran, Chau; Bhatnagar, Neera; Lamontagne, Francois; Walter, Stephen D; Guyatt, Gordon H

    2012-01-01

    Authors of randomized trial reports seem to hold a variety of views regarding the relationship between missing outcome data (MOD) and intention to treat (ITT). The objectives of this study were to systematically investigate how authors of methodology articles define ITT in the presence of MOD, how they recommend handling MOD under ITT, and to make a proposal for potential improvement in the definition and use of ITT in relation to MOD. We systematically searched MEDLINE in February 2009 for methodological articles written in English that devoted at least one paragraph to ITT and two other paragraphs to either ITT or MOD. We excluded original trial reports, observational studies, and clinical systematic reviews. Working in teams of two, we independently extracted relevant information from each eligible article. Of 1007 titles and abstracts reviewed, 66 articles met eligibility criteria. Five (8%) did not provide a definition of ITT; 25 (38%) mentioned MOD but did not discuss its relationship to ITT; and 36 (55%) discussed the relationship of MOD with ITT. These 36 articles described one or more of three statements: complete follow-up is required for ITT (58%); ITT and MOD are separate issues (17%); and ITT requires a specific strategy for handling MOD (78%); 17 (47%) endorsed more than one relationship. The most frequently mentioned strategies for handling MOD within ITT were: using the last outcome carried forward (50%); sensitivity analysis (50%); and use of available data to impute missing data (46%). We found that there is no consensus on the definition of ITT in relation to MOD. For conceptual clarity, we suggest that both reports of randomized trials and systematic reviews separately consider and describe how they deal with participants with complete data and those with MOD.

  20. Candidate gene analysis using imputed genotypes: cell cycle single-nucleotide polymorphisms and ovarian cancer risk

    DEFF Research Database (Denmark)

    Goode, Ellen L; Fridley, Brooke L; Vierkant, Robert A

    2009-01-01

    Polymorphisms in genes critical to cell cycle control are outstanding candidates for association with ovarian cancer risk; numerous genes have been interrogated by multiple research groups using differing tagging single-nucleotide polymorphism (SNP) sets. To maximize information gleaned from......, and rs3212891; CDK2 rs2069391, rs2069414, and rs17528736; and CCNE1 rs3218036. These results exemplify the utility of imputation in candidate gene studies and lend evidence to a role of cell cycle genes in ovarian cancer etiology, suggest a reduced set of SNPs to target in additional cases and controls....

  1. Incomplete Early Childhood Immunization Series and Missing Fourth DTaP Immunizations; Missed Opportunities or Missed Visits?

    Science.gov (United States)

    Robison, Steve G

    2013-01-01

    The successful completion of early childhood immunizations is a proxy for overall quality of early care. Immunization statuses are usually assessed by up-to-date (UTD) rates covering combined series of different immunizations. However, series UTD rates often only bear on which single immunization is missing, rather than the success of all immunizations. In the US, most series UTD rates are limited by missing fourth DTaP-containing immunizations (diphtheria/tetanus/pertussis) due at 15 to 18 months of age. Missing 4th DTaP immunizations are associated either with a lack of visits at 15 to 18 months of age, or to visits without immunizations. Typical immunization data however cannot distinguish between these two reasons. This study compared immunization records from the Oregon ALERT IIS with medical encounter records for two-year olds in the Oregon Health Plan. Among those with 3 valid DTaPs by 9 months of age, 31.6% failed to receive a timely 4th DTaP; of those without a 4th DTaP, 42.1% did not have any provider visits from 15 through 18 months of age, while 57.9% had at least one provider visit. Those with a 4th DTaP averaged 2.45 encounters, while those with encounters but without 4th DTaPs averaged 2.23 encounters.

  2. STUDY ON MATERNAL MORTALITY AND NEAR MISS CASE

    Directory of Open Access Journals (Sweden)

    Ritanjali Behera

    2017-12-01

    Full Text Available BACKGROUND Maternal mortality traditionally has been the indicator of maternal health. More recently the review of cases of near miss obstetric event is found to be useful to investigate maternal mortality. Cases of near miss are those, where a woman nearly died but survived a complication that occur during pregnancy or child birth. Aim and Objective 1. To analyse near miss cases and maternal deaths. 2. To determine maternal near miss indicator and to analyse the cause and contributing factors for both of them. MATERIALS AND METHODS This prospective observational study conducted in M.K.C.G. medical college, Berhampur from 1st October 2015 to 30th September 2017. All the cases of maternal deaths and near miss cases defined by WHO criteria are taken. Information regarding demographic profile and reproductive parameters are collected and results are analysed using percentage and proportion. RESULTS Out of 17977 deliveries 201 were near miss cases and 116 were maternal deaths. MMR was 681, near miss incidence 1.18, maternal death to near miss ratio was 1:1.73. Hypertensive disorder of pregnancy (37.4% was the leading cause followed by haemorrhage (17.4%. For near miss cases 101 cases fulfilled clinical criteria, 61 laboratory criteria and 131 cases management based criteria. CONCLUSION Hypertensive disorder of pregnancy and haemorrhage are the leading cause of maternal death and for near miss cases most common organ system involved was cardiovascular system. All the near miss cases should be interpreted as opportunities to improve the health care services.

  3. The missing mass of the universe

    International Nuclear Information System (INIS)

    Lachieze-Rey, M.

    2002-01-01

    The existence of dark matter is suggested by 2 facts: 1) the real mass of matter inside galaxies must be 10 times greater than the observed mass to explain the values of the spinning velocity of galaxies around their centers. Furthermore the value of this velocity does not depend a lot on the distance to the center of the galaxy, this implies that the missing mass is uniformly distributed inside galaxies; 2) According to general relativity, massive celestial bodies produce a curvature of space-time that generates a deviation of light beams. These deviations have been studied and it appears that they require the presence of a far more important quantity of matter than the quantity reduced to visible matter. The missing mass issue arises 3 problems, the first problem comes from the existence of a great part of ordinary (baryonic) matter that is invisible: the global mass of stars represents only 10 % of the total baryonic mass of the universe. This invisible ordinary matter might exist in condensed form in black-hole, giant planets or brown dwarfs roaming the galaxies. The second problem arose when most scientists were convinced of the existence of huge quantity of non-baryonic matter, 10 times more abundant than the baryonic matter. The supersymmetric extension of the standard model allows the existence of particles that might be candidate for carrying this non-baryonic mass. The third problem appeared recently when measurement of the curvature of the space-time has shown that the 3 forms of matter: visible matter, invisible ordinary matter and non-baryonic matter contribute together to only one third of the total energy of the universe. (A.C.)

  4. Polynomials, Riemann surfaces, and reconstructing missing-energy events

    CERN Document Server

    Gripaios, Ben; Webber, Bryan

    2011-01-01

    We consider the problem of reconstructing energies, momenta, and masses in collider events with missing energy, along with the complications introduced by combinatorial ambiguities and measurement errors. Typically, one reconstructs more than one value and we show how the wrong values may be correlated with the right ones. The problem has a natural formulation in terms of the theory of Riemann surfaces. We discuss examples including top quark decays in the Standard Model (relevant for top quark mass measurements and tests of spin correlation), cascade decays in models of new physics containing dark matter candidates, decays of third-generation leptoquarks in composite models of electroweak symmetry breaking, and Higgs boson decay into two tau leptons.

  5. Multiple Imputation of Groundwater Data to Evaluate Spatial and Temporal Anthropogenic Influences on Subsurface Water Fluxes in Los Angeles, CA

    Science.gov (United States)

    Manago, K. F.; Hogue, T. S.; Hering, A. S.

    2014-12-01

    In the City of Los Angeles, groundwater accounts for 11% of the total water supply on average, and 30% during drought years. Due to ongoing drought in California, increased reliance on local water supply highlights the need for better understanding of regional groundwater dynamics and estimating sustainable groundwater supply. However, in an urban setting, such as Los Angeles, understanding or modeling groundwater levels is extremely complicated due to various anthropogenic influences such as groundwater pumping, artificial recharge, landscape irrigation, leaking infrastructure, seawater intrusion, and extensive impervious surfaces. This study analyzes anthropogenic effects on groundwater levels using groundwater monitoring well data from the County of Los Angeles Department of Public Works. The groundwater data is irregularly sampled with large gaps between samples, resulting in a sparsely populated dataset. A multiple imputation method is used to fill the missing data, allowing for multiple ensembles and improved error estimates. The filled data is interpolated to create spatial groundwater maps utilizing information from all wells. The groundwater data is evaluated at a monthly time step over the last several decades to analyze the effect of land cover and identify other influencing factors on groundwater levels spatially and temporally. Preliminary results show irrigated parks have the largest influence on groundwater fluctuations, resulting in large seasonal changes, exceeding changes in spreading grounds. It is assumed that these fluctuations are caused by watering practices required to sustain non-native vegetation. Conversely, high intensity urbanized areas resulted in muted groundwater fluctuations and behavior decoupling from climate patterns. Results provides improved understanding of anthropogenic effects on groundwater levels in addition to providing high quality datasets for validation of regional groundwater models.

  6. Missing transverse energy performance of the CMS detector

    Energy Technology Data Exchange (ETDEWEB)

    Chatrchyan, Serguei [Yerevan Physics Inst. (Armenia); et al.

    2011-09-01

    During 2010 the LHC delivered pp collisions with a centre-of-mass energy of 7 TeV. In this paper, the results of comprehensive studies of missing transverse energy as measured by the CMS detector are presented. The results cover the measurements of the scale and resolution for missing transverse energy, and the effects of multiple pp interactions within the same bunch crossings on the scale and resolution. Anomalous measurements of missing transverse energy are studied, and algorithms for their identification are described. The performances of several reconstruction algorithms for calculating missing transverse energy are compared. An algorithm, called missing-transverse-energy significance, which estimates the compatibility of the reconstructed missing transverse energy with zero, is described, and its performance is demonstrated.

  7. Missing transverse energy performance of the CMS detector

    International Nuclear Information System (INIS)

    2011-01-01

    During 2010 the LHC delivered pp collisions with a centre-of-mass energy of 7 TeV. In this paper, the results of comprehensive studies of missing transverse energy as measured by the CMS detector are presented. The results cover the measurements of the scale and resolution for missing transverse energy, and the effects of multiple pp interactions within the same bunch crossings on the scale and resolution. Anomalous measurements of missing transverse energy are studied, and algorithms for their identification are described. The performance of several reconstruction algorithms for calculating missing transverse energy are compared. An algorithm, called missing-transverse-energy significance, which estimates the compatibility of the reconstructed missing transverse energy with zero, is described, and its performance is demonstrated.

  8. Servizi finanziari imputati e interdipendenze settoriali: un'analisi settoriale del ruolo del credito nel sistema economico. (Imputed bank services and sectoral interdependences: a structural analysis of the role of credit in the economy

    Directory of Open Access Journals (Sweden)

    C. BIANCHI

    2013-12-01

    Full Text Available I sistemi di contabilità nazionale in base alla metodologia SEC sono soliti comportarsi in modo da garantire l'impossibilità pratica di effettuare qualsiasi assegnazione significativa di servizi bancari imputati tra i singoli rami di attività economica.  Il presente lavoro mostra come questo vieta l'analisi strutturale del ruolo del credito nel sistema di interdipendenze. . L'analisi mette in evidenza la duplice natura del credito come contenuti a valore aggiunto altamente intermedio e alto. È in grado di influenzare forte su i costi di produzione degli altri rami, senza essere influenzato da loro. Queste proprietà conferiscono al settore bancario un potenziale molto elevato per l'inflazione.National accounts systems based on the SEC methodology are usually thought to comport the practical impossibility of carrying out any meaningful allocation of imputed bank services among the single branches of economic activity. As a consequence, the total value of the net interest earned by the credit system as a whole is considered as a cost entry and a negative component of added value in an ad-hoc additional industry, to be aggregated to the main credit one in the typical input-output analysis. The present work shows how this prohibits the structural analysis of the role of credit in the system of interdependencies. A method of is proposed in which imputed services of credit are distributed by branches, on the basis of existing statistics, proving valuable in assessing the significance of certain quantities of national accounts, such as operating results. The analysis highlights the dual nature of credit as highly intermediate and high value-added content. It is able to strongly influence the production costs of the other branches, without being influenced by them. These properties give the banking industry a very high potential for inflation. JEL: E51, G21

  9. Reducing Misses and Near Misses Related to Multitasking on the Electronic Health Record: Observational Study and Qualitative Analysis.

    Science.gov (United States)

    Ratanawongsa, Neda; Matta, George Y; Bohsali, Fuad B; Chisolm, Margaret S

    2018-02-06

    Clinicians' use of electronic health record (EHR) systems while multitasking may increase the risk of making errors, but silent EHR system use may lower patient satisfaction. Delaying EHR system use until after patient visits may increase clinicians' EHR workload, stress, and burnout. We aimed to describe the perspectives of clinicians, educators, administrators, and researchers about misses and near misses that they felt were related to clinician multitasking while using EHR systems. This observational study was a thematic analysis of perspectives elicited from 63 continuing medical education (CME) participants during 2 workshops and 1 interactive lecture about challenges and strategies for relationship-centered communication during clinician EHR system use. The workshop elicited reflection about memorable times when multitasking EHR use was associated with "misses" (errors that were not caught at the time) or "near misses" (mistakes that were caught before leading to errors). We conducted qualitative analysis using an editing analysis style to identify codes and then select representative themes and quotes. All workshop participants shared stories of misses or near misses in EHR system ordering and documentation or patient-clinician communication, wondering about "misses we don't even know about." Risk factors included the computer's position, EHR system usability, note content and style, information overload, problematic workflows, systems issues, and provider and patient communication behaviors and expectations. Strategies to reduce multitasking EHR system misses included clinician transparency when needing silent EHR system use (eg, for prescribing), narrating EHR system use, patient activation during EHR system use, adapting visit organization and workflow, improving EHR system design, and improving team support and systems. CME participants shared numerous stories of errors and near misses in EHR tasks and communication that they felt related to EHR

  10. An interference account of the missing-VP effect

    Directory of Open Access Journals (Sweden)

    Markus eBader

    2015-06-01

    Full Text Available Sentences with doubly center-embedded relative clauses in which a verb phrase (VP is missing are sometimes perceived as grammatical, thus giving rise to an illusion of grammaticality. In this paper, we provide a new account of why missing-VP sentences, which are both complex and ungrammatical, lead to an illusion of grammaticality, the so-called missing-VP effect. We propose that the missing-VP effect in particular, and processing difficulties with multiply center-embedded clauses more generally, are best understood as resulting from interference during cue-based retrieval. When processing a sentence with double center-embedding, a retrieval error due to interference can cause the verb of an embedded clause to be erroneously attached into a higher clause. This can lead to an illusion of grammaticality in the case of missing-VP sentences and to processing complexity in the case of complete sentences with double center-embedding. Evidence for an interference account of the missing-VP effect comes from experiments that have investigated the missing-VP effect in German using a speeded grammaticality judgments procedure. We review this evidence and then present two new experiments that show that the missing VP effect can be found in German also with less restricting procedures. One experiment was a questionnaire study which required grammaticality judgments from participants but without imposing any time constraints. The second experiment used a self-paced reading procedure and did not require any judgments. Both experiments confirm the prior findings of missing-VP effects in German and also show that the missing-VP effect is subject to a primacy effect as known from the memory literature. Based on this evidence, we argue that an account of missing-VP effects in terms of interference during cue-based retrieval is superior to accounts in terms of limited memory resources or in terms of experience with embedded structures.

  11. Annual Coded Wire Tag Program; Oregon Missing Production Groups, 1994 Annual Report.

    Energy Technology Data Exchange (ETDEWEB)

    Garrison, Robert L.; Isaac, Dennis L.; Lewis, Mark A.

    1994-12-01

    The goal of this program is to develop the ability to estimate hatchery production survival values and evaluate effectiveness of Oregon hatcheries. To accomplish this goal. We are tagging missing production groups within hatcheries to assure each production group is identifiable to allow future evaluation upon recovery of tag data.

  12. Cox regression with missing covariate data using a modified partial likelihood method

    DEFF Research Database (Denmark)

    Martinussen, Torben; Holst, Klaus K.; Scheike, Thomas H.

    2016-01-01

    Missing covariate values is a common problem in survival analysis. In this paper we propose a novel method for the Cox regression model that is close to maximum likelihood but avoids the use of the EM-algorithm. It exploits that the observed hazard function is multiplicative in the baseline hazard...

  13. Dynamical Systems Analysis Applied to Working Memory Data

    Directory of Open Access Journals (Sweden)

    Fidan eGasimova

    2014-07-01

    Full Text Available In the present paper we investigate weekly fluctuations in the working memory capacity (WMC assessed over a period of two years. We use dynamical system analysis, specifically a second order linear differential equation, to model weekly variability in WMC in a sample of 112 9th graders. In our longitudinal data we use a B-spline imputation method to deal with missing data. The results show a significant negative frequency parameter in the data, indicating a cyclical pattern in weekly memory updating performance across time. We use a multilevel modeling approach to capture individual differences in model parameters and find that a higher initial performance level and a slower improvement at the MU task is associated with a slower frequency of oscillation. Additionally, we conduct a simulation study examining the analysis procedure’s performance using different numbers of B-spline knots and values of time delay embedding dimensions. Results show that the number of knots in the B-spline imputation influence accuracy more than the number of embedding dimensions.

  14. Breast carcinomas: why are they missed?

    Science.gov (United States)

    Muttarak, M; Pojchamarnwiputh, S; Chaiwun, B

    2006-10-01

    Mammography has proven to be an effective modality for the detection of early breast carcinoma. However, 4-34 percent of breast cancers may be missed at mammography. Delayed diagnosis of breast carcinoma results in an unfavourable prognosis. The objective of this study was to determine the causes and characteristics of breast carcinomas missed by mammography at our institution, with the aim of reducing the rate of missed carcinoma. We reviewed the reports of 13,191 mammograms performed over a five-year period. Breast Imaging Reporting and Data Systems (BI-RADS) were used for the mammographical assessment, and reports were cross-referenced with the histological diagnosis of breast carcinoma. Causes of missed carcinomas were classified. Of 344 patients who had breast carcinoma and had mammograms done prior to surgery, 18 (5.2 percent) failed to be diagnosed by mammography. Of these, five were caused by dense breast parenchyma obscuring the lesions, 11 were due to perception and interpretation errors, and one each from unusual lesion characteristics and poor positioning. Several factors, including dense breast parenchyma obscuring a lesion, perception error, interpretation error, unusual lesion characteristics, and poor technique or positioning, are possible causes of missed breast cancers.

  15. Handling missing data in ranked set sampling

    CERN Document Server

    Bouza-Herrera, Carlos N

    2013-01-01

    The existence of missing observations is a very important aspect to be considered in the application of survey sampling, for example. In human populations they may be caused by a refusal of some interviewees to give the true value for the variable of interest. Traditionally, simple random sampling is used to select samples. Most statistical models are supported by the use of samples selected by means of this design. In recent decades, an alternative design has started being used, which, in many cases, shows an improvement in terms of accuracy compared with traditional sampling. It is called R

  16. Estimating past hepatitis C infection risk from reported risk factor histories: implications for imputing age of infection and modeling fibrosis progression

    Directory of Open Access Journals (Sweden)

    Busch Michael P

    2007-12-01

    Full Text Available Abstract Background Chronic hepatitis C virus infection is prevalent and often causes hepatic fibrosis, which can progress to cirrhosis and cause liver cancer or liver failure. Study of fibrosis progression often relies on imputing the time of infection, often as the reported age of first injection drug use. We sought to examine the accuracy of such imputation and implications for modeling factors that influence progression rates. Methods We analyzed cross-sectional data on hepatitis C antibody status and reported risk factor histories from two large studies, the Women's Interagency HIV Study and the Urban Health Study, using modern survival analysis methods for current status data to model past infection risk year by year. We compared fitted distributions of past infection risk to reported age of first injection drug use. Results Although injection drug use appeared to be a very strong risk factor, models for both studies showed that many subjects had considerable probability of having been infected substantially before or after their reported age of first injection drug use. Persons reporting younger age of first injection drug use were more likely to have been infected after, and persons reporting older age of first injection drug use were more likely to have been infected before. Conclusion In cross-sectional studies of fibrosis progression where date of HCV infection is estimated from risk factor histories, modern methods such as multiple imputation should be used to account for the substantial uncertainty about when infection occurred. The models presented here can provide the inputs needed by such methods. Using reported age of first injection drug use as the time of infection in studies of fibrosis progression is likely to produce a spuriously strong association of younger age of infection with slower rate of progression.

  17. A Two-Step Approach for Analysis of Nonignorable Missing Outcomes in Longitudinal Regression: an Application to Upstate KIDS Study.

    Science.gov (United States)

    Liu, Danping; Yeung, Edwina H; McLain, Alexander C; Xie, Yunlong; Buck Louis, Germaine M; Sundaram, Rajeshwari

    2017-09-01

    Imperfect follow-up in longitudinal studies commonly leads to missing outcome data that can potentially bias the inference when the missingness is nonignorable; that is, the propensity of missingness depends on missing values in the data. In the Upstate KIDS Study, we seek to determine if the missingness of child development outcomes is nonignorable, and how a simple model assuming ignorable missingness would compare with more complicated models for a nonignorable mechanism. To correct for nonignorable missingness, the shared random effects model (SREM) jointly models the outcome and the missing mechanism. However, the computational complexity and lack of software packages has limited its practical applications. This paper proposes a novel two-step approach to handle nonignorable missing outcomes in generalized linear mixed models. We first analyse the missing mechanism with a generalized linear mixed model and predict values of the random effects; then, the outcome model is fitted adjusting for the predicted random effects to account for heterogeneity in the missingness propensity. Extensive simulation studies suggest that the proposed method is a reliable approximation to SREM, with a much faster computation. The nonignorability of missing data in the Upstate KIDS Study is estimated to be mild to moderate, and the analyses using the two-step approach or SREM are similar to the model assuming ignorable missingness. The two-step approach is a computationally straightforward method that can be conducted as sensitivity analyses in longitudinal studies to examine violations to the ignorable missingness assumption and the implications relative to health outcomes. © 2017 John Wiley & Sons Ltd.

  18. Analysis of Case-Control Association Studies: SNPs, Imputation and Haplotypes

    KAUST Repository

    Chatterjee, Nilanjan; Chen, Yi-Hau; Luo, Sheng; Carroll, Raymond J.

    2009-01-01

    Although prospective logistic regression is the standard method of analysis for case-control data, it has been recently noted that in genetic epidemiologic studies one can use the "retrospective" likelihood to gain major power by incorporating various population genetics model assumptions such as Hardy-Weinberg-Equilibrium (HWE), gene-gene and gene-environment independence. In this article we review these modern methods and contrast them with the more classical approaches through two types of applications (i) association tests for typed and untyped single nucleotide polymorphisms (SNPs) and (ii) estimation of haplotype effects and haplotype-environment interactions in the presence of haplotype-phase ambiguity. We provide novel insights to existing methods by construction of various score-tests and pseudo-likelihoods. In addition, we describe a novel two-stage method for analysis of untyped SNPs that can use any flexible external algorithm for genotype imputation followed by a powerful association test based on the retrospective likelihood. We illustrate applications of the methods using simulated and real data. © Institute of Mathematical Statistics, 2009.

  19. Analysis of Case-Control Association Studies: SNPs, Imputation and Haplotypes

    KAUST Repository

    Chatterjee, Nilanjan

    2009-11-01

    Although prospective logistic regression is the standard method of analysis for case-control data, it has been recently noted that in genetic epidemiologic studies one can use the "retrospective" likelihood to gain major power by incorporating various population genetics model assumptions such as Hardy-Weinberg-Equilibrium (HWE), gene-gene and gene-environment independence. In this article we review these modern methods and contrast them with the more classical approaches through two types of applications (i) association tests for typed and untyped single nucleotide polymorphisms (SNPs) and (ii) estimation of haplotype effects and haplotype-environment interactions in the presence of haplotype-phase ambiguity. We provide novel insights to existing methods by construction of various score-tests and pseudo-likelihoods. In addition, we describe a novel two-stage method for analysis of untyped SNPs that can use any flexible external algorithm for genotype imputation followed by a powerful association test based on the retrospective likelihood. We illustrate applications of the methods using simulated and real data. © Institute of Mathematical Statistics, 2009.

  20. Planned Missing Data Designs in Educational Psychology Research

    NARCIS (Netherlands)

    Rhemtulla, M.; Hancock, G.R.

    2016-01-01

    Although missing data are often viewed as a challenge for applied researchers, in fact missing data can be highly beneficial. Specifically, when the amount of missing data on specific variables is carefully controlled, a balance can be struck between statistical power and research costs. This

  1. Creating Web Sites The Missing Manual

    CERN Document Server

    MacDonald, Matthew

    2006-01-01

    Think you have to be a technical wizard to build a great web site? Think again. For anyone who wants to create an engaging web site--for either personal or business purposes--Creating Web Sites: The Missing Manual demystifies the process and provides tools, techniques, and expert guidance for developing a professional and reliable web presence. Like every Missing Manual, you can count on Creating Web Sites: The Missing Manual to be entertaining and insightful and complete with all the vital information, clear-headed advice, and detailed instructions you need to master the task at hand. Autho

  2. Scalable Tensor Factorizations with Missing Data

    DEFF Research Database (Denmark)

    Acar, Evrim; Dunlavy, Daniel M.; Kolda, Tamara G.

    2010-01-01

    of missing data, many important data sets will be discarded or improperly analyzed. Therefore, we need a robust and scalable approach for factorizing multi-way arrays (i.e., tensors) in the presence of missing data. We focus on one of the most well-known tensor factorizations, CANDECOMP/PARAFAC (CP...... is shown to successfully factor tensors with noise and up to 70% missing data. Moreover, our approach is significantly faster than the leading alternative and scales to larger problems. To show the real-world usefulness of CP-WOPT, we illustrate its applicability on a novel EEG (electroencephalogram...

  3. A Stochastic Method to Manage Delay and Missing Values for In-Situ Sensors in an Alternating Activated Sludge Process

    DEFF Research Database (Denmark)

    Stentoft, Peter Alexander; Munk-Nielsen, Thomas; Mikkelsen, Peter Steen

    2017-01-01

    . The measurements may also be temporarily unavailable because of recalibration, communication faults or other errors. Here we present a method that handles such delay and missing observations. The model is based on zero order hold stochastic differential equations which use binary signals for influent flow...

  4. Spacecraft intercept guidance using zero effort miss steering

    Science.gov (United States)

    Newman, Brett

    The suitability of proportional navigation, or an equivalent zero effort miss formulation, for spacecraft intercepts during midcourse guidance, followed by a ballistic coast to the endgame, is addressed. The problem is formulated in terms of relative motion in a general 3D framework. The proposed guidance law for the commanded thrust vector orientation consists of the sum of two terms: (1) along the line of sight unit direction and (2) along the zero effort miss component perpendicular to the line of sight and proportional to the miss itself and a guidance gain. If the guidance law is to be suitable for longer range targeting applications with significant ballistic coasting after burnout, determination of the zero effort miss must account for the different gravitational accelerations experienced by each vehicle. The proposed miss determination techniques employ approximations for the true differential gravity effect. Theoretical results are applied to a numerical engagement scenario and the resulting performance is evaluated in terms of the miss distances determined from nonlinear simulation.

  5. Dissolved organic nitrogen dynamics in the North Sea: A time series analysis (1995-2005)

    NARCIS (Netherlands)

    Van Engeland, T.; Soetaert, K.E.R.; Knuijt, A.; Laane, R.W.P.M.; Middelburg, J.J.

    2010-01-01

    Dissolved organic nitrogen (DON) dynamics in the North Sea was explored by means of long-term time series of nitrogen parameters from the Dutch national monitoring program. Generally, the data quality was good with little missing data points. Different imputation methods were used to verify the

  6. Methods to Minimize Zero-Missing Phenomenon

    DEFF Research Database (Denmark)

    da Silva, Filipe Miguel Faria; Bak, Claus Leth; Gudmundsdottir, Unnur Stella

    2010-01-01

    With the increasing use of high-voltage AC cables at transmission levels, phenomena such as current zero-missing start to appear more often in transmission systems. Zero-missing phenomenon can occur when energizing cable lines with shunt reactors. This may considerably delay the opening of the ci...

  7. Discovery and Fine-Mapping of Glycaemic and Obesity-Related Trait Loci Using High-Density Imputation.

    Science.gov (United States)

    Horikoshi, Momoko; Mӓgi, Reedik; van de Bunt, Martijn; Surakka, Ida; Sarin, Antti-Pekka; Mahajan, Anubha; Marullo, Letizia; Thorleifsson, Gudmar; Hӓgg, Sara; Hottenga, Jouke-Jan; Ladenvall, Claes; Ried, Janina S; Winkler, Thomas W; Willems, Sara M; Pervjakova, Natalia; Esko, Tõnu; Beekman, Marian; Nelson, Christopher P; Willenborg, Christina; Wiltshire, Steven; Ferreira, Teresa; Fernandez, Juan; Gaulton, Kyle J; Steinthorsdottir, Valgerdur; Hamsten, Anders; Magnusson, Patrik K E; Willemsen, Gonneke; Milaneschi, Yuri; Robertson, Neil R; Groves, Christopher J; Bennett, Amanda J; Lehtimӓki, Terho; Viikari, Jorma S; Rung, Johan; Lyssenko, Valeriya; Perola, Markus; Heid, Iris M; Herder, Christian; Grallert, Harald; Müller-Nurasyid, Martina; Roden, Michael; Hypponen, Elina; Isaacs, Aaron; van Leeuwen, Elisabeth M; Karssen, Lennart C; Mihailov, Evelin; Houwing-Duistermaat, Jeanine J; de Craen, Anton J M; Deelen, Joris; Havulinna, Aki S; Blades, Matthew; Hengstenberg, Christian; Erdmann, Jeanette; Schunkert, Heribert; Kaprio, Jaakko; Tobin, Martin D; Samani, Nilesh J; Lind, Lars; Salomaa, Veikko; Lindgren, Cecilia M; Slagboom, P Eline; Metspalu, Andres; van Duijn, Cornelia M; Eriksson, Johan G; Peters, Annette; Gieger, Christian; Jula, Antti; Groop, Leif; Raitakari, Olli T; Power, Chris; Penninx, Brenda W J H; de Geus, Eco; Smit, Johannes H; Boomsma, Dorret I; Pedersen, Nancy L; Ingelsson, Erik; Thorsteinsdottir, Unnur; Stefansson, Kari; Ripatti, Samuli; Prokopenko, Inga; McCarthy, Mark I; Morris, Andrew P

    2015-07-01

    Reference panels from the 1000 Genomes (1000G) Project Consortium provide near complete coverage of common and low-frequency genetic variation with minor allele frequency ≥0.5% across European ancestry populations. Within the European Network for Genetic and Genomic Epidemiology (ENGAGE) Consortium, we have undertaken the first large-scale meta-analysis of genome-wide association studies (GWAS), supplemented by 1000G imputation, for four quantitative glycaemic and obesity-related traits, in up to 87,048 individuals of European ancestry. We identified two loci for body mass index (BMI) at genome-wide significance, and two for fasting glucose (FG), none of which has been previously reported in larger meta-analysis efforts to combine GWAS of European ancestry. Through conditional analysis, we also detected multiple distinct signals of association mapping to established loci for waist-hip ratio adjusted for BMI (RSPO3) and FG (GCK and G6PC2). The index variant for one association signal at the G6PC2 locus is a low-frequency coding allele, H177Y, which has recently been demonstrated to have a functional role in glucose regulation. Fine-mapping analyses revealed that the non-coding variants most likely to drive association signals at established and novel loci were enriched for overlap with enhancer elements, which for FG mapped to promoter and transcription factor binding sites in pancreatic islets, in particular. Our study demonstrates that 1000G imputation and genetic fine-mapping of common and low-frequency variant association signals at GWAS loci, integrated with genomic annotation in relevant tissues, can provide insight into the functional and regulatory mechanisms through which their effects on glycaemic and obesity-related traits are mediated.

  8. Estimation of Tree Lists from Airborne Laser Scanning Using Tree Model Clustering and k-MSN Imputation

    Directory of Open Access Journals (Sweden)

    Jörgen Wallerman

    2013-04-01

    Full Text Available Individual tree crowns may be delineated from airborne laser scanning (ALS data by segmentation of surface models or by 3D analysis. Segmentation of surface models benefits from using a priori knowledge about the proportions of tree crowns, which has not yet been utilized for 3D analysis to any great extent. In this study, an existing surface segmentation method was used as a basis for a new tree model 3D clustering method applied to ALS returns in 104 circular field plots with 12 m radius in pine-dominated boreal forest (64°14'N, 19°50'E. For each cluster below the tallest canopy layer, a parabolic surface was fitted to model a tree crown. The tree model clustering identified more trees than segmentation of the surface model, especially smaller trees below the tallest canopy layer. Stem attributes were estimated with k-Most Similar Neighbours (k-MSN imputation of the clusters based on field-measured trees. The accuracy at plot level from the k-MSN imputation (stem density root mean square error or RMSE 32.7%; stem volume RMSE 28.3% was similar to the corresponding results from the surface model (stem density RMSE 33.6%; stem volume RMSE 26.1% with leave-one-out cross-validation for one field plot at a time. Three-dimensional analysis of ALS data should also be evaluated in multi-layered forests since it identified a larger number of small trees below the tallest canopy layer.

  9. Daniel Inman | NREL

    Science.gov (United States)

    . Bush, D. Inman, and R. Elmore. 2015. Evaluation of Methods for Comparison of Spatiotemporal and Time Series Datasets. Golden, CO: National Renewable Energy Laboratory. NREL/TP-6A20-62647. Vimmerstedt, L.J Elmore, and Brian Bush. "Case Study to Examine the Imputation of Missing Data to Improve Clustering

  10. Impact of teamwork on missed care in four Australian hospitals.

    Science.gov (United States)

    Chapman, Rose; Rahman, Asheq; Courtney, Mary; Chalmers, Cheyne

    2017-01-01

    Investigate effects of teamwork on missed nursing care across a healthcare network in Australia. Missed care is universally used as an indicator of quality nursing care, however, little is known about mitigating effects of teamwork on these events. A descriptive exploratory study. Missed Care and Team Work surveys were completed by 334 nurses. Using Stata software, nursing staff demographic information and components of missed care and teamwork were compared across the healthcare network. Statistical tests were performed to identify predicting factors for missed care. The most commonly reported components of missed care were as follows: ambulation three times per day (43·3%), turning patient every two hours (29%) and mouth care (27·7%). The commonest reasons mentioned for missed care were as follows: inadequate labour resources (range 69·8-52·7%), followed by material resources (range 59·3-33·3%) and communication (range 39·3-27·2%). There were significant differences in missed care scores across units. Using the mean scores in regression correlation matrix, the negative relationship of missed care and teamwork was supported (r = -0·34, p teamwork alone accounted for about 9% of missed nursing care. Similar to previous international research findings, our results showed nursing teamwork significantly impacted on missed nursing care. Teamwork may be a mitigating factor to address missed care and future research is needed. These results may provide administrators, educators and clinicians with information to develop practices and policies to improve patient care internationally. © 2016 John Wiley & Sons Ltd.

  11. Bayesian Approaches to Imputation, Hypothesis Testing, and Parameter Estimation

    Science.gov (United States)

    Ross, Steven J.; Mackey, Beth

    2015-01-01

    This chapter introduces three applications of Bayesian inference to common and novel issues in second language research. After a review of the critiques of conventional hypothesis testing, our focus centers on ways Bayesian inference can be used for dealing with missing data, for testing theory-driven substantive hypotheses without a default null…

  12. Maternal Near-Miss: A Multicenter Surveillance in Kathmandu Valley

    Directory of Open Access Journals (Sweden)

    Ashma Rana

    2013-06-01

    Full Text Available Introduction: Multicenter surveillance has been carried out on maternal near-miss in the hospitals with sentinel units. Near-miss is recognized as the predictor of level of care and maternal death. Reducing maternal mortality ratio is one of the challenges to achieve Millennium Development Goal. Objective was to determine the frequency and the nature of near-miss (severe acute maternal morbidity events and analysis of near-miss morbidities among pregnant women. Methods: Prospective surveillance was done for a year in 2012 in nine hospitals in Kathmandu valley. Cases eligible by definition recorded as a census based on WHO near-miss guideline. Similar questionnaire and dummy tables were used to present the result by non-inferential statistics. Results: Out of 157 cases identified with near-miss rate of 3.8, severe complications were PPH (40% and preeclampsia-eclampsia (17%. Blood transfusion (65%, ICU admission (54% and surgery (32% were the common critical intervention. Oxytocin was the main uterotonic used both prophylactically (86% and therapeutically (76%, and 19% arrived health facility after delivery or abortion. MgSO4 was used in all cases of eclampsia. All of the laparotomies were performed within 3 hours of arrival. Near-miss to mortality ratio was 6:1 and MMR 62. Conclusions: Study result yields similar pattern amongst developing countries and same near-miss conditions as the causes of maternal death reported by national statistics. Process indicators qualify the recommended standard of care. The near-miss event can be used as a surrogate marker of maternal death and a window for system level intervention. Keywords: abortion, eclampsia, hemorrhage, near-miss, surveillance

  13. Netbooks The Missing Manual

    CERN Document Server

    Biersdorfer, J

    2009-01-01

    Netbooks are the hot new thing in PCs -- small, inexpensive laptops designed for web browsing, email, and working with web-based programs. But chances are you don't know how to choose a netbook, let alone use one. Not to worry: with this Missing Manual, you'll learn which netbook is right for you and how to set it up and use it for everything from spreadsheets for work to hobbies like gaming and photo sharing. Netbooks: The Missing Manual provides easy-to-follow instructions and lots of advice to help you: Learn the basics for using a Windows- or Linux-based netbookConnect speakers, printe

  14. PCs The Missing Manual

    CERN Document Server

    Karp, David

    2005-01-01

    Your vacuum comes with one. Even your blender comes with one. But your PC--something that costs a whole lot more and is likely to be used daily and for tasks of far greater importance and complexity--doesn't come with a printed manual. Thankfully, that's not a problem any longer: PCs: The Missing Manual explains everything you need to know about PCs, both inside and out, and how to keep them running smoothly and working the way you want them to work. A complete PC manual for both beginners and power users, PCs: The Missing Manual has something for everyone. PC novices will appreciate the una

  15. What's missing in missing data? Omissions in survey responses among parents of children with advanced cancer.

    Science.gov (United States)

    Rosenberg, Abby R; Dussel, Veronica; Orellana, Liliana; Kang, Tammy; Geyer, J Russel; Feudtner, Chris; Wolfe, Joanne

    2014-08-01

    Missing data is a common phenomenon with survey-based research; patterns of missing data may elucidate why participants decline to answer certain questions. To describe patterns of missing data in the Pediatric Quality of Life and Evaluation of Symptoms Technology (PediQUEST) study, and highlight challenges in asking sensitive research questions. Cross-sectional, survey-based study embedded within a randomized controlled trial. Three large children's hospitals: Dana-Farber/Boston Children's Cancer and Blood Disorders Center (DF/BCCDC); Children's Hospital of Philadelphia (CHOP); and Seattle Children's Hospital (SCH). At the time of their child's enrollment, parents completed the Survey about Caring for Children with Cancer (SCCC), including demographics, perceptions of prognosis, treatment goals, quality of life, and psychological distress. Eighty-six of 104 parents completed surveys (83% response). The proportion of missing data varied by question type. While 14 parents (16%) left demographic fields blank, over half (n=48; 56%) declined to answer at least one question about their child's prognosis, especially life expectancy. The presence of missing data was unrelated to the child's diagnosis, time from progression, time to death, or parent distress (p>0.3 for each). Written explanations in survey margins suggested that addressing a child's life expectancy is particularly challenging for parents. Parents of children with cancer commonly refrain from answering questions about their child's prognosis, however, they may be more likely to address general cure likelihood than explicit life expectancy. Understanding acceptability of sensitive questions in survey-based research will foster higher quality palliative care research.

  16. Multiple-Trait Genomic Selection Methods Increase Genetic Value Prediction Accuracy

    Science.gov (United States)

    Jia, Yi; Jannink, Jean-Luc

    2012-01-01

    Genetic correlations between quantitative traits measured in many breeding programs are pervasive. These correlations indicate that measurements of one trait carry information on other traits. Current single-trait (univariate) genomic selection does not take advantage of this information. Multivariate genomic selection on multiple traits could accomplish this but has been little explored and tested in practical breeding programs. In this study, three multivariate linear models (i.e., GBLUP, BayesA, and BayesCπ) were presented and compared to univariate models using simulated and real quantitative traits controlled by different genetic architectures. We also extended BayesA with fixed hyperparameters to a full hierarchical model that estimated hyperparameters and BayesCπ to impute missing phenotypes. We found that optimal marker-effect variance priors depended on the genetic architecture of the trait so that estimating them was beneficial. We showed that the prediction accuracy for a low-heritability trait could be significantly increased by multivariate genomic selection when a correlated high-heritability trait was available. Further, multiple-trait genomic selection had higher prediction accuracy than single-trait genomic selection when phenotypes are not available on all individuals and traits. Additional factors affecting the performance of multiple-trait genomic selection were explored. PMID:23086217

  17. Missed opportunities for HPV immunization among young adult women

    Science.gov (United States)

    Oliveira, Carlos R.; Rock, Robert M.; Shapiro, Eugene D.; Xu, Xiao; Lundsberg, Lisbet; Zhang, Liye B.; Gariepy, Aileen; Illuzzi, Jessica L.; Sheth, Sangini S.

    2018-01-01

    BACKGROUND Despite the availability of a safe and efficacious vaccine against human papillomavirus, uptake of the vaccine in the United States is low. Missed clinical opportunities to recommend and to administer human papillomavirus vaccine are considered one of the most important reasons for its low uptake in adolescents; however, little is known about the frequency or characteristics of missed opportunities in the young adult (18–26 years of age) population. OBJECTIVE The objective of the study was to assess both the rates of and the factors associated with missed opportunities for human papillomavirus immunization among young adult women who attended an urban obstetrics and gynecology clinic. STUDY DESIGN In this cross-sectional study, medical records were reviewed for all women 18–26 years of age who were underimmunized (<3 doses) and who sought care from Feb. 1, 2013, to January 31, 2014, at an urban, hospital-based obstetrics and gynecology clinic. A missed opportunity for human papillomavirus immunization was defined as a clinic visit at which the patient was eligible to receive the vaccine and a dose was due but not administered. Multivariable logistic regression was used to test associations between sociodemographic variables and missed opportunities. RESULTS There were 1670 vaccine-eligible visits by 1241 underimmunized women, with a mean of 1.3 missed opportunities/person. During the study period, 833 of the vaccine eligible women (67.1%) had at least 1 missed opportunity. Overall, the most common types of visits during which a missed opportunity occurred were postpartum visits (17%) or visits for either sexually transmitted disease screening (21%) or contraception (33%). Of the patients with a missed opportunity, 26.5% had a visit at which an injectable medication or a different vaccine was administered. Women who identified their race as black had higher adjusted odds of having a missed opportunity compared with white women (adjusted odds ratio, 1

  18. Discovery and Fine-Mapping of Glycaemic and Obesity-Related Trait Loci Using High-Density Imputation.

    Directory of Open Access Journals (Sweden)

    Momoko Horikoshi

    2015-07-01

    Full Text Available Reference panels from the 1000 Genomes (1000G Project Consortium provide near complete coverage of common and low-frequency genetic variation with minor allele frequency ≥0.5% across European ancestry populations. Within the European Network for Genetic and Genomic Epidemiology (ENGAGE Consortium, we have undertaken the first large-scale meta-analysis of genome-wide association studies (GWAS, supplemented by 1000G imputation, for four quantitative glycaemic and obesity-related traits, in up to 87,048 individuals of European ancestry. We identified two loci for body mass index (BMI at genome-wide significance, and two for fasting glucose (FG, none of which has been previously reported in larger meta-analysis efforts to combine GWAS of European ancestry. Through conditional analysis, we also detected multiple distinct signals of association mapping to established loci for waist-hip ratio adjusted for BMI (RSPO3 and FG (GCK and G6PC2. The index variant for one association signal at the G6PC2 locus is a low-frequency coding allele, H177Y, which has recently been demonstrated to have a functional role in glucose regulation. Fine-mapping analyses revealed that the non-coding variants most likely to drive association signals at established and novel loci were enriched for overlap with enhancer elements, which for FG mapped to promoter and transcription factor binding sites in pancreatic islets, in particular. Our study demonstrates that 1000G imputation and genetic fine-mapping of common and low-frequency variant association signals at GWAS loci, integrated with genomic annotation in relevant tissues, can provide insight into the functional and regulatory mechanisms through which their effects on glycaemic and obesity-related traits are mediated.

  19. A Nonparametric, Multiple Imputation-Based Method for the Retrospective Integration of Data Sets

    Science.gov (United States)

    Carrig, Madeline M.; Manrique-Vallier, Daniel; Ranby, Krista W.; Reiter, Jerome P.; Hoyle, Rick H.

    2015-01-01

    Complex research questions often cannot be addressed adequately with a single data set. One sensible alternative to the high cost and effort associated with the creation of large new data sets is to combine existing data sets containing variables related to the constructs of interest. The goal of the present research was to develop a flexible, broadly applicable approach to the integration of disparate data sets that is based on nonparametric multiple imputation and the collection of data from a convenient, de novo calibration sample. We demonstrate proof of concept for the approach by integrating three existing data sets containing items related to the extent of problematic alcohol use and associations with deviant peers. We discuss both necessary conditions for the approach to work well and potential strengths and weaknesses of the method compared to other data set integration approaches. PMID:26257437

  20. A three-source capture-recapture estimate of the number of new HIV diagnoses in children in France from 2003–2006 with multiple imputation of a variable of heterogeneous catchability

    Directory of Open Access Journals (Sweden)

    Héraud-Bousquet Vanina

    2012-10-01

    Full Text Available Abstract Background Nearly all HIV infections in children worldwide are acquired through mother-to-child transmission (MTCT during pregnancy, labour, delivery or breastfeeding. The objective of our study was to estimate the number and rate of new HIV diagnoses in children less than 13 years of age in mainland France from 2003–2006. Methods We performed a capture-recapture analysis based on three sources of information: the mandatory HIV case reporting (DOVIH, the French Perinatal Cohort (ANRS-EPF and a laboratory-based surveillance of HIV (LaboVIH. The missing values of a variable of heterogeneous catchability were estimated through multiple imputation. Log-linear modelling provided estimates of the number of new HIV infections in children, taking into account dependencies between sources and variables of heterogeneous catchability. Results The three sources observed 216 new HIV diagnoses after record-linkage. The number of new HIV diagnoses in children was estimated at 387 (95%CI [271–503] from 2003–2006, among whom 60% were born abroad. The estimated rate of new HIV diagnoses in children in mainland France was 9.1 per million in 2006 and was 38 times higher in children born abroad than in those born in France. The estimated completeness of the three sources combined was 55.8% (95% CI [42.9 – 79.7] and varied according to the source; the completeness of DOVIH (28.4% and ANRS-EPF (26.1% were lower than that of LaboVIH (33.3%. Conclusion Our study provided, for the first time, an estimated annual rate of new HIV diagnoses in children under 13 years old in mainland France. A more systematic HIV screening of pregnant women that is repeated during pregnancy among women likely to engage in risky behaviour is needed to optimise the prevention of MTCT. HIV screening for children who migrate from countries with high HIV prevalence to France could be recommended to facilitate early diagnosis and treatment.

  1. A systematic review of missed opportunities for improving tuberculosis and HIV/AIDS control in Sub-saharan Africa: what is still missed by health experts?

    Science.gov (United States)

    Keugoung, Basile; Fouelifack, Florent Ymele; Fotsing, Richard; Macq, Jean; Meli, Jean; Criel, Bart

    2014-01-01

    In sub-Saharan Africa, HIV/AIDS and tuberculosis are major public health problems. In 2010, 64% of the 34 million of people infected with HIV were reported to be living in sub-Saharan Africa. Only 41% of eligible HIV-positive people had access to antiretroviral therapy (ART). Regarding tuberculosis, in 2010, the region had 12% of the world's population but reported 26% of the 8.8 million incident cases and 254000 tuberculosis-related deaths. This paper aims to review missed opportunities for improving HIV/AIDS and tuberculosis prevention and care. We conducted a systematic review in PubMed using the terms 'missed'(Title) AND 'opportunities'(Title). We included systematic review and original research articles done in sub-Saharan Africa on missed opportunities in HIV/AIDS and/or tuberculosis care. Missed opportunities for improving HIV/AIDS and/or tuberculosis care can be classified into five categories: i) patient and community; ii) health professional; iii) health facility; iv) local health system; and v) vertical programme (HIV/AIDS and/or tuberculosis control programmes). None of the reviewed studies identified any missed opportunities related to health system strengthening. Opportunities that are missed hamper tuberculosis and/or HIV/AIDS care in sub-Saharan Africa where health systems remain weak. What is still missing in the analysis of health experts is the acknowledgement that opportunities that are missed to strengthen health systems also undermine tuberculosis and HIV/AIDS prevention and care. Studying why these opportunities are missed will help to understand the rationales behind the missed opportunities, and customize adequate strategies to seize them and for effective diseases control.

  2. All hypertopologies are hit-and-miss

    Directory of Open Access Journals (Sweden)

    Somshekhar Naimpally

    2002-04-01

    Full Text Available We solve a long standing problem by showing that all known hypertopologies are hit-and-miss. Our solution is not merely of theoretical importance. This representation is useful in the study of comparison of the Hausdorff-Bourbaki or H-B uniform topologies and the Wijsman topologies among themselves and with others. Up to now some of these comparisons needed intricate manipulations. The H-B uniform topologies were the subject of intense activity in the 1960's in connection with the Isbell-Smith problem. We show that they are proximally locally finite topologies from which the solution to the above problem follows easily. It is known that the Wijsman topology on the hyperspace is the proximal ball (hit-and-miss topology in”nice” metric spaces including the normed linear spaces. With the introduction of a new far-miss topology we show that the Wijsman topology is hit-and-miss for all metric spaces. From this follows a natural generalization of the Wijsman topology to the hyperspace of any T1 space. Several existing results in the literature are easy consequences of our work.

  3. Gap-filling meteorological variables with Empirical Orthogonal Functions

    Science.gov (United States)

    Graf, Alexander

    2017-04-01

    Gap-filling or modelling surface-atmosphere fluxes critically depends on an, ideally continuous, availability of their meteorological driver variables, such as e.g. air temperature, humidity, radiation, wind speed and precipitation. Unlike for eddy-covariance-based fluxes, data gaps are not unavoidable for these measurements. Nevertheless, missing or erroneous data can occur in practice due to instrument or power failures, disturbance, and temporary sensor or station dismounting for e.g. agricultural management or maintenance. If stations with similar measurements are available nearby, using their data for imputation (i.e. estimating missing data) either directly, after an elevation correction or via linear regression, is usually preferred over linear interpolation or monthly mean diurnal cycles. The popular implementation of regional networks of (partly low-cost) stations increases both, the need and the potential, for such neighbour-based imputation methods. For repeated satellite imagery, Beckers and Rixen (2003) suggested an imputation method based on empirical orthogonal functions (EOFs). While exploiting the same linear relations between time series at different observation points as regression, it is able to use information from all observation points to simultaneously estimate missing data at all observation points, provided that never all observations are missing at the same time. Briefly, the method uses the ability of the first few EOFs of a data matrix to reconstruct a noise-reduced version of this matrix; iterating missing data points from an initial guess (the column-wise averages) to an optimal version determined by cross-validation. The poster presents and discusses lessons learned from adapting and applying this methodology to station data. Several years of 10-minute averages of air temperature, pressure and humidity, incoming shortwave, longwave and photosynthetically active radiation, wind speed and precipitation, measured by a regional (70 km by

  4. Maternal near-miss: a multicenter surveillance in Kathmandu Valley.

    Science.gov (United States)

    Rana, Ashma; Baral, Gehanath; Dangal, Ganesh

    2013-01-01

    Multicenter surveillance has been carried out on maternal near-miss in the hospitals with sentinel units. Near-miss is recognized as the predictor of level of care and maternal death. Reducing Maternal Mortality Ratio is one of the challenges to achieve Millennium Development Goal. The objective was to determine the frequency and the nature of near-miss events and to analyze the near-miss morbidities among pregnant women. A prospective surveillance was done for a year in 2012 at nine hospitals in Kathmandu valley. Cases eligible by definition were recorded as a census based on WHO near-miss guideline. Similar questionnaires and dummy tables were used to present the results by non-inferential statistics. Out of 157 cases identified with near-miss rate of 3.8 per 1000 live births, severe complications were postpartum hemorrhage 62 (40%) and preeclampsia-eclampsia 25 (17%). Blood transfusion 102 (65%), ICU admission 85 (54%) and surgery 53 (32%) were common critical interventions. Oxytocin was main uterotonic used both prophylactically and therapeutically at health facilities. Total of 30 (19%) cases arrived at health facility after delivery or abortion. MgSO4 was used in all cases of eclampsia. All laparotomies were performed within three hours of arrival. Near-miss to maternal death ratio was 6:1 and MMR was 62. Study result yielded similar pattern amongst developing countries and same near-miss conditions as the causes of maternal death reported by national statistics. Process indicators qualified the recommended standard of care. The near-miss event could be used as a surrogate marker of maternal death and a window for system level intervention.

  5. Assessment of Consequences of Replacement of System of the Uniform Tax on Imputed Income Patent System of the Taxation

    Directory of Open Access Journals (Sweden)

    Galina A. Manokhina

    2012-11-01

    Full Text Available The article highlights the main questions concerning possible consequences of replacement of nowadays operating system in the form of a single tax in reference to imputed income with patent system of the taxation. The main advantages and drawbacks of new system of the taxation are shown, including the opinion that not the replacement of one special mode of the taxation with another is more effective, but the introduction of patent a taxation system as an auxilary system.

  6. Missed Opportunities For Immunization In Children And Pregnant Women

    Directory of Open Access Journals (Sweden)

    Benjamin A I

    1990-01-01

    Full Text Available The role of immunization in reducing childhood mortality cannot be over-emphasised, yet many opportunities for immunization are missed when children and pregnant women visit a health facility. Reducing missed opportunities is the cheapest way to increase immunization coverage. The present study discusses the extent of the problem of missed opportunities for immunization in children and pregnant women and the factors contributing to the problem, in spatiality and community outreach clinics of Christian Medical College & Hospital, Ludhiana. Recommendations are made regarding ways and means of reducing missed opportunities.

  7. Methods for Handling Missing Secondary Respondent Data

    Science.gov (United States)

    Young, Rebekah; Johnson, David

    2013-01-01

    Secondary respondent data are underutilized because researchers avoid using these data in the presence of substantial missing data. The authors reviewed, evaluated, and tested solutions to this problem. Five strategies of dealing with missing partner data were reviewed: (a) complete case analysis, (b) inverse probability weighting, (c) correction…

  8. Reducing Competitive Cache Misses in Modern Processor Architectures

    OpenAIRE

    Prisagjanec, Milcho; Mitrevski, Pece

    2017-01-01

    The increasing number of threads inside the cores of a multicore processor, and competitive access to the shared cache memory, become the main reasons for an increased number of competitive cache misses and performance decline. Inevitably, the development of modern processor architectures leads to an increased number of cache misses. In this paper, we make an attempt to implement a technique for decreasing the number of competitive cache misses in the first level of cache memory. This tec...

  9. Missed opportunities in crystallography.

    Science.gov (United States)

    Dauter, Zbigniew; Jaskolski, Mariusz

    2014-09-01

    Scrutinized from the perspective of time, the giants in the history of crystallography more than once missed a nearly obvious chance to make another great discovery, or went in the wrong direction. This review analyzes such missed opportunities focusing on macromolecular crystallographers (using Perutz, Pauling, Franklin as examples), although cases of particular historical (Kepler), methodological (Laue, Patterson) or structural (Pauling, Ramachandran) relevance are also described. Linus Pauling, in particular, is presented several times in different circumstances, as a man of vision, oversight, or even blindness. His example underscores the simple truth that also in science incessant creativity is inevitably connected with some probability of fault. Published 2014. This article is a U.S. Government work and is in the public domain in the USA.

  10. Evaluation of Estimating Missed Answers in Conners Adult ADHD Rating Scale (Screening Version)

    Science.gov (United States)

    Ghassemi, Farnaz; Moradi, Mohammad Hassan; Tehrani-Doost, Mehdi

    2010-01-01

    Objective Conners Adult ADHD Rating Scale (CAARS) is among the valid questionnaires for evaluating Attention-Deficit/Hyperactivity Disorder in adults. The aim of this paper is to evaluate the validity of the estimation of missed answers in scoring the screening version of the Conners questionnaire, and to extract its principal components. Method This study was performed on 400 participants. Answer estimation was calculated for each question (assuming the answer was missed), and then a Kruskal-Wallis test was performed to evaluate the difference between the original answer and its estimation. In the next step, principal components of the questionnaire were extracted by means of Principal Component Analysis (PCA). Finally the evaluation of differences in the whole groups was provided using the Multiple Comparison Procedure (MCP). Results Findings indicated that a significant difference existed between the original and estimated answers for some particular questions. However, the results of MCP showed that this estimation, when evaluated in the whole group, did not show a significant difference with the original value in neither of the questionnaire subscales. The results of PCA revealed that there are eight principal components in the CAARS questionnaire. Conclusion The obtained results can emphasize the fact that this questionnaire is mainly designed for screening purposes, and this estimation does not change the results of groups when a question is missed randomly. Notwithstanding this finding, more considerations should be paid when the missed question is a critical one. PMID:22952502

  11. Evaluation of Estimating Missed Answers in Conners Adult ADHD Rating Scale (Screening Version

    Directory of Open Access Journals (Sweden)

    Vahid Abootalebi

    2010-08-01

    Full Text Available "n Objective: Conners Adult ADHD Rating Scale (CAARS is among the valid questionnaires for evaluating Attention-Deficit/Hyperactivity Disorder in adults. The aim of this paper is to evaluate the validity of the estimation of missed answers in scoring the screening version of the Conners questionnaire, and to extract its principal components. "n Method: This study was performed on 400 participants. Answer estimation was calculated for each question (assuming the answer was missed, and then a Kruskal-Wallis test was performed to evaluate the difference between the original answer and its estimation. In the next step, principal components of the questionnaire were extracted by means of Principal Component Analysis (PCA. Finally the evaluation of differences in the whole groups was provided using the Multiple Comparison Procedure (MCP. Results: Findings indicated that a significant difference existed between the original and estimated answers for some particular questions. However, the results of MCP showed that this estimation, when evaluated in the whole group, did not show a significant difference with the original value in neither of the questionnaire subscales. The results of PCA revealed that there are eight principal components in the CAARS questionnaire. Conclusion: The obtained results can emphasize the fact that this questionnaire is mainly designed for screening purposes, and this estimation does not change the results of groups when a question is missed randomly. Notwithstanding this finding, more considerations should be paid when the missed question is a critical one.

  12. Estimation of caries experience by multiple imputation and direct standardization

    NARCIS (Netherlands)

    Schuller, A. A.; Van Buuren, S.

    2014-01-01

    Valid estimates of caries experience are needed to monitor oral population health. Obtaining such estimates in practice is often complicated by nonresponse and missing data. The goal of this study was to estimate caries experiences in a population of children aged 5 and 11 years, in the presence of

  13. Estimation of Caries Experience by Multiple Imputation and Direct Standardization

    NARCIS (Netherlands)

    Schuller, A. A.; van Buuren, S.

    2014-01-01

    Valid estimates of caries experience are needed to monitor oral population health. Obtaining such estimates in practice is often complicated by nonresponse and missing data. The goal of this study was to estimate caries experiences in a population of children aged 5 and 11 years, in the presence of

  14. Student versus Faculty Perceptions of Missing Class.

    Science.gov (United States)

    Sleigh, Merry J.; Ritzer, Darren R.; Casey, Michael B.

    2002-01-01

    Examines and compares student and faculty attitudes towards students missing classes and class attendance. Surveys undergraduate students (n=231) in lower and upper level psychology courses and psychology faculty. Reports that students found more reasons acceptable for missing classes and that the amount of in-class material on the examinations…

  15. A Hybrid Method for Interpolating Missing Data in Heterogeneous Spatio-Temporal Datasets

    Directory of Open Access Journals (Sweden)

    Min Deng

    2016-02-01

    Full Text Available Space-time interpolation is widely used to estimate missing or unobserved values in a dataset integrating both spatial and temporal records. Although space-time interpolation plays a key role in space-time modeling, existing methods were mainly developed for space-time processes that exhibit stationarity in space and time. It is still challenging to model heterogeneity of space-time data in the interpolation model. To overcome this limitation, in this study, a novel space-time interpolation method considering both spatial and temporal heterogeneity is developed for estimating missing data in space-time datasets. The interpolation operation is first implemented in spatial and temporal dimensions. Heterogeneous covariance functions are constructed to obtain the best linear unbiased estimates in spatial and temporal dimensions. Spatial and temporal correlations are then considered to combine the interpolation results in spatial and temporal dimensions to estimate the missing data. The proposed method is tested on annual average temperature and precipitation data in China (1984–2009. Experimental results show that, for these datasets, the proposed method outperforms three state-of-the-art methods—e.g., spatio-temporal kriging, spatio-temporal inverse distance weighting, and point estimation model of biased hospitals-based area disease estimation methods.

  16. A quantification strategy for missing bone mass in case of osteolytic bone lesions

    International Nuclear Information System (INIS)

    Fränzle, Andrea; Giske, Kristina; Bretschi, Maren; Bäuerle, Tobias; Hillengass, Jens; Bendl, Rolf

    2013-01-01

    Purpose: Most of the patients who died of breast cancer have developed bone metastases. To understand the pathogenesis of bone metastases and to analyze treatment response of different bone remodeling therapies, preclinical animal models are examined. In breast cancer, bone metastases are often bone destructive. To assess treatment response of bone remodeling therapies, the volumes of these lesions have to be determined during the therapy process. The manual delineation of missing structures, especially if large parts are missing, is very time-consuming and not reproducible. Reproducibility is highly important to have comparable results during the therapy process. Therefore, a computerized approach is needed. Also for the preclinical research, a reproducible measurement of the lesions is essential. Here, the authors present an automated segmentation method for the measurement of missing bone mass in a preclinical rat model with bone metastases in the hind leg bones based on 3D CT scans. Methods: The affected bone structure is compared to a healthy model. Since in this preclinical rat trial the metastasis only occurs on the right hind legs, which is assured by using vessel clips, the authors use the left body side as a healthy model. The left femur is segmented with a statistical shape model which is initialised using the automatically segmented medullary cavity. The left tibia and fibula are segmented using volume growing starting at the tibia medullary cavity and stopping at the femur boundary. Masked images of both segmentations are mirrored along the median plane and transferred manually to the position of the affected bone by rigid registration. Affected bone and healthy model are compared based on their gray values. If the gray value of a voxel indicates bone mass in the healthy model and no bone in the affected bone, this voxel is considered to be osteolytic. Results: The lesion segmentations complete the missing bone structures in a reasonable way. The mean

  17. Robust PLS approach for KPI-related prediction and diagnosis against outliers and missing data

    Science.gov (United States)

    Yin, Shen; Wang, Guang; Yang, Xu

    2014-07-01

    In practical industrial applications, the key performance indicator (KPI)-related prediction and diagnosis are quite important for the product quality and economic benefits. To meet these requirements, many advanced prediction and monitoring approaches have been developed which can be classified into model-based or data-driven techniques. Among these approaches, partial least squares (PLS) is one of the most popular data-driven methods due to its simplicity and easy implementation in large-scale industrial process. As PLS is totally based on the measured process data, the characteristics of the process data are critical for the success of PLS. Outliers and missing values are two common characteristics of the measured data which can severely affect the effectiveness of PLS. To ensure the applicability of PLS in practical industrial applications, this paper introduces a robust version of PLS to deal with outliers and missing values, simultaneously. The effectiveness of the proposed method is finally demonstrated by the application results of the KPI-related prediction and diagnosis on an industrial benchmark of Tennessee Eastman process.

  18. The Impact of the Nursing Practice Environment on Missed Nursing Care.

    Science.gov (United States)

    Hessels, Amanda J; Flynn, Linda; Cimiotti, Jeannie P; Cadmus, Edna; Gershon, Robyn R M

    2015-12-01

    Missed nursing care is an emerging problem negatively impacting patient outcomes. There are gaps in our knowledge of factors associated with missed nursing care. The aim of this study was to determine the relationship between the nursing practice environment and missed nursing care in acute care hospitals. This is a secondary analysis of cross sectional data from a survey of over 7.000 nurses from 70 hospitals on workplace and process of care. Ordinary least squares and multiple regression models were constructed to examine the relationship between the nursing practice environment and missed nursing care while controlling for characteristics of nurses and hospitals. Nurses missed delivering a significant amount of necessary patient care (10-27%). Inadequate staffing and inadequate resources were the practice environment factors most strongly associated with missed nursing care events. This multi-site study examined the risk and risk factors associated with missed nursing care. Improvements targeting modifiable risk factors may reduce the risk of missed nursing care.

  19. Incidence and Correlates of Maternal Near Miss in Southeast Iran

    Directory of Open Access Journals (Sweden)

    Tayebeh Naderi

    2015-01-01

    Full Text Available This prospective study aimed to estimate the incidence and associated factors of severe maternal morbidity in southeast Iran. During a 9-month period in 2013, all women referring to eight hospitals for termination of pregnancy as well as women admitted during 42 days after the termination of pregnancy were enrolled into the study. Maternal near miss conditions were defined based on Say et al.’s recommendations. Five hundred and one cases of maternal near miss and 19,908 live births occurred in the study period, yielding a maternal near miss ratio of 25.2 per 1000 live births. This rate was 7.5 and 105 per 1000 in private and tertiary care settings, respectively. The rate of maternal death in near miss cases was 0.40% with a case:fatality ratio of 250 : 1. The most prevalent causes of near miss were severe preeclampsia (27.3%, ectopic pregnancy (18.4%, and abruptio placentae (16.2%. Higher age, higher education, and being primiparous were associated with a higher risk of near miss. Considering the high rate of maternal near miss in referral hospitals, maternal near miss surveillance system should be set up in these hospitals to identify cases of severe maternal morbidity as soon as possible.

  20. Droid X The Missing Manual

    CERN Document Server

    Gralla, Preston

    2011-01-01

    Get the most from your Droid X right away with this entertaining Missing Manual. Veteran tech author Preston Gralla offers a guided tour of every feature, with lots of expert tips and tricks along the way. You'll learn how to use calling and texting features, take and share photos, enjoy streaming music and video, and much more. Packed with full-color illustrations, this engaging book covers everything from getting started to advanced features and troubleshooting. Unleash the power of Motorola's hot new device with Droid X: The Missing Manual. Get organized. Import your contacts and sync wit

  1. Reduction of artefacts due to missing projections using OSEM

    International Nuclear Information System (INIS)

    Hutton, B.F.; Kyme, A.; Choong, K.

    2002-01-01

    Full text: It is well recognised that missing or corrupted projections can result in artefacts. This occasionally occurs due to errors in data transfer from acquisition memory to disk. A possible approach for reducing these artefacts was investigated, Using ordered subsets expectation maximization (OSEM) the iterative reconstruction proceeds by progressively including additional projections until a single iteration is complete. Clinically useful results can be obtained using a small subset size in a single iteration. Stopping prior to the complete iteration so as to avoid inclusion of missing or corrupted data should provide a 'partial' reconstruction with minimal artefacts. To test this hypothesis projections were selectively removed from a complete data set (2, 4, 8, 12 adjacent projections) and reconstructions were performed using both filtered back projection (FBP) and OSEM. To maintain a constant number of sub-iterations in OSEM an equal number of duplicate projections were substituted for the missing projections. Both 180 and 360 degrees reconstructions with missing data were compared with reconstruction for the complete data using sum of absolute differences. Results indicate that missing data causes artefacts for both FBP and OSEM however the severity of artefacts is significantly reduced using OSEM. The effect of missing data is generally greater for 180 degrees acquisition. OSEM is recommended for minimising reconstruction artefacts due to missing projections. Copyright (2002) The Australian and New Zealand Society of Nuclear Medicine Inc

  2. Essays in economics: 1. Pre-committed government spending and partisan politics. 2. Investment in energy efficiency: Do the characteristics of firms matter? 3. Information processing and organizational structure

    Science.gov (United States)

    Watkins, William Edward, Jr.

    1. Spending commitments requiring future outlays are important for understanding partisan politics because they prevent a conservative government from scaling back spending programs. In a one-government-good model, a "stubborn liberal" policy maker can use precommitted spending to prevent a later conservative government from imposing spending cuts. In a model where parties differ about spending priorities, re-election uncertainty creates a bias towards higher government spending and higher taxes. 2. The literature on energy efficiency provides examples of profitable technologies that are not universally adopted. Theory indicates that firms should undertake all investments with a positive net present value, and that the discount rate for computing the present value of a project should be the return available on other projects in the same risk class, not on characteristics of the firm. This model is tested by examining whether firms' characteristics influence their decision to join the Environmental Protection Agency's Green Lights program. A discrete choice regression is estimated over a sample of participating and non-participating firms. Missing values in the data matrix are replaced with multiple imputations using the EM algorithm. The results show that: (1) substantial improvements in the power of hypothesis tests can be achieved through imputation of missing data, and (2) characteristics of firms do affect their decision to join Green Lights. 3. Standard theories of the firm stress profit maximization as the foundation for derivation of predictable behavior. Yet evidence continues to accumulate that firms do not act as required by the neoclassical framework. Instead of being represented by ever more elaborate maximization models, the firm can be modeled simply as a network of information-processing agents. The actions of the firm are then a function only of the network structure and the information-processing capabilities of the agents. This approach can be

  3. Intention-to-treat analysis in the chronic suppurative otitis media trials

    African Journals Online (AJOL)

    There were no attempts in any of the trials to impute for missing responses and carrying out a sensitivity analysis. For trials with a big percentage of protocol deviations, the validity of their results are brought to question. Conclusions: In practice, not all those entered into a randomised-controlled trial will complete the trial.

  4. Pre-validation of the WHO organ dysfunction based criteria for identification of maternal near miss

    Directory of Open Access Journals (Sweden)

    Parpinelli Mary A

    2011-08-01

    Full Text Available Abstract Background To evaluate the performance of the WHO criteria for defining maternal near miss and identifying deaths among cases of severe maternal morbidity (SMM admitted for intensive care. Method Between October 2002 and September 2007, 673 women with SMM were admitted, and among them 18 died. Variables used for the definition of maternal near miss according to WHO criteria and for the SOFA score were retrospectively evaluated. The identification of at least one of the WHO criteria in women who did not die defined the case as a near miss. Organ failure was evaluated through the maximum SOFA score above 2 for each one of the six components of the score, being considered the gold standard for the diagnosis of maternal near miss. The aggregated score (Total Maximum SOFA score was calculated using the worst result of the maximum SOFA score. Sensitivity, specificity, positive and negative predictive values of these WHO criteria for predicting maternal death and also for identifying cases of organ failure were estimated. Results The WHO criteria identified 194 cases of maternal near miss and all the 18 deaths. The most prevalent criteria among cases of maternal deaths were the use of vasoactive drug and the use of mechanical ventilation (≥1 h. For the prediction of maternal deaths, sensitivity was 100% and specificity 70.4%. These criteria identified 119 of the 120 cases of organ failure by the maximum SOFA score (Sensitivity 99.2% among 194 case of maternal near miss (61.34%. There was disagreement in 76 cases, one organ failure without any WHO criteria and 75 cases with no failure but with WHO criteria. The Total Maximum SOFA score had a good performance (area under the curve of 0.897 for prediction of cases of maternal near miss according to the WHO criteria. Conclusions The WHO criteria for maternal near miss showed to be able to identify all cases of death and almost all cases of organ failure. Therefore they allow evaluation of the

  5. Maternal near-miss in a rural hospital in Sudan

    Directory of Open Access Journals (Sweden)

    Adam Gamal K

    2011-06-01

    Full Text Available Abstract Background Investigation of maternal near-miss is a useful complement to the investigation of maternal mortality with the aim of meeting the United Nations' fifth Millennium Development Goal. The present study was conducted to investigate the frequency of near-miss events, to calculate the mortality index for each event and to compare the socio-demographic and obstetrical data (age, parity, gestational age, education and antenatal care of the near-miss cases with maternal deaths. Methods Near-miss cases and events (hemorrhage, infection, hypertensive disorders, anemia and dystocia, maternal deaths and their causes were retrospectively reviewed and the mortality index for each event was calculated in Kassala Hospital, eastern Sudan over a 2-year period, from January 2008 to December 2010. Disease-specific criteria were applied for these events. Results There were 9578 deliveries, 205 near-miss cases, 228 near-miss events and 40 maternal deaths. Maternal near-miss and maternal mortality ratio were 22.1/1000 live births and 432/100 000 live births, respectively. Hemorrhage accounted for the most common event (40.8%, followed by infection (21.5%, hypertensive disorders (18.0%, anemia (11.8% and dystocia (7.9%. The mortality index were 22.2%, 10.0%, 10.0%, 8.8% and 2.4% for infection, dystocia, anemia, hemorrhage and hypertensive disorders, respectively. Conclusion There is a high frequency of maternal morbidity and mortality at the level of this facility. Therefore maternal health policy needs to be concerned not only with averting the loss of life, but also with preventing or ameliorating maternal-near miss events (hemorrhage, infections, hypertension and anemia at all care levels including primary level.

  6. Rethinking Value in the Bio-economy: Finance, Assetization, and the Management of Value.

    Science.gov (United States)

    Birch, Kean

    2017-05-01

    Current debates in science and technology studies emphasize that the bio-economy-or, the articulation of capitalism and biotechnology-is built on notions of commodity production, commodification, and materiality, emphasizing that it is possible to derive value from body parts, molecular and cellular tissues, biological processes, and so on. What is missing from these perspectives, however, is consideration of the political-economic actors, knowledges, and practices involved in the creation and management of value. As part of a rethinking of value in the bio-economy, this article analyzes three key political-economic processes: financialization, capitalization, and assetization. In doing so, it argues that value is managed as part of a series of valuation practices, it is not inherent in biological materialities.

  7. Missing transverse momentum in ATLAS: current and future performance

    CERN Document Server

    Schramm, S; The ATLAS collaboration

    2014-01-01

    During the Run-I data taking period, ATLAS has developed and refined several approaches for measuring missing transverse momentum in proton-proton collisions. Standard calorimeter-based $\\mathrm{E}_\\mathrm{T}^\\mathrm{miss}$ reconstruction techniques have been improved to obtain new levels of precision, while new track-based $\\mathrm{p}_\\mathrm{T}^\\mathrm{miss}$ methods provide for a way to have a second independent measurement of the momentum lost due to particles which do not leave tracks in the inner detectors. While both procedures are individually useful, preliminary studies have shown that combining information from both techniques leads to an improved understanding of missing transverse momentum. Data taking conditions during Run-I varied extensively, especially with respect to the amount of pileup activity present in each event, which provides unique challenges to calorimeter-based $\\mathrm{E}_\\mathrm{T}^\\mathrm{miss}$. Multiple solutions have been demonstrated, including methods which exploit both cal...

  8. Fair Value and the Missing Correspondence between Accounting and Auditing

    DEFF Research Database (Denmark)

    Liempd, Dennis van; Jeppesen, Kim Klarskov

    . The second turning point is from historical cost to fair values, which changes the ontological premises of accounting thought in the direction of ontological subjectivity. This has profound consequences for auditing, which so far has failed to develop new epistemic criteria to deal with fair value accounting’s......The paper outlines the history of accounting and auditing thought, demonstrating how auditing has had an isomorphic relationship with accounting throughout most of its history. We focus this history on two turning points. The first is the turn from reporting the truth to reporting in accordance...... with Generally Accepted Accounting Principles, which occurred in the 1960s as a result of the politicization of accounting standard setting. We argue that auditing adapted to this change by establishing a conceptual framework based on checking the correspondence between assertions and established criteria...

  9. Exploratory factor analysis and reliability analysis with missing data: A simple method for SPSS users

    Directory of Open Access Journals (Sweden)

    Bruce Weaver

    2014-09-01

    Full Text Available Missing data is a frequent problem for researchers conducting exploratory factor analysis (EFA or reliability analysis. The SPSS FACTOR procedure allows users to select listwise deletion, pairwise deletion or mean substitution as a method for dealing with missing data. The shortcomings of these methods are well-known. Graham (2009 argues that a much better way to deal with missing data in this context is to use a matrix of expectation maximization (EM covariances(or correlations as input for the analysis. SPSS users who have the Missing Values Analysis add-on module can obtain vectors ofEM means and standard deviations plus EM correlation and covariance matrices via the MVA procedure. But unfortunately, MVA has no /MATRIX subcommand, and therefore cannot write the EM correlations directly to a matrix dataset of the type needed as input to the FACTOR and RELIABILITY procedures. We describe two macros that (in conjunction with an intervening MVA command carry out the data management steps needed to create two matrix datasets, one containing EM correlations and the other EM covariances. Either of those matrix datasets can then be used asinput to the FACTOR procedure, and the EM correlations can also be used as input to RELIABILITY. We provide an example that illustrates the use of the two macros to generate the matrix datasets and how to use those datasets as input to the FACTOR and RELIABILITY procedures. We hope that this simple method for handling missing data will prove useful to both students andresearchers who are conducting EFA or reliability analysis.

  10. Missing and Spurious Level Corrections for Nuclear Resonances

    International Nuclear Information System (INIS)

    Mitchell, G E; Agvaanluvsan, U; Pato, M P; Shriner, J F

    2005-01-01

    Neutron and proton resonances provide detailed level density information. However, due to experimental limitations, some levels are missed and some are assigned incorrect quantum numbers. The standard method to correct for missing levels uses the experimental widths and the Porter-Thomas distribution. Analysis of the spacing distribution provides an independent determination of the fraction of missing levels. We have derived a general expression for such an imperfect spacing distribution using the maximum entropy principle and applied it to a variety of nuclear resonance data. The problem of spurious levels has not been extensively addressed

  11. The Human Microbiome and the Missing Heritability Problem

    Directory of Open Access Journals (Sweden)

    Santiago Sandoval-Motta

    2017-06-01

    Full Text Available The “missing heritability” problem states that genetic variants in Genome-Wide Association Studies (GWAS cannot completely explain the heritability of complex traits. Traditionally, the heritability of a phenotype is measured through familial studies using twins, siblings and other close relatives, making assumptions on the genetic similarities between them. When this heritability is compared to the one obtained through GWAS for the same traits, a substantial gap between both measurements arise with genome wide studies reporting significantly smaller values. Several mechanisms for this “missing heritability” have been proposed, such as epigenetics, epistasis, and sequencing depth. However, none of them are able to fully account for this gap in heritability. In this paper we provide evidence that suggests that in order for the phenotypic heritability of human traits to be broadly understood and accounted for, the compositional and functional diversity of the human microbiome must be taken into account. This hypothesis is based on several observations: (A The composition of the human microbiome is associated with many important traits, including obesity, cancer, and neurological disorders. (B Our microbiome encodes a second genome with nearly a 100 times more genes than the human genome, and this second genome may act as a rich source of genetic variation and phenotypic plasticity. (C Human genotypes interact with the composition and structure of our microbiome, but cannot by themselves explain microbial variation. (D Microbial genetic composition can be strongly influenced by the host's behavior, its environment or by vertical and horizontal transmissions from other hosts. Therefore, genetic similarities assumed in familial studies may cause overestimations of heritability values. We also propose a method that allows the compositional and functional diversity of our microbiome to be incorporated to genome wide association studies.

  12. Association of the Nurse Work Environment, Collective Efficacy, and Missed Care.

    Science.gov (United States)

    Smith, Jessica G; Morin, Karen H; Wallace, Leigh E; Lake, Eileen T

    2018-06-01

    Missed nursing care is a significant threat to quality patient care. Promoting collective efficacy within nurse work environments could decrease missed care. The purpose was to understand how missed care is associated with nurse work environments and collective efficacy of hospital staff nurses. A cross-sectional, convenience sample was obtained through online surveys from registered nurses working at five southwestern U.S. hospitals. Descriptive, correlational, regression, and path analyses were conducted ( N = 233). The percentage of nurses who reported that at least one care activity was missed frequently or always was 94%. Mouth care (36.0% of nurses) and ambulation (35.3%) were missed frequently or always. Nurse work environments and collective efficacy were moderately, positively correlated. Nurse work environments and collective efficacy were associated with less missed care (χ 2 = 10.714, p = .0054). Fostering collective efficacy in the nurse work environment could reduce missed care and improve patient outcomes.

  13. Sales-marketing interface and company performance. Is information use the missing link?

    OpenAIRE

    Keszey, Tamara

    2013-01-01

    Over the last couple of years there has been an ongoing debate on how sales managers contribute to organizational value. Direct measures between sales-marketing interface quality and company performance are compromised, as company performance is influenced by a plethora of other factors. We advocate that the use of sales information is the missing link between sales-marketing relationship quality and organizational outcomes. We propose and empirically test a model on how sales-mar...

  14. Hierarchical Bayesian modelling of gene expression time series across irregularly sampled replicates and clusters.

    Science.gov (United States)

    Hensman, James; Lawrence, Neil D; Rattray, Magnus

    2013-08-20

    Time course data from microarrays and high-throughput sequencing experiments require simple, computationally efficient and powerful statistical models to extract meaningful biological signal, and for tasks such as data fusion and clustering. Existing methodologies fail to capture either the temporal or replicated nature of the experiments, and often impose constraints on the data collection process, such as regularly spaced samples, or similar sampling schema across replications. We propose hierarchical Gaussian processes as a general model of gene expression time-series, with application to a variety of problems. In particular, we illustrate the method's capacity for missing data imputation, data fusion and clustering.The method can impute data which is missing both systematically and at random: in a hold-out test on real data, performance is significantly better than commonly used imputation methods. The method's ability to model inter- and intra-cluster variance leads to more biologically meaningful clusters. The approach removes the necessity for evenly spaced samples, an advantage illustrated on a developmental Drosophila dataset with irregular replications. The hierarchical Gaussian process model provides an excellent statistical basis for several gene-expression time-series tasks. It has only a few additional parameters over a regular GP, has negligible additional complexity, is easily implemented and can be integrated into several existing algorithms. Our experiments were implemented in python, and are available from the authors' website: http://staffwww.dcs.shef.ac.uk/people/J.Hensman/.

  15. Multinomial Logistic Regression & Bootstrapping for Bayesian Estimation of Vertical Facies Prediction in Heterogeneous Sandstone Reservoirs

    Science.gov (United States)

    Al-Mudhafar, W. J.

    2013-12-01

    Precisely prediction of rock facies leads to adequate reservoir characterization by improving the porosity-permeability relationships to estimate the properties in non-cored intervals. It also helps to accurately identify the spatial facies distribution to perform an accurate reservoir model for optimal future reservoir performance. In this paper, the facies estimation has been done through Multinomial logistic regression (MLR) with respect to the well logs and core data in a well in upper sandstone formation of South Rumaila oil field. The entire independent variables are gamma rays, formation density, water saturation, shale volume, log porosity, core porosity, and core permeability. Firstly, Robust Sequential Imputation Algorithm has been considered to impute the missing data. This algorithm starts from a complete subset of the dataset and estimates sequentially the missing values in an incomplete observation by minimizing the determinant of the covariance of the augmented data matrix. Then, the observation is added to the complete data matrix and the algorithm continues with the next observation with missing values. The MLR has been chosen to estimate the maximum likelihood and minimize the standard error for the nonlinear relationships between facies & core and log data. The MLR is used to predict the probabilities of the different possible facies given each independent variable by constructing a linear predictor function having a set of weights that are linearly combined with the independent variables by using a dot product. Beta distribution of facies has been considered as prior knowledge and the resulted predicted probability (posterior) has been estimated from MLR based on Baye's theorem that represents the relationship between predicted probability (posterior) with the conditional probability and the prior knowledge. To assess the statistical accuracy of the model, the bootstrap should be carried out to estimate extra-sample prediction error by randomly

  16. Motorola Xoom The Missing Manual

    CERN Document Server

    Gralla, Preston

    2011-01-01

    Motorola Xoom is the first tablet to rival the iPad, and no wonder with all of the great features packed into this device. But learning how to use everything can be tricky-and Xoom doesn't come with a printed guide. That's where this Missing Manual comes in. Gadget expert Preston Gralla helps you master your Xoom with step-by-step instructions and clear explanations. As with all Missing Manuals, this book offers refreshing, jargon-free prose and informative illustrations. Use your Xoom as an e-book reader, music player, camcorder, and phoneKeep in touch with email, video and text chat, and so

  17. Measuring the safeguards value of material accountability

    International Nuclear Information System (INIS)

    Sicherman, A.

    1988-01-01

    Material accountability (MA) activities focus on providing after-the-fact indication of diversion or theft of special nuclear material (SNM). MA activities include maintaining records for tracking nuclear material and conducting periodic inventories and audits to ensure that loss has not occurred. This paper presents a value model concept for assessing the safeguards benefits of MA activities and for comparing these benefits to those provided by physical protection (PP) and material control (MC) components. The model considers various benefits of MA, which include: 1) providing information to assist in recovery of missing material, 2) providing assurance that physical protection and material control systems have been working, 3) defeating protracted theft attempts, and 4) properly resolving causes of and responding appropriately to anomalies of missing material and external alarms (e.g., hoax). Such a value model can aid decision-makers in allocating safeguards resources among PP, MC, and MA systems

  18. Missing School Matters

    Science.gov (United States)

    Balfanz, Robert

    2016-01-01

    Results of a survey conducted by the Office for Civil Rights show that 6 million public school students (13%) are not attending school regularly. Chronic absenteeism--defined as missing more than 10% of school for any reason--has been negatively linked to many key academic outcomes. Evidence shows that students who exit chronic absentee status can…

  19. Miss World going deshi

    DEFF Research Database (Denmark)

    Wildermuth, Norbert

      In November 1996 the South Indian metropolis Bangalore hosted the annual Miss World show. The live event and its televisualisation became a prominent symbol for the India's economic liberalisation and for the immanent globalizing dimensions of this development. As such, the highly prestigious......, the pageant's contestation, which gave rise to a series of vehement protests and a broad public debate about the country's cultural alienation, marked a crucial point in time and trend towards the (re)localisation of the Indian television landscape. In consequence, the 1996 Miss World show and its...... their vision and politics of gender, nation and modernity on the larger Indian public, over the last two decades. Engaging the Indian population increasingly by way of the new electronic, c&s distributed media, competing discourses of gender and sexuality were projected, basically as a necessary, effective...

  20. Inadequate environment, resources and values lead to missed nursing care: A focused ethnographic study on the surgical ward using the Fundamentals of Care framework.

    Science.gov (United States)

    Jangland, Eva; Teodorsson, Therese; Molander, Karin; Muntlin Athlin, Åsa

    2018-06-01

    To explore the delivery of care from the perspective of patients with acute abdominal pain focusing on the contextual factors at system level using the Fundamentals of Care framework. The Fundamentals of Care framework describes several contextual and systemic factors that can impact the delivery of care. To deliver high-quality, person-centred care, it is important to understand how these factors affect patients' experiences and care needs. A focused ethnographic approach. A total of 20 observations were performed on two surgical wards at a Swedish university hospital. Data were collected using participant observation and informal interviews and analysed using deductive content analysis. The findings, presented in four categories, reflect the value patients place on the caring relationship and a friendly atmosphere on the ward. Patients had concerns about the environment, particularly the high-tempo culture on the ward and its impact on their integrity, rest and sleep, access to information and planning, and need for support in addressing their existential thoughts. The observers also noted that missed nursing care had serious consequences for patient safety. Patients with acute abdominal pain were cared for in the high-tempo culture of a surgical ward with limited resources, unclear leadership and challenges to patients' safety. The findings highlight the crucial importance of prioritising and valuing the patients' fundamental care needs for recovery. Nursing leaders and nurses need to take the lead to reconceptualise the value of fundamental care in the acute care setting. To improve clinical practice, the value of fundamentals of care must be addressed regardless of patient's clinical condition. Providing a caring relationship is paramount to ensure a positive impact on patient's well-being and recovery. © 2017 John Wiley & Sons Ltd.

  1. Topology-Based Estimation of Missing Smart Meter Readings

    Directory of Open Access Journals (Sweden)

    Daisuke Kodaira

    2018-01-01

    Full Text Available Smart meters often fail to measure or transmit the data they record when measuring energy consumption, known as meter readings, owing to faulty measuring equipment or unreliable communication modules. Existing studies do not address successive and non-periodical missing meter readings. This paper proposes a method whereby missing readings observed at a node are estimated by using circuit theory principles that leverage the voltage and current data from adjacent nodes. A case study is used to demonstrate the ability of the proposed method to successfully estimate the missing readings over an entire day during which outages and unpredictable perturbations occurred.

  2. Missed nursing care and its relationship with confidence in delegation among hospital nurses.

    Science.gov (United States)

    Saqer, Tahani J; AbuAlRub, Raeda F

    2018-04-06

    To (i) identify the types and reasons for missed nursing care among Jordanian hospital nurses; (ii) identify predictors of missed nursing care based on study variables; and (iii) examine the relationship between nurses' confidence in delegation and missed nursing care. Missed nursing care is a global concern for nurses and nurse administrators. Investigating the relation between the confidence in delegation and missed nursing care might help in designing strategies that enable nurses to minimise missed care and enhance quality of services. A correlational research design was used for this study. A convenience sample of 362 hospital nurses completed the missed nursing care survey, and confidence and intent to delegate scale. The results of the study revealed that ambulating and feeding patients on time, doing mouth care and attending interdisciplinary care conferences were the most frequent types of missed care. The mean score for missed nursing care was (2.78) on a scale from 1-5. The most prevalent reasons for missed care were "labour resources, followed by material resources, and then communication". Around 45% of the variation in the perceived level of "missed nursing care" was explained by background variables and perceived reasons for missed nursing. However, the relationship between confidence in delegation and missed care was insignificant. The results of this study add to the body of international literature on most prevalent types and reasons for missed nursing care in a different cultural context. Highlighting most prevalent reasons for missed nursing care could help nurse administrators in designing responsive strategies to eliminate or reduces such reasons. © 2018 John Wiley & Sons Ltd.

  3. MISSE PEACE Polymers Atomic Oxygen Erosion Results

    Science.gov (United States)

    deGroh, Kim, K.; Banks, Bruce A.; McCarthy, Catherine E.; Rucker, Rochelle N.; Roberts, Lily M.; Berger, Lauren A.

    2006-01-01

    Forty-one different polymer samples, collectively called the Polymer Erosion and Contamination Experiment (PEACE) Polymers, have been exposed to the low Earth orbit (LEO) environment on the exterior of the International Space Station (ISS) for nearly 4 years as part of Materials International Space Station Experiment 2 (MISSE 2). The objective of the PEACE Polymers experiment was to determine the atomic oxygen erosion yield of a wide variety of polymeric materials after long term exposure to the space environment. The polymers range from those commonly used for spacecraft applications, such as Teflon (DuPont) FEP, to more recently developed polymers, such as high temperature polyimide PMR (polymerization of monomer reactants). Additional polymers were included to explore erosion yield dependence upon chemical composition. The MISSE PEACE Polymers experiment was flown in MISSE Passive Experiment Carrier 2 (PEC 2), tray 1, on the exterior of the ISS Quest Airlock and was exposed to atomic oxygen along with solar and charged particle radiation. MISSE 2 was successfully retrieved during a space walk on July 30, 2005, during Discovery s STS-114 Return to Flight mission. Details on the specific polymers flown, flight sample fabrication, pre-flight and post-flight characterization techniques, and atomic oxygen fluence calculations are discussed along with a summary of the atomic oxygen erosion yield results. The MISSE 2 PEACE Polymers experiment is unique because it has the widest variety of polymers flown in LEO for a long duration and provides extremely valuable erosion yield data for spacecraft design purposes.

  4. Missed retinal breaks in rhegmatogenous retinal detachment

    Directory of Open Access Journals (Sweden)

    Brijesh Takkar

    2016-12-01

    Full Text Available AIM: To evaluate the causes and associations of missed retinal breaks (MRBs and posterior vitreous detachment (PVD in patients with rhegmatogenous retinal detachment (RRD. METHODS: Case sheets of patients undergoing vitreo retinal surgery for RRD at a tertiary eye care centre were evaluated retrospectively. Out of the 378 records screened, 253 were included for analysis of MRBs and 191 patients were included for analysis of PVD, depending on the inclusion criteria. Features of RRD and retinal breaks noted on examination were compared to the status of MRBs and PVD detected during surgery for possible associations. RESULTS: Overall, 27% patients had MRBs. Retinal holes were commonly missed in patients with lattice degeneration while missed retinal tears were associated with presence of complete PVD. Patients operated for cataract surgery were significantly associated with MRBs (P=0.033 with the odds of missing a retinal break being 1.91 as compared to patients with natural lens. Advanced proliferative vitreo retinopathy (PVR and retinal bullae were the most common reasons for missing a retinal break during examination. PVD was present in 52% of the cases and was wrongly assessed in 16%. Retinal bullae, pseudophakia/aphakia, myopia, and horse shoe retinal tears were strongly associated with presence of PVD. Traumatic RRDs were rarely associated with PVD. CONCLUSION: Pseudophakic patients, and patients with retinal bullae or advanced PVR should be carefully screened for MRBs. Though Weiss ring is a good indicator of PVD, it may still be over diagnosed in some cases. PVD is associated with retinal bullae and pseudophakia, and inversely with traumatic RRD.

  5. Miss Julie: A Psychoanalytic Study

    Directory of Open Access Journals (Sweden)

    Sonali Jain

    2015-10-01

    Full Text Available Sigmund Freud theorized that ‘the hero of the tragedy must suffer…to bear the burden of tragic guilt…(that lay in rebellion against some divine or human authority.’ August Strindberg, the Swedish poet, playwright, author and visual artist, like Shakespeare before him, portrayed insanity as the ultimate of tragic conflict. In this paper I seek to explore and reiterate the dynamics of human relationships that are as relevant today as they were in Strindberg’s time. I propose to examine Strindberg’s Miss Julie, a play set in nineteenth century Sweden, through a psychoanalytic lens. The play deals with bold themes of class and sexual identity politics. Notwithstanding the progress made in breaking down gender barriers, the inequalities inherent in a patriarchal system persist in modern society. Miss Julie highlights these imbalances. My analysis of the play deals with issues of culture and psyche, and draws on Freud, Melanie Klein, Lacan, Luce Irigaray and other contemporary feminists. Miss Julie is a discourse on hysteria, which is still pivotal to psychoanalysis. Prominent philosophers like Hegel and the psychoanalyst Jacques Lacan have written about the dialectic of the master and the slave – a relationship that is characterized by dependence, demand and cruelty. The history of human civilization shows beyond any doubt that there is an intimate connection between cruelty and the sexual instinct. An analysis of the text is carried out using the sado-masochistic dynamic as well the slave-master discourse. I argue that Miss Julie subverts the slave-master relationship. The struggle for dominance and power is closely linked with the theme of sexuality in the unconscious. To quote the English actor and director Alan Rickman, ‘Watching or working on the plays of Strindberg is like seeing the skin, flesh and bones of life separated from each other. Challenging and timeless.’

  6. Materials International Space Station Experiment (MISSE): Overview, Accomplishments and Future Needs

    Science.gov (United States)

    deGroh, Kim K.; Jaworske, Donald A.; Pippin, Gary; Jenkins, Philip P.; Walters, Robert J.; Thibeault, Sheila A.; Palusinski, Iwona; Lorentzen, Justin R.

    2014-01-01

    8, yielding long-duration space environmental performance and durability data that enable material validation, processing recertification and space qualification; improved predictions of materials and component lifetimes in space; model verification and development; and correlation factors between space-exposure and ground-facilities enabling more accurate in-space performance predictions based on ground-laboratory testing. A few of the many experiment results and observations, and their impacts, are provided. Those highlighted include examples on improved understanding of atomic oxygen scattering mechanisms, LEO coating durability results, and polymer erosion yields and their impacts on spacecraft design. The MISSE 2 Atomic Oxygen Scattering Chamber Experiment discovered that the peak flux of scattered AO was determined to be 45 deg from normal incidence, not the model predicted cosine dependence. In addition, the erosion yield (E(sub y)) of Kapton H for AO scattered off oxidized-Al is 22% of the E(sub y) of direct AO impingement. These results were used to help determine the degradation mechanism of a cesium iodide detector within the Hubble Space Telescope Cosmic Origins Spectrograph Experiment. The MISSE 6 Indium Tin Oxide (ITO) Degradation Experiment measured surface electrical resistance of ram and wake ITO coated samples. The data confirmed that ITO is a stable AO protective coating, and the results validated the durability of ITO conductive coatings for solar arrays for the Atmosphere-Space Transition 2 Explorer program. The MISSE 2, 6 and 7 Polymer Experiments have provided LEO AO Ey data on over 120 polymer and composites samples. The flight E(sub y) values were found to range from 3.05 x 10(exp -26) cu cm/atom for the AO resistant polymer CORIN to 9.14 x 10(exp -26) cu cm/atom for polyoxymethylene (POM). In addition, flying the same polymers on different missions has advanced the understanding of the AO E(sub y) dependency on solar exposure for polymers

  7. Missed Lung Cancers on the Scout View: Do We Look Every Time?

    Directory of Open Access Journals (Sweden)

    Sarfraz Ahmed Nazir

    2013-01-01

    Full Text Available Scout views are digital radiographs obtained to aid planning of the subsequent computed tomography (CT examination. Review of these scout views may provide additional information not demonstrated on the axial images, but such reviews may not necessarily be performed routinely, especially in the context of abdominopelvic CT studies. We illustrate the value of the scout images by presenting a series of representative cases of missed pulmonary neoplasms in five patients who originally underwent such examinations.

  8. The Missing Entrepreneurs 2014

    DEFF Research Database (Denmark)

    Halabisky, David; Potter, Jonathan; Thompson, Stuart

    OECD's LEED Programme and the European Commission's DG on Employment, Social Affairs and Inclusion recently published the second book as part of their programme of work on inclusive entrepreneurship. The Missing Entrepreneurs 2014 examines how public policies at national and local levels can...

  9. 9 CFR 2.128 - Inspection for missing animals.

    Science.gov (United States)

    2010-01-01

    ... 9 Animals and Animal Products 1 2010-01-01 2010-01-01 false Inspection for missing animals. 2.128 Section 2.128 Animals and Animal Products ANIMAL AND PLANT HEALTH INSPECTION SERVICE, DEPARTMENT OF AGRICULTURE ANIMAL WELFARE REGULATIONS Miscellaneous § 2.128 Inspection for missing animals. Each dealer...

  10. Ultrasonographic findings of hydatidiform mole and missed abortion

    International Nuclear Information System (INIS)

    Yun, Kwang Myeong; Lee, Yeong Hwan; Chung, Hye Kyeong; Chung, Duck Soo; Kim, Ok Dong

    1990-01-01

    To establish the sonographic characteristics of the hydatidiform mole and the missed abortion with placental degeneration, we have retrospectively analyzed 12 cases of complete mole, 10 cases of partial mole, and 10 cases of missed abortion with placental hydropic degeneration, collected at Taegu Catholic General Hospital, from January 1986 to December 1989. The results were as follows : 1. Of 12 cases of complete mole, all demonstrated diffuse intrauterine vesicular pattern of internal echo without a gestational sac. Two cases were recurred after D and E. 2. The partial mole was characterized by focal (70%) or diffuse (20%) distribution of hydatidiform placental change and a gestational sac (100%) with or without a macerated fetus. But the striking hydatidiform placental change was not present in one cases of partial mole. 3. The uterus was larger for dates in 9 cases (90%) of complete mole, but smaller for dates in 7 cases (70%) of partial mole. 4. The missed abortion with placental hydropic degeneration was indistinguished from a partial mole due to their similar sonographic appearance : focal or diffuse cystic change of a placenta, a distorted gestational sac with or without a fetus, and a smaller uterus for dates. On conclusion, the complete mole could be easily distinguished from a partial mole or a missed abortion by sonography : a gestational sac or an area of noncystic placenta was not identified in a complete mole. The partial mole was indistinguished from a missed abortion, but if there is the suspicion of trophoblastic proliferation, such as a convex placental surface or a larger uterus for dates, then the diagnosis is probably a partial mole rather than a missed abortion

  11. Little Evidence That Time in Child Care Causes Externalizing Problems During Early Childhood in Norway

    Science.gov (United States)

    Zachrisson, Henrik Daae; Dearing, Eric; Lekhal, Ratib; Toppelberg, Claudio O.

    2012-01-01

    Associations between maternal reports of hours in child care and children’s externalizing problems at 18 and 36 months of age were examined in a population-based Norwegian sample (n = 75,271). Within a sociopolitical context of homogenously high-quality child care, there was little evidence that high quantity of care causes externalizing problems. Using conventional approaches to handling selection bias and listwise deletion for substantial attrition in this sample, more hours in care predicted higher problem levels, yet with small effect sizes. The finding, however, was not robust to using multiple imputation for missing values. Moreover, when sibling and individual fixed-effects models for handling selection bias were used, no relation between hours and problems was evident. PMID:23311645

  12. Factors influencing the missed nursing care in patients from a private hospital

    Directory of Open Access Journals (Sweden)

    Raúl Hernández-Cruz

    Full Text Available ABSTRACT Objective: to determine the factors that influence the missed nursing care in hospitalized patients. Methods: descriptive correlational study developed at a private hospital in Mexico. To identify the missed nursing care and related factors, the MISSCARE survey was used, which measures the care missed and associated factors. The care missed and the factors were grouped in global and dimension rates. For the analysis, descriptive statistics, Spearman’s correlation and simple linear regression were used. Approval for the study was obtained from the ethics committee. Results: the participants were 71 nurses from emergency, intensive care and inpatient services. The global missed care index corresponded to M=7.45 (SD=10.74; the highest missed care index was found in the dimension basic care interventions (M=13.02, SD=17.60. The main factor contributing to the care missed was human resources (M=56.13, SD=21.38. The factors related to the care missed were human resources (rs=0.408, p<0.001 and communication (rs=0.418, p<0.001. Conclusions: the nursing care missed is mainly due to the human resource factor; these study findings will permit the strengthening of nursing care continuity.

  13. Combination of individual tree detection and area-based approach in imputation of forest variables using airborne laser data

    Science.gov (United States)

    Vastaranta, Mikko; Kankare, Ville; Holopainen, Markus; Yu, Xiaowei; Hyyppä, Juha; Hyyppä, Hannu

    2012-01-01

    The two main approaches to deriving forest variables from laser-scanning data are the statistical area-based approach (ABA) and individual tree detection (ITD). With ITD it is feasible to acquire single tree information, as in field measurements. Here, ITD was used for measuring training data for the ABA. In addition to automatic ITD (ITD auto), we tested a combination of ITD auto and visual interpretation (ITD visual). ITD visual had two stages: in the first, ITD auto was carried out and in the second, the results of the ITD auto were visually corrected by interpreting three-dimensional laser point clouds. The field data comprised 509 circular plots ( r = 10 m) that were divided equally for testing and training. ITD-derived forest variables were used for training the ABA and the accuracies of the k-most similar neighbor ( k-MSN) imputations were evaluated and compared with the ABA trained with traditional measurements. The root-mean-squared error (RMSE) in the mean volume was 24.8%, 25.9%, and 27.2% with the ABA trained with field measurements, ITD auto, and ITD visual, respectively. When ITD methods were applied in acquiring training data, the mean volume, basal area, and basal area-weighted mean diameter were underestimated in the ABA by 2.7-9.2%. This project constituted a pilot study for using ITD measurements as training data for the ABA. Further studies are needed to reduce the bias and to determine the accuracy obtained in imputation of species-specific variables. The method could be applied in areas with sparse road networks or when the costs of fieldwork must be minimized.

  14. The effect of tertiary surveys on missed injuries in trauma: a systematic review

    Directory of Open Access Journals (Sweden)

    Keijzers Gerben B

    2012-11-01

    Full Text Available Abstract Background Trauma tertiary surveys (TTS are advocated to reduce the rate of missed injuries in hospitalized trauma patients. Moreover, the missed injury rate can be a quality indicator of trauma care performance. Current variation of the definition of missed injury restricts interpretation of the effect of the TTS and limits the use of missed injury for benchmarking. Only a few studies have specifically assessed the effect of the TTS on missed injury. We aimed to systematically appraise these studies using outcomes of two common definitions of missed injury rates and long-term health outcomes. Methods A systematic review was performed. An electronic search (without language or publication restrictions of the Cochrane Library, Medline and Ovid was used to identify studies assessing TTS with short-term measures of missed injuries and long-term health outcomes. ‘Missed injury’ was defined as either: Type I any injury missed at primary and secondary survey and detected by the TTS; or Type II any injury missed at primary and secondary survey and missed by the TTS, detected during hospital stay. Two authors independently selected studies. Risk of bias for observational studies was assessed using the Newcastle-Ottawa scale. Results Ten observational studies met our inclusion criteria. None was randomized and none reported long-term health outcomes. Their risk of bias varied considerably. Nine studies assessed Type I missed injury and found an overall rate of 4.3%. A single study reported Type II missed injury with a rate of 1.5%. Three studies reported outcome data on missed injuries for both control and intervention cohorts, with two reporting an increase in Type I missed injuries (3% vs. 7%, PP=0.01. Conclusions Overall Type I and Type II missed injury rates were 4.3% and 1.5%. Routine TTS performance increased Type I and reduced Type II missed injuries. However, evidence is sub-optimal: few observational studies, non-uniform outcome

  15. Health insurance theory: the case of the missing welfare gain.

    Science.gov (United States)

    Nyman, John A

    2008-11-01

    An important source of value is missing from the conventional welfare analysis of moral hazard, namely, the effect of income transfers (from those who purchase insurance and remain healthy to those who become ill) on purchases of medical care. Income transfers are contained within the price reduction that is associated with standard health insurance. However, in contrast to the income effects contained within an exogenous price decrease, these income transfers act to shift out the demand for medical care. As a result, the consumer's willingness to pay for medical care increases and the resulting additional consumption is welfare increasing.

  16. Prevalence and Correlates of Missing Meals Among High School Students-United States, 2010.

    Science.gov (United States)

    Demissie, Zewditu; Eaton, Danice K; Lowry, Richard; Nihiser, Allison J; Foltz, Jennifer L

    2018-01-01

    To determine the prevalence and correlates of missing meals among adolescents. The 2010 National Youth Physical Activity and Nutrition Study, a cross-sectional study. School based. A nationally representative sample of 11 429 high school students. Breakfast, lunch, and dinner consumption; demographics; measured and perceived weight status; physical activity and sedentary behaviors; and fruit, vegetable, milk, sugar-sweetened beverage, and fast-food intake. Prevalence estimates for missing breakfast, lunch, or dinner on ≥1 day during the past 7 days were calculated. Associations between demographics and missing meals were tested. Associations of lifestyle and dietary behaviors with missing meals were examined using logistic regression controlling for sex, race/ethnicity, and grade. In 2010, 63.1% of students missed breakfast, 38.2% missed lunch, and 23.3% missed dinner; the prevalence was highest among female and non-Hispanic black students. Being overweight/obese, perceiving oneself to be overweight, and video game/computer use were associated with increased risk of missing meals. Physical activity behaviors were associated with reduced risk of missing meals. Students who missed breakfast were less likely to eat fruits and vegetables and more likely to consume sugar-sweetened beverages and fast food. Breakfast was the most frequently missed meal, and missing breakfast was associated with the greatest number of less healthy dietary practices. Intervention and education efforts might prioritize breakfast consumption.

  17. 78 FR 2256 - Extension of the Extended Missing Parts Pilot Program

    Science.gov (United States)

    2013-01-10

    ...] Extension of the Extended Missing Parts Pilot Program AGENCY: United States Patent and Trademark Office... pilot program (Extended Missing Parts Pilot Program) in which an applicant, under certain conditions... nonprovisional application. The Extended Missing Parts Pilot Program benefits applicants by permitting additional...

  18. Missed Opportunities for Hepatitis A Vaccination, National Immunization Survey-Child, 2013.

    Science.gov (United States)

    Casillas, Shannon M; Bednarczyk, Robert A

    2017-08-01

    To quantify the number of missed opportunities for vaccination with hepatitis A vaccine in children and assess the association of missed opportunities for hepatitis A vaccination with covariates of interest. Weighted data from the 2013 National Immunization Survey of US children aged 19-35 months were used. Analysis was restricted to children with provider-verified vaccination history (n = 13 460). Missed opportunities for vaccination were quantified by determining the number of medical visits a child made when another vaccine was administered during eligibility for hepatitis A vaccine, but hepatitis A vaccine was not administered. Cross-sectional bivariate and multivariate polytomous logistic regression were used to assess the association of missed opportunities for vaccination with child and maternal demographic, socioeconomic, and geographic covariates. In 2013, 85% of children in our study population had initiated the hepatitis A vaccine series, and 60% received 2 or more doses. Children who received zero doses of hepatitis A vaccine had an average of 1.77 missed opportunities for vaccination compared with 0.43 missed opportunities for vaccination in those receiving 2 doses. Children with 2 or more missed opportunities for vaccination initiated the vaccine series 6 months later than children without missed opportunities. In the fully adjusted multivariate model, children who were younger, had ever received WIC benefits, or lived in a state with childcare entry mandates were at a reduced odds for 2 or more missed opportunities for vaccination; children living in the Northeast census region were at an increased odds. Missed opportunities for vaccination likely contribute to the poor coverage for hepatitis A vaccination in children; it is important to understand why children are not receiving the vaccine when eligible. Copyright © 2017 Elsevier Inc. All rights reserved.

  19. Semiparametric approach for non-monotone missing covariates in a parametric regression model

    KAUST Repository

    Sinha, Samiran

    2014-02-26

    Missing covariate data often arise in biomedical studies, and analysis of such data that ignores subjects with incomplete information may lead to inefficient and possibly biased estimates. A great deal of attention has been paid to handling a single missing covariate or a monotone pattern of missing data when the missingness mechanism is missing at random. In this article, we propose a semiparametric method for handling non-monotone patterns of missing data. The proposed method relies on the assumption that the missingness mechanism of a variable does not depend on the missing variable itself but may depend on the other missing variables. This mechanism is somewhat less general than the completely non-ignorable mechanism but is sometimes more flexible than the missing at random mechanism where the missingness mechansim is allowed to depend only on the completely observed variables. The proposed approach is robust to misspecification of the distribution of the missing covariates, and the proposed mechanism helps to nullify (or reduce) the problems due to non-identifiability that result from the non-ignorable missingness mechanism. The asymptotic properties of the proposed estimator are derived. Finite sample performance is assessed through simulation studies. Finally, for the purpose of illustration we analyze an endometrial cancer dataset and a hip fracture dataset.

  20. Empirical Likelihood in Nonignorable Covariate-Missing Data Problems.

    Science.gov (United States)

    Xie, Yanmei; Zhang, Biao

    2017-04-20

    Missing covariate data occurs often in regression analysis, which frequently arises in the health and social sciences as well as in survey sampling. We study methods for the analysis of a nonignorable covariate-missing data problem in an assumed conditional mean function when some covariates are completely observed but other covariates are missing for some subjects. We adopt the semiparametric perspective of Bartlett et al. (Improving upon the efficiency of complete case analysis when covariates are MNAR. Biostatistics 2014;15:719-30) on regression analyses with nonignorable missing covariates, in which they have introduced the use of two working models, the working probability model of missingness and the working conditional score model. In this paper, we study an empirical likelihood approach to nonignorable covariate-missing data problems with the objective of effectively utilizing the two working models in the analysis of covariate-missing data. We propose a unified approach to constructing a system of unbiased estimating equations, where there are more equations than unknown parameters of interest. One useful feature of these unbiased estimating equations is that they naturally incorporate the incomplete data into the data analysis, making it possible to seek efficient estimation of the parameter of interest even when the working regression function is not specified to be the optimal regression function. We apply the general methodology of empirical likelihood to optimally combine these unbiased estimating equations. We propose three maximum empirical likelihood estimators of the underlying regression parameters and compare their efficiencies with other existing competitors. We present a simulation study to compare the finite-sample performance of various methods with respect to bias, efficiency, and robustness to model misspecification. The proposed empirical likelihood method is also illustrated by an analysis of a data set from the US National Health and