Although prospective logistic regression is the standard method of analysis for case-control data, it has recently been noted that in genetic epidemiologic studies one can use the "retrospective" likelihood to gain major power by incorporating various population genetics model assumptions such as Hardy-Weinberg equilibrium (HWE), gene-gene independence and gene-environment independence. In this article we review these modern methods and contrast them with the more classical approaches through two types of applications: (i) association tests for typed and untyped single nucleotide polymorphisms (SNPs) and (ii) estimation of haplotype effects and haplotype-environment interactions in the presence of haplotype-phase ambiguity. We provide novel insights into existing methods by constructing various score tests and pseudo-likelihoods. In addition, we describe a novel two-stage method for analysis of untyped SNPs that can use any flexible external algorithm for genotype imputation followed by a powerful association test based on the retrospective likelihood. We illustrate applications of the methods using simulated and real data. © Institute of Mathematical Statistics, 2009.
Background: Meta-analysis (MA) is widely used to pool genome-wide association studies (GWASes) in order to (a) increase the power to detect strong or weak genotype effects or (b) serve as a result verification method. As a consequence of differing SNP panels among genotyping chips, imputation is the method of choice within GWAS consortia to avoid losing too many SNPs in an MA. YAMAS (Yet Another Meta Analysis Software), however, enables cross-GWAS conclusions prior to finished and polished imputation runs, which can be time-consuming. Results: Here we present a fast method to avoid forfeiting SNPs present in only a subset of studies, without relying on imputation. This is accomplished by using reference linkage disequilibrium data from the 1,000 Genomes/HapMap projects to find proxy-SNPs together with in-phase alleles for SNPs missing in at least one study. MA is conducted by combining association effect estimates of a SNP and those of its proxy-SNPs. Our algorithm is implemented in the MA software YAMAS. Association results from GWAS analysis applications can be used as input files for MA, tremendously speeding up MA compared to the conventional imputation approach. We show that our proxy algorithm is well-powered and yields valuable ad hoc results, possibly providing an incentive for follow-up studies. We propose our method as a quick screening step prior to imputation-based MA, as well as an additional main approach for studies without available reference data matching the ethnicities of study participants. As a proof of principle, we analyzed six dbGaP Type II Diabetes GWAS and found that the proxy algorithm clearly outperforms naïve MA on the p-value level: for 17 out of 23 we observe an improvement on the p-value level by a factor of more than two, and a maximum improvement by a factor of 2127. Conclusions: YAMAS is an efficient and fast meta-analysis program which offers various methods, including conventional MA as well as inserting proxy-SNPs
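The abstract above describes MA as combining per-study association effect estimates. A common way to do this is an inverse-variance fixed-effect combination. The abstract does not state which weighting YAMAS applies, so the following is only a minimal sketch under that assumption; the function name `fixed_effect_meta` and the example numbers are hypothetical.

```python
import math


def fixed_effect_meta(betas, ses):
    """Inverse-variance fixed-effect pooling of per-study effect estimates.

    betas: per-study effect estimates (e.g. log odds ratios) for one SNP,
           or for its proxy-SNP where the SNP itself is missing (assumption).
    ses:   the matching standard errors.
    Returns (pooled beta, pooled standard error, two-sided p-value).
    """
    weights = [1.0 / se ** 2 for se in ses]          # precision of each study
    total_weight = sum(weights)
    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    z = pooled_beta / pooled_se
    # Two-sided p-value under a standard normal reference distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return pooled_beta, pooled_se, p_value


# Hypothetical effect estimates from three studies for the same variant.
beta, se, p = fixed_effect_meta([0.20, 0.40, 0.30], [0.10, 0.10, 0.20])
print(beta, se, p)
```

With equal standard errors the pooled estimate reduces to the simple mean of the study estimates, which is a quick sanity check on any implementation.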
He, Jun; Xu, Jiaqi; Wu, Xiao-Lin; Bauck, Stewart; Lee, Jungjae; Morota, Gota; Kachman, Stephen D; Spangler, Matthew L
SNP chips are commonly used for genotyping animals in genomic selection but strategies for selecting low-density (LD) SNPs for imputation-mediated genomic selection have not been addressed adequately. The main purpose of the present study was to compare the performance of eight LD (6K) SNP panels, each selected by a different strategy exploiting a combination of three major factors: evenly-spaced SNPs, increased minor allele frequencies, and SNP-trait associations either for single traits independently or for all the three traits jointly. The imputation accuracies from 6K to 80K SNP genotypes were between 96.2 and 98.2%. Genomic prediction accuracies obtained using imputed 80K genotypes were between 0.817 and 0.821 for daughter pregnancy rate, between 0.838 and 0.844 for fat yield, and between 0.850 and 0.863 for milk yield. The two SNP panels optimized on the three major factors had the highest genomic prediction accuracy (0.821-0.863), and these accuracies were very close to those obtained using observed 80K genotypes (0.825-0.868). Further exploration of the underlying relationships showed that genomic prediction accuracies did not respond linearly to imputation accuracies, but were significantly affected by genotype (imputation) errors of SNPs in association with the traits to be predicted. SNPs optimal for map coverage and MAF were favorable for obtaining accurate imputation of genotypes whereas trait-associated SNPs improved genomic prediction accuracies. Thus, optimal LD SNP panels were the ones that combined both strengths. The present results have practical implications on the design of LD SNP chips for imputation-enabled genomic prediction.
Dana B Hancock
Genotype imputation, used in genome-wide association studies to expand coverage of single nucleotide polymorphisms (SNPs), has performed poorly in African Americans compared to less admixed populations. Overall, imputation has typically relied on HapMap reference haplotype panels from Africans (YRI), European Americans (CEU), and Asians (CHB/JPT). The 1000 Genomes project offers a wider range of reference populations, such as African Americans (ASW), but their imputation performance has had limited evaluation. Using 595 African Americans genotyped on Illumina's HumanHap550v3 BeadChip, we compared imputation results from four software programs (IMPUTE2, BEAGLE, MaCH, and MaCH-Admix) and three reference panels consisting of different combinations of 1000 Genomes populations (February 2012 release): (1) 3 specifically selected populations (YRI, CEU, and ASW); (2) 8 populations of diverse African (AFR) or European (EUR) descent; and (3) all 14 available populations (ALL). Based on chromosome 22, we calculated three performance metrics: (1) concordance (percentage of masked genotyped SNPs with imputed and true genotype agreement); (2) imputation quality score (IQS; concordance adjusted for chance agreement, which is particularly informative for low minor allele frequency [MAF] SNPs); and (3) average r2hat (estimated correlation between the imputed and true genotypes), for all imputed SNPs. Across the reference panels, IMPUTE2 and MaCH had the highest concordance (91%-93%), but IMPUTE2 had the highest IQS (81%-83%) and average r2hat (0.68 using YRI+ASW+CEU, 0.62 using AFR+EUR, and 0.55 using ALL). Imputation quality for most programs was reduced by the addition of more distantly related reference populations, due entirely to the introduction of low frequency SNPs (MAF≤2%) that are monomorphic in the more closely related panels. While imputation was optimized by using IMPUTE2 with reference to the ALL panel (average r2hat = 0.86 for SNPs with MAF>2%), use of the ALL
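To make the difference between concordance and a chance-adjusted score concrete, here is a small sketch. The abstract defines IQS only as "concordance adjusted for chance agreement"; the kappa-style formula below, (observed - chance) / (1 - chance), is an assumed form of that adjustment, and the function names and toy genotypes are hypothetical.

```python
from collections import Counter


def concordance(true_genotypes, imputed_genotypes):
    """Fraction of masked genotypes whose imputed call equals the true call."""
    matches = sum(t == i for t, i in zip(true_genotypes, imputed_genotypes))
    return matches / len(true_genotypes)


def chance_adjusted_score(true_genotypes, imputed_genotypes):
    """Kappa-style adjustment of concordance for chance agreement (assumed form)."""
    n = len(true_genotypes)
    observed = concordance(true_genotypes, imputed_genotypes)
    true_counts = Counter(true_genotypes)
    imputed_counts = Counter(imputed_genotypes)
    # Agreement expected if calls were independent with the same marginals.
    chance = sum(true_counts[g] * imputed_counts[g] for g in true_counts) / n ** 2
    if chance == 1.0:
        return float("nan")  # undefined when both call sets are monomorphic
    return (observed - chance) / (1.0 - chance)


# Toy low-MAF SNP: genotypes coded as minor-allele counts (0, 1, 2).
true_calls = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
imputed_calls = [0] * 10  # imputation always calls the major homozygote

print(concordance(true_calls, imputed_calls))
print(chance_adjusted_score(true_calls, imputed_calls))
```

In this toy case the imputation never finds a minor allele, yet raw concordance still looks high because most genotypes are the major homozygote. The chance-adjusted score drops to zero, which illustrates why such a score is more informative for low-MAF SNPs.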
Background: Genotype imputation is an important process for predicting unknown genotypes; it uses a reference population with dense genotypes to predict missing genotypes, for both human and animal genetic variations, at low cost. Machine learning methods, especially boosting methods, have been used in genetic studies to explore the underlying genetic profile of disease and to build models capable of predicting missing values of a marker. Methods: In this study, strategies and factors affecting the imputation accuracy of parent-offspring trios were compared for imputation from a lower-density (5 K) SNP panel to a high-density (10 K) SNP panel using three different boosting methods, namely TotalBoost (TB), LogitBoost (LB) and AdaBoost (AB). The methods were applied to simulated data to impute the untyped SNPs in parent-offspring trios. Four datasets were simulated: G1 (100 trios with 5 K SNPs), G2 (100 trios with 10 K SNPs), G3 (500 trios with 5 K SNPs), and G4 (500 trios with 10 K SNPs). In all four datasets, all parents were genotyped completely, and offspring were genotyped with a lower-density panel. Results: Comparison of the three methods showed that LB outperformed AB and TB in imputation accuracy. Computation time differed between methods; AB was the fastest algorithm. Higher SNP densities increased imputation accuracy. A larger number of trios (i.e., 500) improved the performance of LB and TB. Conclusions: All three methods perform well in terms of imputation accuracy, and the dense chip is recommended for imputation of parent-offspring trios.
Romaniuk, Helena; Patton, George C; Carlin, John B
Multiple imputation has entered mainstream practice for the analysis of incomplete data. We have used it extensively in a large Australian longitudinal cohort study, the Victorian Adolescent Health Cohort Study (1992-2008). Although we have endeavored to follow best practices, there is little published advice on this, and we have not previously examined the extent to which variations in our approach might lead to different results. Here, we examined sensitivity of analytical results to imputation decisions, investigating choice of imputation method, inclusion of auxiliary variables, omission of cases with excessive missing data, and approaches for imputing highly skewed continuous distributions that are analyzed as dichotomous variables. Overall, we found that decisions made about imputation approach had a discernible but rarely dramatic impact for some types of estimates. For model-based estimates of association, the choice of imputation method and decisions made to build the imputation model had little effect on results, whereas estimates of overall prevalence and prevalence stratified by subgroup were more sensitive to imputation method and settings. Multiple imputation by chained equations gave more plausible results than multivariate normal imputation for prevalence estimates but appeared to be more susceptible to numerical instability related to a highly skewed variable. © The Author 2014. Published by Oxford University Press on behalf of the Johns Hopkins Bloomberg School of Public Health.
Background: Although high-throughput genotyping arrays have made whole-genome association studies (WGAS) feasible, only a small proportion of SNPs in the human genome are actually surveyed in such studies. In addition, various SNP arrays assay different sets of SNPs, which leads to challenges in comparing results and merging data for meta-analyses. Genome-wide imputation of untyped markers allows us to address these issues in a direct fashion. Methods: 384 Caucasian American liver donors were genotyped using Illumina 650Y (Ilmn650Y) arrays, from which we also derived genotypes from the Ilmn317K array. On these data, we compared two imputation methods: MACH and BEAGLE. We imputed 2.5 million HapMap Release22 SNPs, and conducted GWAS on ~40,000 liver mRNA expression traits (eQTL analysis). In addition, 200 Caucasian American and 200 African American subjects were genotyped using the Affymetrix 500 K array plus a custom 164 K fill-in chip. We then imputed the HapMap SNPs and quantified the accuracy by randomly masking observed SNPs. Results: MACH and BEAGLE perform similarly with respect to imputation accuracy. The Ilmn650Y results in excellent imputation performance, and it outperforms Affx500K or Ilmn317K sets. For Caucasian Americans, 90% of the HapMap SNPs were imputed at 98% accuracy. As expected, imputation of poorly tagged SNPs (untyped SNPs in weak LD with typed markers) was not as successful. It was more challenging to impute genotypes in the African American population, given (1) shorter LD blocks and (2) admixture with Caucasian populations in this population. To address issue (2), we pooled HapMap CEU and YRI data as an imputation reference set, which greatly improved overall performance. The approximate 40,000 phenotypes scored in these populations provide a path to determine empirically how the power to detect associations is affected by the imputation procedures. That is, at a fixed false discovery rate, the number of cis
Cattram D. Nguyen
Background: Multiple imputation has become very popular as a general-purpose method for handling missing data. The validity of multiple-imputation-based analyses relies on the use of an appropriate model to impute the missing values. Despite the widespread use of multiple imputation, there are few guidelines available for checking imputation models. Analysis: In this paper, we provide an overview of currently available methods for checking imputation models. These include graphical checks and numerical summaries, as well as simulation-based methods such as posterior predictive checking. These model checking techniques are illustrated using an analysis affected by missing data from the Longitudinal Study of Australian Children. Conclusions: As multiple imputation becomes further established as a standard approach for handling missing data, it will become increasingly important that researchers employ appropriate model checking approaches to ensure that reliable results are obtained when using this method.
Nguyen, Cattram D; Carlin, John B; Lee, Katherine J
Multiple imputation has become very popular as a general-purpose method for handling missing data. The validity of multiple-imputation-based analyses relies on the use of an appropriate model to impute the missing values. Despite the widespread use of multiple imputation, there are few guidelines available for checking imputation models. In this paper, we provide an overview of currently available methods for checking imputation models. These include graphical checks and numerical summaries, as well as simulation-based methods such as posterior predictive checking. These model checking techniques are illustrated using an analysis affected by missing data from the Longitudinal Study of Australian Children. As multiple imputation becomes further established as a standard approach for handling missing data, it will become increasingly important that researchers employ appropriate model checking approaches to ensure that reliable results are obtained when using this method.
Peter K Joshi
The analysis of less common variants in genome-wide association studies promises to elucidate complex trait genetics but is hampered by low power to reliably detect association. We show that addition of population-specific exome sequence data to global reference data allows more accurate imputation, particularly of less common SNPs (minor allele frequency 1-10%) in two very different European populations. The imputation improvement corresponds to an increase in effective sample size of 28-38%, for SNPs with a minor allele frequency in the range 1-3%.
Marti, Helena; Chavance, Michel
The usual methods for analyzing case-cohort studies rely on sometimes not fully efficient weighted estimators. Multiple imputation might be a good alternative because it uses all the data available and approximates the maximum partial likelihood estimator. This method is based on the generation of several plausible complete data sets, taking into account uncertainty about missing values. When the imputation model is correctly defined, the multiple imputation estimator is asymptotically unbiased and its variance is correctly estimated. We show that a correct imputation model must be estimated from the fully observed data (cases and controls), using the case status among the explanatory variables. To validate the approach, we analyzed case-cohort studies first with completely simulated data and then with case-cohort data sampled from two real cohorts. The analyses of simulated data showed that, when the imputation model was correct, the multiple imputation estimator was unbiased and efficient. The observed gain in precision ranged from 8 to 37 per cent for phase-1 variables and from 5 to 19 per cent for the phase-2 variable. When the imputation model was misspecified, the multiple imputation estimator was still more efficient than the weighted estimators but it was also slightly biased. The analyses of case-cohort data sampled from complete cohorts showed that even when no strong predictor of the phase-2 variable was available, the multiple imputation estimator was unbiased, as precise as the weighted estimator for the phase-2 variable and slightly more precise than the weighted estimators for the phase-1 variables. However, the multiple imputation estimator was found to be biased when, because of interaction terms, some coefficients of the imputation model had to be estimated from small samples. Multiple imputation is an efficient technique for analyzing case-cohort data. Practically, we suggest building the analysis model using only the case-cohort data and weighted
Xiang, Tao; Christensen, Ole Fredslund; Legarra, Andres
Genotype imputation is commonly used as an initial step of genomic selection. Studies on humans, plants and ruminants suggested many factors would affect the performance of imputation. However, studies rarely investigated pigs, especially crossbred pigs. In this study, different scenarios of imputation from 5K SNPs to 7K SNPs on Danish Landrace, Yorkshire, and crossbred Landrace-Yorkshire were compared. In conclusion, genotype imputation on crossbreds performs equally well as in purebreds, when parental breeds are used as the reference panel. When the size of reference is considerably large... SNPs. This dataset will be analyzed for genomic selection in a future study...
Walani, Salimah R; Cleland, Charles M
To illustrate with the example of a secondary data analysis study the use of the multiple imputation method to replace missing data. Most large public datasets have missing data, which need to be handled by researchers conducting secondary data analysis studies. Multiple imputation is a technique widely used to replace missing values while preserving the sample size and sampling variability of the data. The 2004 National Sample Survey of Registered Nurses. The authors created a model to impute missing values using the chained equation method. They used imputation diagnostics procedures and conducted regression analysis of imputed data to determine the differences between the log hourly wages of internationally educated and US-educated registered nurses. The authors used multiple imputation procedures to replace missing values in a large dataset with 29,059 observations. Five multiple imputed datasets were created. Imputation diagnostics using time series and density plots showed that imputation was successful. The authors also present an example of the use of multiple imputed datasets to conduct regression analysis to answer a substantive research question. Multiple imputation is a powerful technique for imputing missing values in large datasets while preserving the sample size and variance of the data. Even though the chained equation method involves complex statistical computations, recent innovations in software and computation have made it possible for researchers to conduct this technique on large datasets. The authors recommend nurse researchers use multiple imputation methods for handling missing data to improve the statistical power and external validity of their studies.
Kontopantelis, Evangelos; White, Ian R; Sperrin, Matthew; Buchan, Iain
Multiple imputation is frequently used to deal with missing data in healthcare research. Although it is known that the outcome should be included in the imputation model when imputing missing covariate values, it is not known whether it should be imputed. Similarly no clear recommendations exist on: the utility of incorporating a secondary outcome, if available, in the imputation model; the level of protection offered when data are missing not-at-random; the implications of the dataset size and missingness levels. We used realistic assumptions to generate thousands of datasets across a broad spectrum of contexts: three mechanisms of missingness (completely at random; at random; not at random); varying extents of missingness (20-80% missing data); and different sample sizes (1,000 or 10,000 cases). For each context we quantified the performance of a complete case analysis and seven multiple imputation methods which deleted cases with missing outcome before imputation, after imputation or not at all; included or did not include the outcome in the imputation models; and included or did not include a secondary outcome in the imputation models. Methods were compared on mean absolute error, bias, coverage and power over 1,000 datasets for each scenario. Overall, there was very little to separate multiple imputation methods which included the outcome in the imputation model. Even when missingness was quite extensive, all multiple imputation approaches performed well. Incorporating a secondary outcome, moderately correlated with the outcome of interest, made very little difference. The dataset size and the extent of missingness affected performance, as expected. Multiple imputation methods protected less well against missingness not at random, but did offer some protection. As long as the outcome is included in the imputation model, there are very small performance differences between the possible multiple imputation approaches: no outcome imputation, imputation or
We introduce a new framework for the analysis of association studies, designed to allow untyped variants to be more effectively and directly tested for association with a phenotype. The idea is to combine knowledge on patterns of correlation among SNPs (e.g., from the International HapMap project or resequencing data in a candidate region of interest) with genotype data at tag SNPs collected on a phenotyped study sample, to estimate ("impute") unmeasured genotypes, and then assess association between the phenotype and these estimated genotypes. Compared with standard single-SNP tests, this approach results in increased power to detect association, even in cases in which the causal variant is typed, with the greatest gain occurring when multiple causal variants are present. It also provides more interpretable explanations for observed associations, including assessing, for each SNP, the strength of the evidence that it (rather than another correlated SNP) is causal. Although we focus on association studies with quantitative phenotype and a relatively restricted region (e.g., a candidate gene), the framework is applicable and computationally practical for whole genome association studies. Methods described here are implemented in a software package, Bim-Bam, available from the Stephens Lab website http://stephenslab.uchicago.edu/software.html.
Jung, Jinhyouk; Harel, Ofer; Kang, Sangwook
In this paper, we consider fitting semiparametric additive hazards models for case-cohort studies using a multiple imputation approach. In a case-cohort study, main exposure variables are measured only on some selected subjects, but other covariates are often available for the whole cohort. We consider this as a special case of a missing covariate by design. We propose to employ a popular incomplete data method, multiple imputation, for estimation of the regression parameters in additive hazards models. For imputation models, an imputation modeling procedure based on a rejection sampling is developed. A simple imputation modeling that can naturally be applied to a general missing-at-random situation is also considered and compared with the rejection sampling method via extensive simulation studies. In addition, a misspecification aspect in imputation modeling is investigated. The proposed procedures are illustrated using a cancer data example. Copyright © 2015 John Wiley & Sons, Ltd.
Pereira, Vania; Tomas Mas, Carmen; Amorim, António
The importance of X-chromosome markers in individual identifications, population genetics, forensics and kinship testing is getting wide recognition. In this work, we studied the distributions of 25 X-chromosome single nucleotide polymorphisms (X-SNPs) in population samples from Northern, Central...
Quartagno, M; Carpenter, J R
Recently, multiple imputation has been proposed as a tool for individual patient data meta-analysis with sporadically missing observations, and it has been suggested that within-study imputation is usually preferable. However, such within study imputation cannot handle variables that are completely missing within studies. Further, if some of the contributing studies are relatively small, it may be appropriate to share information across studies when imputing. In this paper, we develop and evaluate a joint modelling approach to multiple imputation of individual patient data in meta-analysis, with an across-study probability distribution for the study specific covariance matrices. This retains the flexibility to allow for between-study heterogeneity when imputing while allowing (i) sharing information on the covariance matrix across studies when this is appropriate, and (ii) imputing variables that are wholly missing from studies. Simulation results show both equivalent performance to the within-study imputation approach where this is valid, and good results in more general, practically relevant, scenarios with studies of very different sizes, non-negligible between-study heterogeneity and wholly missing variables. We illustrate our approach using data from an individual patient data meta-analysis of hypertension trials. © 2015 The Authors. Statistics in Medicine Published by John Wiley & Sons Ltd.
Clive J Hoggart
Testing one SNP at a time does not fully realise the potential of genome-wide association studies to identify multiple causal variants, which is a plausible scenario for many complex diseases. We show that simultaneous analysis of the entire set of SNPs from a genome-wide study to identify the subset that best predicts disease outcome is now feasible, thanks to developments in stochastic search methods. We used a Bayesian-inspired penalised maximum likelihood approach in which every SNP can be considered for additive, dominant, and recessive contributions to disease risk. Posterior mode estimates were obtained for regression coefficients that were each assigned a prior with a sharp mode at zero. A non-zero coefficient estimate was interpreted as corresponding to a significant SNP. We investigated two prior distributions and show that the normal-exponential-gamma prior leads to improved SNP selection in comparison with single-SNP tests. We also derived an explicit approximation for type-I error that avoids the need to use permutation procedures. As well as genome-wide analyses, our method is well-suited to fine mapping with very dense SNP sets obtained from re-sequencing and/or imputation. It can accommodate quantitative as well as case-control phenotypes, covariate adjustment, and can be extended to search for interactions. Here, we demonstrate the power and empirical type-I error of our approach using simulated case-control data sets of up to 500 K SNPs, a real genome-wide data set of 300 K SNPs, and a sequence-based dataset, each of which can be analysed in a few hours on a desktop workstation.
Lee, Katherine J; Carlin, John B
Multiple imputation is becoming increasingly popular for handling missing data. However, it is often implemented without adequate consideration of whether it offers any advantage over complete case analysis for the research question of interest, or whether potential gains may be offset by bias from a poorly fitting imputation model, particularly as the amount of missing data increases. Simulated datasets (n = 1000) drawn from a synthetic population were used to explore information recovery from multiple imputation in estimating the coefficient of a binary exposure variable when various proportions of data (10-90%) were set missing at random in a highly-skewed continuous covariate or in the binary exposure. Imputation was performed using multivariate normal imputation (MVNI), with a simple or zero-skewness log transformation to manage non-normality. Bias, precision, mean-squared error and coverage for a set of regression parameter estimates were compared between multiple imputation and complete case analyses. For missingness in the continuous covariate, multiple imputation produced less bias and greater precision for the effect of the binary exposure variable, compared with complete case analysis, with larger gains in precision with more missing data. However, even with only moderate missingness, large bias and substantial under-coverage were apparent in estimating the continuous covariate's effect when skewness was not adequately addressed. For missingness in the binary covariate, all estimates had negligible bias but gains in precision from multiple imputation were minimal, particularly for the coefficient of the binary exposure. Although multiple imputation can be useful if covariates required for confounding adjustment are missing, benefits are likely to be minimal when data are missing in the exposure variable of interest. Furthermore, when there are large amounts of missingness, multiple imputation can become unreliable and introduce bias not present in a
Nagy, Reka; Boutin, Thibaud S; Marten, Jonathan; Huffman, Jennifer E; Kerr, Shona M; Campbell, Archie; Evenden, Louise; Gibson, Jude; Amador, Carmen; Howard, David M; Navarro, Pau; Morris, Andrew; Deary, Ian J; Hocking, Lynne J; Padmanabhan, Sandosh; Smith, Blair H; Joshi, Peter; Wilson, James F; Hastie, Nicholas D; Wright, Alan F; McIntosh, Andrew M; Porteous, David J; Haley, Chris S; Vitart, Veronique; Hayward, Caroline
The Generation Scotland: Scottish Family Health Study (GS:SFHS) is a family-based population cohort with DNA, biological samples, socio-demographic, psychological and clinical data from approximately 24,000 adult volunteers across Scotland. Although data collection was cross-sectional, GS:SFHS became a prospective cohort due to the ability to link to routine Electronic Health Record (EHR) data. Over 20,000 participants were selected for genotyping using a large genome-wide array. GS:SFHS was analysed using genome-wide association studies (GWAS) to test the effects of a large spectrum of variants, imputed using the Haplotype Reference Consortium (HRC) dataset, on medically relevant traits measured directly or obtained from EHRs. The HRC dataset is the largest available haplotype reference panel for imputation of variants in populations of European ancestry and allows investigation of variants with low minor allele frequencies within the entire GS:SFHS genotyped cohort. Genome-wide associations were run on 20,032 individuals using both genotyped and HRC imputed data. We present results for a range of well-studied quantitative traits obtained from clinic visits and for serum urate measures obtained from data linkage to EHRs collected by the Scottish National Health Service. Results replicated known associations and additionally revealed novel findings, mainly with rare variants, validating the use of the HRC imputation panel. For example, we identified two new associations with fasting glucose at variants near to Y_RNA and WDR4 and four new associations with heart rate at SNPs within CSMD1 and ASPH, upstream of HTR1F and between PROKR2 and GPCPD1. All were driven by rare variants (minor allele frequencies in the range of 0.08-1%). Proof of principle for use of EHRs was verification of the highly significant association of urate levels with the well-established urate transporter SLC2A9. GS:SFHS provides genetic data on over 20,000 participants alongside a range of
Eekhout, Iris; de Vet, Henrica Cw; de Boer, Michiel R; Twisk, Jos Wr; Heymans, Martijn W
Previous studies showed that missing data in multi-item scales can best be handled by multiple imputation of item scores. However, when many scales are used, the number of items becomes too large for the imputation model to estimate imputations reliably. A solution is to use passive imputation or parcel summary scores, which combine items and consequently reduce the number of variables in the imputation model. The performance of these methods was evaluated in a simulation study and illustrated in an example. Passive imputation, which updates scale scores from imputed items, and parcel summary scores, which use the average over available item scores, were compared to using all items simultaneously, imputing total scores of scales, and complete-case analysis. Scale scores and coefficient estimates from linear regression were compared to "true" parameters on bias and precision. Passive imputation and parcel summaries showed smaller bias and more precision than imputing total scores and complete-case analysis. Passive imputation and parcel summary scores are valid missing data solutions in studies that include many multi-item scales.
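As a rough illustration of the parcel summary idea described in this abstract (the average over the item scores that are available for a respondent), the following minimal sketch may help. The function name `parcel_summary` and the use of `None` to mark a missing item are assumptions made for illustration; they are not taken from the paper.

```python
from typing import Optional, Sequence


def parcel_summary(item_scores: Sequence[Optional[float]]) -> Optional[float]:
    """Average the observed item scores of one scale for one respondent.

    Missing items are marked with None (an assumption of this sketch).
    Returns None when every item is missing, so the scale score itself
    stays missing and can still be imputed later.
    """
    observed = [score for score in item_scores if score is not None]
    if not observed:
        return None
    return sum(observed) / len(observed)


# Hypothetical respondent: a three-item scale with the second item missing.
print(parcel_summary([3, None, 5]))
```

The summary score, rather than every individual item, would then enter the imputation model as an auxiliary variable, which is what keeps the number of variables in that model manageable.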
Ensor, J; Deeks, J J; Martin, E C; Riley, R D
For tests reporting continuous results, primary studies usually provide test performance at multiple but often different thresholds. This creates missing data when performing a meta-analysis at each threshold. A standard meta-analysis (no imputation [NI]) ignores such missing data. A single imputation (SI) approach was recently proposed to recover missing threshold results. Here, we propose a new method that performs multiple imputation of the missing threshold results using discrete combinations (MIDC). The new MIDC method imputes missing threshold results by randomly selecting from the set of all possible discrete combinations which lie between the results for 2 known bounding thresholds. Imputed and observed results are then synthesised at each threshold. This is repeated multiple times, and the multiple pooled results at each threshold are combined using Rubin's rules to give final estimates. We compared the NI, SI, and MIDC approaches via simulation. Both imputation methods outperform the NI method in simulations. There was generally little difference in the SI and MIDC methods, but the latter was noticeably better in terms of estimating the between-study variances and generally gave better coverage, due to slightly larger standard errors of pooled estimates. Given selective reporting of thresholds, the imputation methods also reduced bias in the summary receiver operating characteristic curve. Simulations demonstrate the imputation methods rely on an equal threshold spacing assumption. A real example is presented. The SI and, in particular, MIDC methods can be used to examine the impact of missing threshold results in meta-analysis of test accuracy studies. © 2017 The Authors. Research Synthesis Methods published by John Wiley & Sons Ltd.
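The final pooling step mentioned in this abstract, Rubin's rules, can be sketched as follows. This is a generic illustration of the standard rules for a scalar estimate, not the authors' MIDC implementation; the function name `pool_rubin` and the example numbers are hypothetical.

```python
from statistics import mean, variance
from typing import Sequence, Tuple


def pool_rubin(estimates: Sequence[float],
               variances: Sequence[float]) -> Tuple[float, float]:
    """Combine m imputation-specific estimates with Rubin's rules.

    estimates: point estimate from each imputed dataset
    variances: squared standard error from each imputed dataset
    Returns (pooled estimate, total variance).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two estimates with matching variances")
    pooled = mean(estimates)          # average of the point estimates
    within = mean(variances)          # average within-imputation variance
    between = variance(estimates)     # between-imputation variance (m - 1 denominator)
    total = within + (1 + 1 / m) * between
    return pooled, total


# Hypothetical pooled results from three imputed datasets at one threshold.
estimate, total_var = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
print(estimate, total_var)
```

The between-imputation term is what inflates the standard error relative to a single imputation, which is consistent with the abstract's remark that MIDC gave slightly larger standard errors and better coverage than SI.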
Crameri, Aureliano; von Wyl, Agnes; Koemeda, Margit; Schulthess, Peter; Tschuschke, Volker
The importance of preventing and treating incomplete data in effectiveness studies is nowadays emphasized. However, most of the publications focus on randomized clinical trials (RCT). One flexible technique for statistical inference with missing data is multiple imputation (MI). Since methods such as MI rely on the assumption of missing data being at random (MAR), a sensitivity analysis for testing the robustness against departures from this assumption is required. In this paper we present a sensitivity analysis technique based on posterior predictive checking, which takes into consideration the concept of clinical significance used in the evaluation of intra-individual changes. We demonstrate the possibilities this technique can offer with the example of irregular longitudinal data collected with the Outcome Questionnaire-45 (OQ-45) and the Helping Alliance Questionnaire (HAQ) in a sample of 260 outpatients. The sensitivity analysis can be used to (1) quantify the degree of bias introduced by missing not at random data (MNAR) in a worst reasonable case scenario, (2) compare the performance of different analysis methods for dealing with missing data, or (3) detect the influence of possible violations to the model assumptions (e.g., lack of normality). Moreover, our analysis showed that ratings from the patient's and therapist's version of the HAQ could significantly improve the predictive value of the routine outcome monitoring based on the OQ-45. Since analysis dropouts always occur, repeated measurements with the OQ-45 and the HAQ analyzed with MI are useful to improve the accuracy of outcome estimates in quality assurance assessments and non-randomized effectiveness studies in the field of outpatient psychotherapy.
Meseck, Kristin; Jankowska, Marta M.; Schipperijn, Jasper; Natarajan, Loki; Godbole, Suneeta; Carlson, Jordan; Takemoto, Michelle; Crist, Katie; Kerr, Jacqueline
The main purpose of the present study was to assess the impact of global positioning system (GPS) signal lapse on physical activity analyses, discover any existing associations between missing GPS data and environmental and demographics attributes, and to determine whether imputation is an accurate and viable method for correcting GPS data loss. Accelerometer and GPS data of 782 participants from 8 studies were pooled to represent a range of lifestyles and interactions with the built environment. Periods of GPS signal lapse were identified and extracted. Generalised linear mixed models were run with the number of lapses and the length of lapses as outcomes. The signal lapses were imputed using a simple ruleset, and imputation was validated against person-worn camera imagery. A final generalised linear mixed model was used to identify the difference between the amount of GPS minutes pre- and post-imputation for the activity categories of sedentary, light, and moderate-to-vigorous physical activity. Over 17% of the dataset was comprised of GPS data lapses. No strong associations were found between increasing lapse length and number of lapses and the demographic and built environment variables. A significant difference was found between the pre- and post-imputation minutes for each activity category. No demographic or environmental bias was found for length or number of lapses, but imputation of GPS data may make a significant difference for inclusion of physical activity data that occurred during a lapse. Imputing GPS data lapses is a viable technique for returning spatial context to accelerometer data and improving the completeness of the dataset. PMID:27245796
Erler, Nicole S; Rizopoulos, Dimitris; Rosmalen, Joost van; Jaddoe, Vincent W V; Franco, Oscar H; Lesaffre, Emmanuel M E H
Incomplete data are generally a challenge to the analysis of most large studies. The current gold standard to account for missing data is multiple imputation, and more specifically multiple imputation with chained equations (MICE). Numerous studies have been conducted to illustrate the performance of MICE for missing covariate data. The results show that the method works well in various situations. However, less is known about its performance in more complex models, specifically when the outcome is multivariate as in longitudinal studies. In current practice, the multivariate nature of the longitudinal outcome is often neglected in the imputation procedure, or only the baseline outcome is used to impute missing covariates. In this work, we evaluate the performance of MICE using different strategies to include a longitudinal outcome into the imputation models and compare it with a fully Bayesian approach that jointly imputes missing values and estimates the parameters of the longitudinal model. Results from simulation and a real data example show that MICE requires the analyst to correctly specify which components of the longitudinal process need to be included in the imputation models in order to obtain unbiased results. The full Bayesian approach, on the other hand, does not require the analyst to explicitly specify how the longitudinal outcome enters the imputation models. It performed well under different scenarios. Copyright © 2016 John Wiley & Sons, Ltd.
Goode, Ellen L; Fridley, Brooke L; Vierkant, Robert A
, CDK4, RB1, CDKN2D, and CCNE1) and one gene region (CDKN2A-CDKN2B). Because of the semi-overlapping nature of the 123 assayed tagging SNPs, we performed multiple imputation based on fastPHASE using data from White non-Hispanic study participants and participants in the international HapMap Consortium...... and National Institute of Environmental Health Sciences SNPs Program. Logistic regression assuming a log-additive model was done on combined and imputed data. We observed strengthened signals in imputation-based analyses at several SNPs, particularly CDKN2A-CDKN2B rs3731239; CCND1 rs602652, rs3212879, rs649392...
Background: Multiple Imputation as usually implemented assumes that data are Missing At Random (MAR), meaning that the underlying missing data mechanism, given the observed data, is independent of the unobserved data. To explore the sensitivity of the inferences to departures from the MAR assumption, we applied the method proposed by Carpenter et al. (2007). This approach aims to approximate inferences under a Missing Not At Random (MNAR) mechanism by reweighting estimates obtained after multiple imputation, where the weights depend on the assumed degree of departure from the MAR assumption. Methods: The method is illustrated with epidemiological data from a surveillance system of hepatitis C virus (HCV) infection in France during the 2001–2007 period. The subpopulation studied included 4343 HCV infected patients who reported drug use. Risk factors for severe liver disease were assessed. After performing complete-case and multiple imputation analyses, we applied the sensitivity analysis to 3 risk factors of severe liver disease: past excessive alcohol consumption, HIV co-infection and infection with HCV genotype 3. Results: In these data, the association between severe liver disease and HIV was underestimated if, given the observed data, the chance of observing HIV status is high when this is positive. Inferences for the two other risk factors were robust to plausible local departures from the MAR assumption. Conclusions: We have demonstrated the practical utility of, and advocate, a pragmatic, widely applicable approach to exploring plausible departures from the MAR assumption post multiple imputation. We have developed guidelines for applying this approach to epidemiological studies. PMID:22681630
Weng, Z; Zhang, Z; Zhang, Q; Fu, W; He, S; Ding, X
Imputation of high-density genotypes from low- or medium-density platforms is a promising way to enhance the efficiency of whole-genome selection programs at low cost. In this study, we compared the efficiency of three widely used imputation algorithms (fastPHASE, BEAGLE and findhap) using Chinese Holstein cattle with Illumina BovineSNP50 genotypes. A total of 2108 cattle were randomly divided into a reference population and a test population to evaluate the influence of the reference population size. Three bovine chromosomes, BTA1, 16 and 28, were used to represent large, medium and small chromosome size, respectively. We simulated different scenarios by randomly masking 20%, 40%, 80% and 95% single-nucleotide polymorphisms (SNPs) on each chromosome in the test population to mimic different SNP density panels. Illumina Bovine3K and Illumina BovineLD (6909 SNPs) information was also used. We found that the three methods showed comparable accuracy when the proportion of masked SNPs was low. However, the difference became larger when more SNPs were masked. BEAGLE performed the best and was most robust with imputation accuracies >90% in almost all situations. fastPHASE was affected by the proportion of masked SNPs, especially when the masked SNP rate was high. findhap ran the fastest, whereas its accuracies were lower than those of BEAGLE but higher than those of fastPHASE. In addition, enlarging the reference population improved the imputation accuracy for BEAGLE and findhap, but did not affect fastPHASE. Considering imputation accuracy and computational requirements, BEAGLE has been found to be more reliable for imputing genotypes from low- to high-density genotyping platforms.
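The masking design described in this abstract (hide a fixed proportion of SNPs in the test population, impute them, and compare against the known genotypes) can be sketched as below. The abstract does not state the exact accuracy measure, so the concordance rate used here is an assumption, as are the helper names `mask_snps` and `concordance`; the imputation step itself (fastPHASE, BEAGLE or findhap in the study) is deliberately left out and replaced by a hand-made "imputed" vector.

```python
import random
from typing import List, Sequence


def mask_snps(n_snps: int, proportion: float, seed: int = 0) -> List[int]:
    """Pick the indices of SNPs to hide, mimicking a lower-density panel."""
    rng = random.Random(seed)
    n_masked = round(n_snps * proportion)
    return sorted(rng.sample(range(n_snps), n_masked))


def concordance(true_genotypes: Sequence[int],
                imputed_genotypes: Sequence[int],
                masked: Sequence[int]) -> float:
    """Share of masked SNPs whose imputed genotype equals the true one."""
    if not masked:
        raise ValueError("no masked SNPs to evaluate")
    hits = sum(true_genotypes[i] == imputed_genotypes[i] for i in masked)
    return hits / len(masked)


# Hypothetical animal with 10 SNPs coded 0/1/2; 40% of the SNPs are masked.
truth = [0, 1, 2, 0, 1, 2, 0, 1, 2, 0]
masked = mask_snps(len(truth), 0.4)

# Stand-in for an imputation result: correct everywhere except one masked SNP.
imputed = list(truth)
imputed[masked[0]] = (truth[masked[0]] + 1) % 3

print(concordance(truth, imputed, masked))
```

Repeating this over several masking proportions and reference population sizes would give the kind of comparison the abstract reports, with accuracy evaluated only at the SNPs that were actually hidden.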
Qin, Longjuan; Liu, Yuyong; Wang, Ya; Wu, Guiju; Chen, Jie; Ye, Weiyuan; Yang, Jiancai; Huang, Qingyang
Genome-wide association studies (GWASs) have revealed many SNPs and genes associated with osteoporosis. However, influence of these SNPs and genes on the predisposition to osteoporosis is not fully understood. We aimed to identify osteoporosis GWASs-associated SNPs potentially influencing the binding affinity of transcription factors and miRNAs, and reveal enrichment signaling pathway and "hub" genes of osteoporosis GWAS-associated genes. We conducted multiple computational analyses to explore function and mechanisms of osteoporosis GWAS-associated SNPs and genes, including SNP conservation analysis and functional annotation (influence of SNPs on transcription factors and miRNA binding), gene ontology analysis, pathway analysis and protein-protein interaction analysis. Our results suggested that a number of SNPs potentially influence the binding affinity of transcription factors (NFATC2, MEF2C, SOX9, RUNX2, ESR2, FOXA1 and STAT3) and miRNAs. Osteoporosis GWASs-associated genes showed enrichment of Wnt signaling pathway, basal cell carcinoma and Hedgehog signaling pathway. Highly interconnected "hub" genes revealed by interaction network analysis are RUNX2, SP7, TNFRSF11B, LRP5, DKK1, ESR1 and SOST. Our results provided the targets for further experimental assessment and further insight on osteoporosis pathophysiology.
Kim, Kwangwoo; Bang, So-Young; Lee, Hye-Soon; Bae, Sang-Cheol
Genetic variations of human leukocyte antigen (HLA) genes within the major histocompatibility complex (MHC) locus are strongly associated with disease susceptibility and prognosis for many diseases, including many autoimmune diseases. In this study, we developed a Korean HLA reference panel for imputing classical alleles and amino acid residues of several HLA genes. An HLA reference panel has potential for use in identifying and fine-mapping disease associations with the MHC locus in East Asian populations, including Koreans. A total of 413 unrelated Korean subjects were analyzed for single nucleotide polymorphisms (SNPs) at the MHC locus and six HLA genes, including HLA-A, -B, -C, -DRB1, -DPB1, and -DQB1. The HLA reference panel was constructed by phasing the 5,858 MHC SNPs, 233 classical HLA alleles, and 1,387 amino acid residue markers from 1,025 amino acid positions as binary variables. The imputation accuracy of the HLA reference panel was assessed by measuring concordance rates between imputed and genotyped alleles of the HLA genes from a subset of the study subjects and East Asian HapMap individuals. Average concordance rates were 95.6% and 91.1% at 2-digit and 4-digit allele resolutions, respectively. The imputation accuracy was minimally affected by SNP density of a test dataset for imputation. In conclusion, the Korean HLA reference panel we developed was highly suitable for imputing HLA alleles and amino acids from MHC SNPs in East Asians, including Koreans.
Schork, Andrew J.; Thompson, Wesley K.; Pham, Phillip; Torkamani, Ali; Roddey, J. Cooper; Sullivan, Patrick F.; Kelsoe, John R.; O'Donovan, Michael C.; Furberg, Helena; Schork, Nicholas J.; Andreassen, Ole A.; Dale, Anders M.; Absher, Devin; Agudo, Antonio; Almgren, Peter; Ardissino, Diego; Assimes, Themistocles L.; Bandinelli, Stephania; Barzan, Luigi; Bencko, Vladimir; Benhamou, Simone; Benjamin, Emelia J.; Bernardinelli, Luisa; Bis, Joshua; Boehnke, Michael; Boerwinkle, Eric; Boomsma, Dorret I.; Brennan, Paul; Canova, Cristina; Castellsagué, Xavier; Chanock, Stephen; Chasman, Daniel; Conway, David I.; Dackor, Jennifer; de Geus, Eco J. C.; Duan, Jubao; Elosua, Roberto; Everett, Brendan; Fabianova, Eleonora; Ferrucci, Luigi; Foretova, Lenka; Fortmann, Stephen P.; Franceschini, Nora; Frayling, Timothy; Furberg, Curt; Gejman, Pablo V.; Groop, Leif; Gu, Fangyi; Guralnik, Jack; Hankinson, Susan E.; Haritunians, Talin; Healy, Claire; Hofman, Albert; Holcátová, Ivana; Hunter, David J.; Hwang, Shih-Jen; Ioannidis, John P. 
A.; Iribarren, Carlos; Jackson, Anne U.; Janout, Vladimir; Kaprio, Jaakko; Kim, Yunjung; Kjaerheim, Kristina; Knowles, Joshua W.; Kraft, Peter; Ladenvall, Claes; Lagiou, Pagona; Lanthrop, Mark; Lerman, Caryn; Levinson, Douglas F.; Levy, Daniel; Li, Ming D.; Lin, Dan Yu; Lips, Esther H.; Lissowska, Jolanta; Lowry, Ray; Lucas, Gavin; Macfarlane, Tatiana V.; Maes, Hermine; Mannucci, Pier Mannuccio; Mates, Dana; Mauri, Francesco; McGovern, Janet Audrain; McKay, James D.; McKnight, Barbara; Melander, Olle; Merlini, Piera Angelica; Milaneschi, Yuri; Mohlke, Karen L.; O'Donnell, Christopher J.; Pare, Guillaume; Penninx, Brenda W.; Perry, John; Posthuma, Danielle; Preis, Sarah Rosner; Psaty, Bruce; Quertermous, Thomas; Ramachandran, Vasan S.; Richiardi, Lorenzo; Ridker, Paul; Rose, Jed; Rudnai, Peter; Salomaa, Veikko; Sanders, Alan R.; Schwartz, Stephen M.; Shi, Jianxin; Smit, Johannes H.; Stringham, Heather M.; Szeszenia-Dabrowska, Neonilia; Tanaka, Toshiko; Taylor, Kent; Thacker, Evan; Thornton, Laura; Tiemeier, Henning; Tuomilehto, Jaakko; Uitterlinden, Andre G.; van Duijn, Cornelia M.; Vink, Jacqueline M.; Vogelzangs, Nicole; Voight, Benjamin F.; Walter, Stefan; Willemsen, Gonneke; Zaridze, David; Znaor, Ariana; Akil, Huda; Anjorin, Adebayo; Backlund, Lena; Badner, Judith A.; Barchas, Jack D.; Barrett, Thomas B.; Bass, Nick; Bauer, Michael; Bellivier, Frank; Bergen, Sarah E.; Berrettini, Wade; Blackwood, Douglas; Bloss, Cinnamon S.; Breen, Gerome; Breuer, René; Bunner, William E.; Burmeister, Margit; Byerley, William; Caesar, Sian; Chambert, Kim; Cichon, Sven; St Clair, David; Collier, David A.; Corvin, Aiden; Coryell, William H.; Craddock, Nicholas; Craig, David W.; Daly, Mark; Day, Richard; Degenhardt, Franziska; Djurovic, Srdjan; Dudbridge, Frank; Edenberg, Howard J.; Elkin, Amanda; Etain, Bruno; Farmer, Anne E.; Ferreira, Manuel A.; Ferrier, I. 
Nicol; Flickinger, Matthew; Foroud, Tatiana; Frank, Josef; Fraser, Christine; Frisén, Louise; Gershon, Elliot S.; Gill, Michael; Gordon-Smith, Katherine; Green, Elaine K.; Greenwood, Tiffany A.; Grozeva, Detelina; Guan, Weihua; Gurling, Hugh; Gustafsson, Ómar; Hamshere, Marian L.; Hautzinger, Martin; Herms, Stefan; Hipolito, Maria; Holmans, Peter A.; Hultman, Christina M.; Jamain, Stéphane; Jones, Edward G.; Jones, Ian; Jones, Lisa; Kandaswamy, Radhika; Kennedy, James L.; Kirov, George K.; Koller, Daniel L.; Kwan, Phoenix; Landén, Mikael; Langstrom, Niklas; Lathrop, Mark; Lawrence, Jacob; Lawson, William B.; Leboyer, Marion; Lee, Phil H.; Li, Jun; Lichtenstein, Paul; Lin, Danyu; Liu, Chunyu; Lohoff, Falk W.; Lucae, Susanne; Mahon, Pamela B.; Maier, Wolfgang; Martin, Nicholas G.; Mattheisen, Manuel; Matthews, Keith; Mattingsdal, Morten; McGhee, Kevin A.; McGuffin, Peter; McInnis, Melvin G.; McIntosh, Andrew; McKinney, Rebecca; McLean, Alan W.; McMahon, Francis J.; McQuillin, Andrew; Meier, Sandra; Melle, Ingrid; Meng, Fan; Mitchell, Philip B.; Montgomery, Grant W.; Moran, Jennifer; Morken, Gunnar; Morris, Derek W.; Moskvina, Valentina; Muglia, Pierandrea; Mühleisen, Thomas W.; Muir, Walter J.; Müller-Myhsok, Bertram; Myers, Richard M.; Nievergelt, Caroline M.; Nikolov, Ivan; Nimgaonkar, Vishwajit; Nöthen, Markus M.; Nurnberger, John I.; Nwulia, Evaristus A.; O'Dushlaine, Colm; Osby, Urban; Óskarsson, Högni; Owen, Michael J.; Petursson, Hannes; Pickard, Benjamin S.; Porgeirsson, Porgeir; Potash, James B.; Propping, Peter; Purcell, Shaun M.; Quinn, Emma; Raychaudhuri, Soumya; Rice, John; Rietschel, Marcella; Ruderfer, Douglas; Schalling, Martin; Schatzberg, Alan F.; Scheftner, William A.; Schofield, Peter R.; Schulze, Thomas G.; Schumacher, Johannes; Schwarz, Markus M.; Scolnick, Ed; Scott, Laura J.; Shilling, Paul D.; Sigurdsson, Engilbert; Sklar, Pamela; Smith, Erin N.; Stefansson, Hreinn; Stefansson, Kari; Steffens, Michael; Steinberg, Stacy; Strauss, John; 
Strohmaier, Jana; Szelinger, Szabocls; Thompson, Robert C.; Tozzi, Federica; Treutlein, Jens; Vincent, John B.; Watson, Stanley J.; Wienker, Thomas F.; Williamson, Richard; Witt, Stephanie H.; Wright, Adam; Xu, Wei; Young, Allan H.; Zandi, Peter P.; Zhang, Peng; Zöllner, Sebastian; Agartz, Ingrid; Albus, Margot; Alexander, Madeline; Amdur, Richard L.; Amin, Farooq; Bass, Nicholas; Bitter, István; Black, Donald W.; Børglum, Anders D.; Brown, Matthew A.; Bruggeman, Richard; Buccola, Nancy G.; Byerley, William F.; Cahn, Wiepke; Cantor, Rita M.; Carr, Vaughan J.; Catts, Stanley V.; Choudhury, Khalid; Cloninger, C. Robert; Cormican, Paul; Danoy, Patrick A.; Datta, Susmita; DeHert, Marc; Demontis, Ditte; Dikeos, Dimitris; Donnelly, Peter; Donohoe, Gary; Duong, Linh; Dwyer, Sarah; Fanous, Ayman; Fink-Jensen, Anders; Freedman, Robert; Freimer, Nelson B.; Friedl, Marion; Georgieva, Lyudmila; Giegling, Ina; Glenthøj, Birte; Godard, Stephanie; Golimbet, Vera; de Haan, Lieuwe; Hansen, Mark; Hansen, Thomas; Hartmann, Annette M.; Henskens, Frans A.; Hougaard, David M.; Ingason, Andrés; Jablensky, Assen V.; Jakobsen, Klaus D.; Jay, Maurice; Jönsson, Erik G.; Jürgens, Gesche; Kahn, René S.; Keller, Matthew C.; Kendler, Kenneth S.; Kenis, Gunter; Kenny, Elaine; Konnerth, Heike; Konte, Bettina; Krabbendam, Lydia; Krasucki, Robert; Lasseter, Virginia K.; Laurent, Claudine; Lencz, Todd; Lerer, F. Bernard; Liang, Kung-Yee; Lieberman, Jeffrey A.; Linszen, Don H.; Lönnqvist, Jouko; Loughland, Carmel M.; Maclean, Alan W.; Maher, Brion S.; Malhotra, Anil K.; Mallet, Jacques; Malloy, Pat; McGrath, John J.; McLean, Duncan E.; Michie, Patricia T.; Milanova, Vihra; Mors, Ole; Mortensen, Preben B.; Mowry, Bryan J.; Myin-Germeys, Inez; Neale, Benjamin; Nertney, Deborah A.; Nestadt, Gerald; Nielsen, Jimmi; Nordentoft, Merete; Norton, Nadine; O'Neill, F. 
Anthony; Olincy, Ann; Olsen, Line; Ophoff, Roel A.; Ørntoft, Torben F.; van Os, Jim; Pantelis, Christos; Papadimitriou, George; Pato, Carlos N.; Pato, Michele T.; Peltonen, Leena; Pickard, Ben; Pietiläinen, Olli P. H.; Pimm, Jonathan; Pulver, Ann E.; Puri, Vinay; Quested, Digby; Rasmussen, Henrik B.; Réthelyi, János M.; Ribble, Robert; Riley, Brien P.; Rossin, Lizzy; Ruggeri, Mirella; Rujescu, Dan; Schall, Ulrich; Schwab, Sibylle G.; Scolnick, Edward; Scott, Rodney J.; Silverman, Jeremy M.; Spencer, Chris C. A.; Strange, Amy; Strengman, Eric; Stroup, T. Scott; Suvisaari, Jaana; Terenius, Lars; Thirumalai, Srinivasa; Timm, Sally; Toncheva, Draga; Tosato, Sarah; van den Oord, Edwin J. C. G.; Veldink, Jan; Visscher, Peter M.; Walsh, Dermot; Wang, August G.; Werge, Thomas; Wiersma, Durk; Wildenauer, Dieter B.; Williams, Hywel J.; Williams, Nigel M.; van Winkel, Ruud; Wormley, Brandon; Zammit, Stan
Recent results indicate that genome-wide association studies (GWAS) have the potential to explain much of the heritability of common complex phenotypes, but methods are lacking to reliably identify the remaining associated single nucleotide polymorphisms (SNPs). We applied stratified False Discovery
Casto, Amanda M; Feldman, Marcus W
Genome-wide association studies (GWAS) have identified more than 2,000 trait-SNP associations, and the number continues to increase. GWAS have focused on traits with potential consequences for human fitness, including many immunological, metabolic, cardiovascular, and behavioral phenotypes. Given the polygenic nature of complex traits, selection may exert its influence on them by altering allele frequencies at many associated loci, a possibility which has yet to be explored empirically. Here we use 38 different measures of allele frequency variation and 8 iHS scores to characterize over 1,300 GWAS SNPs in 53 globally distributed human populations. We apply these same techniques to evaluate SNPs grouped by trait association. We find that groups of SNPs associated with pigmentation, blood pressure, infectious disease, and autoimmune disease traits exhibit unusual allele frequency patterns and elevated iHS scores in certain geographical locations. We also find that GWAS SNPs have generally elevated scores for measures of allele frequency variation and for iHS in Eurasia and East Asia. Overall, we believe that our results provide evidence for selection on several complex traits that has caused changes in allele frequencies and/or elevated iHS scores at a number of associated loci. Since GWAS SNPs collectively exhibit elevated allele frequency measures and iHS scores, selection on complex traits may be quite widespread. Our findings are most consistent with this selection being either positive or negative, although the relative contributions of the two are difficult to discern. Our results also suggest that trait-SNP associations identified in Eurasian samples may not be present in Africa, Oceania, and the Americas, possibly due to differences in linkage disequilibrium patterns. This observation suggests that non-Eurasian and non-East Asian sample populations should be included in future GWAS.
Kidney and cardiovascular disease are widespread among populations with high prevalence of diabetes, such as American Indians participating in the Strong Heart Study (SHS). Studying these conditions simultaneously in longitudinal studies is challenging, because the morbidity and mortality associated with these diseases result in missing data, and these data are likely not missing at random. When such data are merely excluded, study findings may be compromised. In this article, a subset of 2264 participants with complete renal function data from Strong Heart Exams 1 (1989-1991), 2 (1993-1995), and 3 (1998-1999) was used to examine the performance of five methods used to impute missing data: listwise deletion, mean of serial measures, adjacent value, multiple imputation, and pattern-mixture. Three missing at random models and one non-missing at random model were used to compare the performance of the imputation techniques on randomly and non-randomly missing data. The pattern-mixture method was found to perform best for imputing renal function data that were not missing at random. Determining whether data are missing at random or not can help in choosing the imputation method that will provide the most accurate results.
Ferro, Mark A
The aim of this research was to examine, in an exploratory manner, whether cross-sectional multiple imputation generates valid parameter estimates for a latent growth curve model in a longitudinal data set with nonmonotone missingness. A simulated longitudinal data set of N = 5000 was generated and consisted of a continuous dependent variable, assessed at three measurement occasions and a categorical time-invariant independent variable. Missing data had a nonmonotone pattern and the proportion of missingness increased from the initial to the final measurement occasion (5%-20%). Three methods were considered to deal with missing data: listwise deletion, full-information maximum likelihood, and multiple imputation. A latent growth curve model was specified and analysis of variance was used to compare parameter estimates between the full data set and missing data approaches. Multiple imputation resulted in significantly lower slope variance compared with the full data set. There were no differences in any parameter estimates between the multiple imputation and full-information maximum likelihood approaches. This study suggested that in longitudinal studies with nonmonotone missingness, cross-sectional imputation at each time point may be viable and produces estimates comparable with those obtained with full-information maximum likelihood. Future research pursuing the validity of this method is warranted. Copyright © 2014 Elsevier Inc. All rights reserved.
Keogh, Ruth H; White, Ian R
In many large prospective cohorts, expensive exposure measurements cannot be obtained for all individuals. Exposure-disease association studies are therefore often based on nested case-control or case-cohort studies in which complete information is obtained only for sampled individuals. However, in the full cohort, there may be a large amount of information on cheaply available covariates and possibly a surrogate of the main exposure(s), which typically goes unused. We view the nested case-control or case-cohort study plus the remainder of the cohort as a full-cohort study with missing data. Hence, we propose using multiple imputation (MI) to utilise information in the full cohort when data from the sub-studies are analysed. We use the fully observed data to fit the imputation models. We consider using approximate imputation models and also using rejection sampling to draw imputed values from the true distribution of the missing values given the observed data. Simulation studies show that using MI to utilise full-cohort information in the analysis of nested case-control and case-cohort studies can result in important gains in efficiency, particularly when a surrogate of the main exposure is available in the full cohort. In simulations, this method outperforms counter-matching in nested case-control studies and a weighted analysis for case-cohort studies, both of which use some full-cohort information. Approximate imputation models perform well except when there are interactions or non-linear terms in the outcome model, where imputation using rejection sampling works well. Copyright © 2013 John Wiley & Sons, Ltd.
Corbin, Laura J; Kranis, Andreas; Blott, Sarah C; Swinburne, June E; Vaudin, Mark; Bishop, Stephen C; Woolliams, John A
Despite the dramatic reduction in the cost of high-density genotyping that has occurred over the last decade, it remains one of the limiting factors for obtaining the large datasets required for genomic studies of disease in the horse. In this study, we investigated the potential for low-density genotyping and subsequent imputation to address this problem. Using the haplotype phasing and imputation program, BEAGLE, it is possible to impute genotypes from low- to high-density (50K) in the Thoroughbred horse with reasonable to high accuracy. Analysis of the sources of variation in imputation accuracy revealed dependence both on the minor allele frequency of the single nucleotide polymorphisms (SNPs) being imputed and on the underlying linkage disequilibrium structure. Whereas equidistant spacing of the SNPs on the low-density panel worked well, optimising SNP selection to increase their minor allele frequency was advantageous, even when the panel was subsequently used in a population of different geographical origin. Replacing base pair position with linkage disequilibrium map distance reduced the variation in imputation accuracy across SNPs. Whereas a 1K SNP panel was generally sufficient to ensure that more than 80% of genotypes were correctly imputed, other studies suggest that a 2K to 3K panel is more efficient to minimize the subsequent loss of accuracy in genomic prediction analyses. The relationship between accuracy and genotyping costs for the different low-density panels, suggests that a 2K SNP panel would represent good value for money. Low-density genotyping with a 2K SNP panel followed by imputation provides a compromise between cost and accuracy that could promote more widespread genotyping, and hence the use of genomic information in horses. In addition to offering a low cost alternative to high-density genotyping, imputation provides a means to combine datasets from different genotyping platforms, which is becoming necessary since researchers are
In genome-wide association studies, the primary task is to detect biomarkers in the form of Single Nucleotide Polymorphisms (SNPs) that have nontrivial associations with a disease phenotype and some other important clinical/environmental factors. However, the extremely large number of SNPs compared with the sample size inhibits application of classical methods such as multiple logistic regression. Currently the most commonly used approach is still to analyze one SNP at a time. In this paper, we propose to consider the genotypes of the SNPs simultaneously via a logistic analysis of variance (ANOVA) model, which expresses the logit transformed mean of SNP genotypes as the summation of the SNP effects, effects of the disease phenotype and/or other clinical variables, and the interaction effects. We use a reduced-rank representation of the interaction-effect matrix for dimensionality reduction, and employ the L1-penalty in a penalized likelihood framework to filter out the SNPs that have no associations. We develop a Majorization-Minimization algorithm for computational implementation. In addition, we propose a modified BIC criterion to select the penalty parameters and determine the rank number. The proposed method is applied to a Multiple Sclerosis data set and simulated data sets and shows promise in biomarker detection.
Palmer, Cameron; Pe’er, Itsik
Missing data are an unavoidable component of modern statistical genetics. Different array or sequencing technologies cover different single nucleotide polymorphisms (SNPs), leading to a complicated mosaic pattern of missingness where both individual genotypes and entire SNPs are sporadically absent. Such missing data patterns cannot be ignored without introducing bias, yet cannot be inferred exclusively from nonmissing data. In genome-wide association studies, the accepted solution to missingness is to impute missing data using external reference haplotypes. The resulting probabilistic genotypes may be analyzed in the place of genotype calls. A general-purpose paradigm, called Multiple Imputation (MI), is known to model uncertainty in many contexts, yet it is not widely used in association studies. Here, we undertake a systematic evaluation of existing imputed data analysis methods and MI. We characterize biases related to uncertainty in association studies, and find that bias is introduced both at the imputation level, when imputation algorithms generate inconsistent genotype probabilities, and at the association level, when analysis methods inadequately model genotype uncertainty. We find that MI performs at least as well as existing methods or in some cases much better, and provides a straightforward paradigm for adapting existing genotype association methods to uncertain data. PMID:27310603
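The multiple imputation paradigm described above analyzes each imputed data set separately and then combines the results. A minimal sketch of that combining step follows, assuming Rubin's rules are used for pooling (the abstract does not state which pooling rule the authors applied). The function name `pool_rubin` and the effect estimates and standard errors are hypothetical values chosen only for illustration.

```python
import math
from statistics import mean, variance

def pool_rubin(estimates, variances):
    """Pool per-imputation estimates and squared standard errors (Rubin's rules)."""
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations and matching lengths")
    q_bar = mean(estimates)      # pooled point estimate
    w = mean(variances)          # within-imputation variance
    b = variance(estimates)      # between-imputation variance (divisor m - 1)
    t = w + (1 + 1 / m) * b      # total variance of the pooled estimate
    return q_bar, math.sqrt(t)

# Hypothetical per-imputation SNP effects (e.g. log odds ratios) from m = 5 imputed data sets.
betas = [0.21, 0.25, 0.19, 0.23, 0.22]
ses = [0.10, 0.11, 0.10, 0.12, 0.10]
pooled_beta, pooled_se = pool_rubin(betas, [s ** 2 for s in ses])
```

The pooled standard error exceeds the average within-imputation standard error whenever the estimates differ across imputations, which is how the uncertainty about the missing genotypes enters the final test.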
Background: The activity of thiopurine methyltransferase (TPMT) is subject to genetic variation. Loss-of-function alleles are associated with various degrees of myelosuppression after treatment with thiopurine drugs, thus genotype-based dosing recommendations currently exist. The aim of this study was to evaluate the potential utility of leveraging genomic data from large biorepositories in the identification of individuals with TPMT defective alleles. Material and methods: TPMT variants were imputed using the 1,000 Genomes Project reference panel in 87,979 samples from the biobank at The Children’s Hospital of Philadelphia. Population ancestry was determined by principal component analysis using HapMap3 samples as reference. Frequencies of the TPMT imputed alleles, genotypes and the associated phenotype were determined across the different populations. A sample of 630 subjects with genotype data from Sanger sequencing (N=59) and direct genotyping (N=583) (12 samples overlapping in the two groups) was used to check the concordance between the imputed and observed genotypes, as well as the sensitivity, specificity and positive and negative predictive values of the imputation. Results: Two SNPs (rs1800460 and rs1142345) that represent three TPMT alleles (*3A, *3B, and *3C) were imputed with adequate quality. Frequency for the associated enzyme activity varied across populations and 89.36-94.58% were predicted to have normal TPMT activity, 5.3-10.31% intermediate and 0.12-0.34% poor activities. Overall, 98.88% of individuals (623/630) were correctly imputed into carrying no risk alleles (553/553), heterozygous (45/46) and homozygous (25/31). Sensitivity, specificity and predictive values of imputation were over 90% in all cases except for the sensitivity of imputing homozygous subjects that was 80.64%. Conclusion: Imputation of TPMT alleles from existing genomic data can be used as a first step in the screening of individuals at risk of developing serious
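The concordance figures quoted in this abstract can be rechecked from the reported counts. The short sketch below does so, assuming each denominator is the number of subjects observed in that genotype class, so that the per-class fraction corresponds to the sensitivity for that class; the variable names are illustrative only.

```python
# Counts reported in the abstract: (correctly imputed, observed total) per genotype class.
counts = {
    "no risk allele": (553, 553),
    "heterozygous": (45, 46),
    "homozygous": (25, 31),
}

correct = sum(c for c, _ in counts.values())   # subjects imputed correctly
total = sum(n for _, n in counts.values())     # all validated subjects
overall = correct / total                      # overall concordance
per_class = {name: c / n for name, (c, n) in counts.items()}

for name, value in per_class.items():
    print(f"{name}: {value:.4f}")
print(f"overall: {overall:.4f}")
```

The totals reproduce the 623/630 reported overall, and 25/31 is roughly 0.806, in line with the reported sensitivity for homozygous subjects.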
Bhasi, Kavitha; Zhang, Li; Brazeau, Daniel; Zhang, Aidong; Ramanathan, Murali
The size, dimensionality and the limited range of the data values make visualization of single nucleotide polymorphism (SNP) datasets challenging. The purpose of this study is to evaluate the usefulness of 3D VizStruct, a novel multi-dimensional data visualization technique for SNP datasets capable of identifying informative SNPs in genome-wide association studies. VizStruct is an interactive visualization technique that reduces multi-dimensional data to three dimensions using a combination of the discrete Fourier transform and the Kullback–Leibler divergence. The performance of 3D VizStruct was challenged with several diverse, biologically relevant published datasets including the human lipoprotein lipase (LPL) gene locus, the human Y-chromosome in several populations and a multi-locus genotype dataset of coral samples from four populations. In every case, the SNPs and/or polymorphic markers identified by the 3D VizStruct mapping were predictive of the underlying biology. PMID:16899448
Recent advances in high-throughput genotyping technologies have provided the opportunity to map genes using associations between complex traits and markers. Genome-wide association studies (GWAS) based on either a single marker or haplotype have identified genetic variants and underlying genetic mechanisms of quantitative traits. Prompted by the achievements of studies examining economic traits in cattle and to verify the consistency of these two methods using real data, the current study was conducted to construct the haplotype structure in the bovine genome and to detect relevant genes genuinely affecting a carcass trait and a meat quality trait. Using the Illumina BovineHD BeadChip, 942 young bulls with genotyping data were introduced as a reference population to identify the genes in the beef cattle genome significantly associated with foreshank weight and triglyceride levels. In total, 92,553 haplotype blocks were detected in the genome. The regions of high linkage disequilibrium extended up to approximately 200 kb, and the size of haplotype blocks ranged from 22 bp to 199,266 bp. Additionally, the individual SNP analysis and the haplotype-based analysis detected similar regions and common SNPs for these two representative traits. A total of 12 and 7 SNPs in the bovine genome were significantly associated with foreshank weight and triglyceride levels, respectively. By comparison, 4 and 5 haplotype blocks containing the majority of significant SNPs were strongly associated with foreshank weight and triglyceride levels, respectively. In addition, 36 SNPs with high linkage disequilibrium were detected in the GNAQ gene, a potential hotspot that may play a crucial role for regulating carcass trait components.
Zhang, Zhongrong; Yang, Xuan; Li, Hao; Li, Weide; Yan, Haowen; Shi, Fei
Techniques for data analysis have been widely developed in past years; however, missing data still represent a ubiquitous problem in many scientific fields. In particular, dealing with missing spatiotemporal data presents an enormous challenge. Nonetheless, in recent years, a considerable amount of research has focused on spatiotemporal problems, making spatiotemporal missing data imputation methods increasingly indispensable. In this paper, a novel spatiotemporal hybrid method is proposed to verify and impute spatiotemporal missing values. This new method, termed SOM-FLSSVM, flexibly combines three advanced techniques: self-organizing feature map (SOM) clustering, the fruit fly optimization algorithm (FOA) and the least squares support vector machine (LSSVM). We employ a cross-validation (CV) procedure and FOA swarm intelligence optimization strategy that can search available parameters and determine the optimal imputation model. The spatiotemporal underground water data for Minqin County, China, were selected to test the reliability and imputation ability of SOM-FLSSVM. We carried out a validation experiment and compared three well-studied models with SOM-FLSSVM using different missing data ratios from 0.1 to 0.8 in the same data set. The results demonstrate that the new hybrid method performs well in terms of both robustness and accuracy for spatiotemporal missing data.
Brown, Samuel M; Duggal, Abhijit; Hou, Peter C; Tidswell, Mark; Khan, Akram; Exline, Matthew; Park, Pauline K; Schoenfeld, David A; Liu, Ming; Grissom, Colin K; Moss, Marc; Rice, Todd W; Hough, Catherine L; Rivers, Emanuel; Thompson, B Taylor; Brower, Roy G
In the contemporary ICU, mechanically ventilated patients may not have arterial blood gas measurements available at relevant timepoints. Severity criteria often depend on arterial blood gas results. Retrospective studies suggest that nonlinear imputation of PaO2/FIO2 from SpO2/FIO2 is accurate, but this has not been established prospectively among mechanically ventilated ICU patients. The objective was to validate the superiority of nonlinear imputation of PaO2/FIO2 among mechanically ventilated patients and understand what factors influence the accuracy of imputation. Simultaneous SpO2, oximeter characteristics, receipt of vasopressors, and skin pigmentation were recorded at the time of a clinical arterial blood gas. Acute respiratory distress syndrome criteria were recorded. For each imputation method, we calculated both imputation error and the area under the curve for patients meeting criteria for acute respiratory distress syndrome (PaO2/FIO2 ≤ 300) and moderate-severe acute respiratory distress syndrome (PaO2/FIO2 ≤ 150). Nine hospitals within the Prevention and Early Treatment of Acute Lung Injury network. We prospectively enrolled 703 mechanically ventilated patients admitted to the emergency departments or ICUs of participating study hospitals. None. We studied 1,034 arterial blood gases from 703 patients; 650 arterial blood gases were associated with SpO2 less than or equal to 96%. Nonlinear imputation had consistently lower error than other techniques. Among all patients, nonlinear had a lower error (p < 0.001) and higher (p < 0.001) area under the curve (0.87; 95% CI, 0.85-0.90) for PaO2/FIO2 less than or equal to 300 than linear/log-linear (0.80; 95% CI, 0.76-0.83) imputation. All imputation methods better identified moderate-severe acute respiratory distress syndrome (PaO2/FIO2 ≤ 150); nonlinear imputation remained superior (p < 0.001). For PaO2/FIO2 less than or equal to 150, the sensitivity and specificity for nonlinear imputation were 0
Aßmann, Christian; Würbach, Ariane; Goßmann, Solange; Geissler, Ferdinand; Bela, Anika
Large-scale surveys typically exhibit data structures characterized by rich mutual dependencies between surveyed variables and individual-specific skip patterns. Despite high efforts in fieldwork and questionnaire design, missing values inevitably occur. One approach for handling missing values is to provide multiply imputed data sets, thus…
Background: Missing genotype imputation and haplotype reconstruction are valuable in genome-wide association studies (GWASs). By modeling the patterns of linkage disequilibrium in a reference panel, genotypes not directly measured in the study samples can be imputed and used for GWASs. Since millions of single nucleotide polymorphisms need to be imputed in a GWAS, faster methods for genotype imputation and haplotype reconstruction are required. Results: We developed a program package for parallel computation of genotype imputation and haplotype reconstruction. Our program package, ParaHaplo 3.0, is intended for use in workstation clusters using the Intel Message Passing Interface. We compared the performance of ParaHaplo 3.0 on the Japanese in Tokyo, Japan, and Han Chinese in Beijing, China, samples of the HapMap dataset. A parallel version of ParaHaplo 3.0 can conduct genotype imputation 20 times faster than a non-parallel version of ParaHaplo. Conclusions: ParaHaplo 3.0 is an invaluable tool for conducting haplotype-based GWASs. The need for faster genotype imputation and haplotype reconstruction using parallel computing will become increasingly important as the data sizes of such projects continue to increase. ParaHaplo executable binaries and program sources are available at http://en.sourceforge.jp/projects/parallelgwas/releases/.
Plumpton, Catrin O; Morris, Tim; Hughes, Dyfrig A; White, Ian R
Missing data in a large scale survey presents major challenges. We focus on performing multiple imputation by chained equations when data contain multiple incomplete multi-item scales. Recent authors have proposed imputing such data at the level of the individual item, but this can lead to infeasibly large imputation models. We use data gathered from a large multinational survey, where analysis uses separate logistic regression models in each of nine country-specific data sets. In these data, applying multiple imputation by chained equations to the individual scale items is computationally infeasible. We propose an adaptation of multiple imputation by chained equations which imputes the individual scale items but reduces the number of variables in the imputation models by replacing most scale items with scale summary scores. We evaluate the feasibility of the proposed approach and compare it with a complete case analysis. We perform a simulation study to compare the proposed method with alternative approaches: we do this in a simplified setting to allow comparison with the full imputation model. For the case study, the proposed approach reduces the size of the prediction models from 134 predictors to a maximum of 72 and makes multiple imputation by chained equations computationally feasible. Distributions of imputed data are seen to be consistent with observed data. Results from the regression analysis with multiple imputation are similar to, but more precise than, results for complete case analysis; for the same regression models a 39% reduction in the standard error is observed. The simulation shows that our proposed method can perform comparably against the alternatives. By substantially reducing imputation model sizes, our adaptation makes multiple imputation feasible for large scale survey data with multiple multi-item scales. For the data considered, analysis of the multiply imputed data shows greater power and efficiency than complete case analysis. The
Sepúlveda, Nuno; Manjurano, Alphaxard; Drakeley, Chris; Clark, Taane G
Multiple imputation based on chained equations (MICE) is an alternative missing genotype method that can use genetic and nongenetic auxiliary data to inform the imputation process. Previously, MICE was successfully tested on strongly linked genetic data. We have now tested it on data of the HBA2 gene which, by the experimental design used in a malaria association study in Tanzania, shows a high missing data percentage and is weakly linked with the remaining genetic markers in the data set. We constructed different imputation models and studied their performance under different missing data conditions. Overall, MICE failed to accurately predict the true genotypes. However, using the best imputation model for the data, we obtained unbiased estimates for the genetic effects, and association signals of the HBA2 gene on malaria positivity. When the whole data set was analyzed with the same imputation model, the association signal increased from 0.80 to 2.70 before and after imputation, respectively. Conversely, postimputation estimates for the genetic effects remained the same in relation to the complete case analysis but showed increased precision. We argue that these postimputation estimates are reasonably unbiased, as a result of a good study design based on matching key socio-environmental factors. © 2014 The Authors. Annals of Human Genetics published by John Wiley & Sons Ltd and University College London (UCL).
Genomic selection uses genome-wide marker information to predict breeding values for traits of economic interest, and is more accurate than pedigree-based methods. The development of high density SNP arrays for Atlantic salmon has enabled genomic selection in selective breeding programs, alongside high-resolution association mapping of the genetic basis of complex traits. However, in sibling testing schemes typical of salmon breeding programs, trait records are available on many thousands of fish with close relationships to the selection candidates. Therefore, routine high density SNP genotyping may be prohibitively expensive. One means to reducing genotyping cost is the use of genotype imputation, where selected key animals (e.g., breeding program parents) are genotyped at high density, and the majority of individuals (e.g., performance tested fish and selection candidates) are genotyped at much lower density, followed by imputation to high density. The main objectives of the current study were to assess the feasibility and accuracy of genotype imputation in the context of a salmon breeding program. The specific aims were: (i) to measure the accuracy of genotype imputation using medium (25 K) and high (78 K) density mapped SNP panels, by masking varying proportions of the genotypes and assessing the correlation between the imputed genotypes and the true genotypes; and (ii) to assess the efficacy of imputed genotype data in genomic prediction of key performance traits (sea lice resistance and body weight). Imputation accuracies of up to 0.90 were observed using the simple two-generation pedigree dataset, and moderately high accuracy (0.83) was possible even with very low density SNP data (∼250 SNPs). The performance of genomic prediction using imputed genotype data was comparable to using true genotype data, and both were superior to pedigree-based prediction. These results demonstrate that the genotype imputation approach used in this study can
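The accuracy measure described in this abstract (mask known genotypes, impute them, then correlate imputed with true values) can be sketched as follows. This is only an illustration under stated assumptions: genotypes are coded 0/1/2, the data are simulated with hypothetical allele frequencies, and `snp_mean_impute` is a deliberately naive stand-in for a real imputation program, not the method used in the study.

```python
import random

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def snp_mean_impute(masked):
    """Naive stand-in imputer: fill each missing call with that SNP's observed mean."""
    means = []
    for j in range(len(masked[0])):
        observed = [row[j] for row in masked if row[j] is not None]
        means.append(sum(observed) / len(observed))
    return [[means[j] if g is None else g for j, g in enumerate(row)] for row in masked]

def masked_accuracy(true_geno, impute, mask_rate, rng):
    """Hide a fraction of genotypes, impute them, and correlate imputed with true values."""
    masked = [[None if rng.random() < mask_rate else g for g in row] for row in true_geno]
    filled = impute(masked)
    pairs = [(filled[i][j], true_geno[i][j])
             for i, row in enumerate(masked) for j, g in enumerate(row) if g is None]
    imputed, truth = zip(*pairs)
    return pearson(imputed, truth)

rng = random.Random(1)
freqs = [rng.uniform(0.05, 0.95) for _ in range(50)]   # hypothetical allele frequencies
true_geno = [[(rng.random() < p) + (rng.random() < p) for p in freqs] for _ in range(200)]
accuracy = masked_accuracy(true_geno, snp_mean_impute, mask_rate=0.2, rng=rng)
```

Passing the imputer as a function keeps the masking and scoring logic unchanged when the naive filler is replaced by a pedigree- or haplotype-based method, and `mask_rate` can be varied to mimic the different masking proportions mentioned in the abstract.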
Ramstein, Guillaume P; Lipka, Alexander E; Lu, Fei; Costich, Denise E; Cherney, Jerome H; Buckler, Edward S; Casler, Michael D
Genotyping by sequencing allows for large-scale genetic analyses in plant species with no reference genome, but sets the challenge of sound inference in presence of uncertain genotypes. We report an imputation-based genome-wide association study (GWAS) in reed canarygrass (Phalaris arundinacea L., Phalaris caesia Nees), a cool-season grass species with potential as a biofuel crop. Our study involved two linkage populations and an association panel of 590 reed canarygrass genotypes. Plants were assayed for up to 5228 single nucleotide polymorphism markers and 35 traits. The genotypic markers were derived from low-depth sequencing with 78% missing data on average. To soundly infer marker-trait associations, multiple imputation (MI) was used: several imputes of the marker data were generated to reflect imputation uncertainty and association tests were performed on marker effects across imputes. A total of nine significant markers were identified, three of which showed significant homology with the Brachypodium distachyon genome. Because no physical map of the reed canarygrass genome was available, imputation was conducted using classification trees. In general, MI showed good consistency with the complete-case analysis and adequate control over imputation uncertainty. A gain in significance of marker effects was achieved through MI, but only for rare cases when missing data were <45%. In addition to providing insight into the genetic basis of important traits in reed canarygrass, this study presents one of the first applications of MI to genome-wide analyses and provides useful guidelines for conducting GWAS based on genotyping-by-sequencing data. Copyright © 2015 Ramstein et al.
Zahid, Faisal M; Heumann, Christian
Missing data is a common issue that can cause problems in estimation and inference in biomedical, epidemiological and social research. Multiple imputation is an increasingly popular approach for handling missing data. In case of a large number of covariates with missing data, existing multiple imputation software packages may not work properly and often produce errors. We propose a multiple imputation algorithm called mispr based on sequential penalized regression models. Each variable with missing values is assumed to have a different distributional form and is imputed with its own imputation model using the ridge penalty. In the case of a large number of predictors with respect to the sample size, the use of a quadratic penalty guarantees unique estimates for the parameters and leads to better predictions than the usual Maximum Likelihood Estimation (MLE), with a good compromise between bias and variance. As a result, the proposed algorithm performs well and provides imputed values that are better even for a large number of covariates with small samples. The results are compared with the existing software packages mice, VIM and Amelia in simulation studies. The missing at random mechanism was the main assumption in the simulation study. The imputation performance of the proposed algorithm is evaluated with mean squared imputation error and mean absolute imputation error. The mean squared error, parameter estimates with their standard errors and confidence intervals are also computed to compare the performance in the regression context. The proposed algorithm is observed to be a good competitor to the existing algorithms, with smaller mean squared imputation error, mean absolute imputation error and mean squared error. The algorithm's performance becomes considerably better than that of the existing algorithms with increasing number of covariates, especially when the number of predictors is close to or even greater than the sample size. Two
Souverein, O. W.; Zwinderman, A. H.; Tanck, M. W. T.
The objective of this study was to investigate the performance of multiple imputation of missing genotype data for unrelated individuals using the polytomous logistic regression model, focusing on different missingness mechanisms, percentages of missing data, and imputation models. A complete
The roles of 29 non-synonymous, 15 intronic, and 3 close-to-UTR single nucleotide polymorphisms (SNPs) and 2 mutations of human pyruvate kinase (PK) M2 were investigated by in-silico and in-vitro functional studies. Prediction of deleterious substitutions based on sequence homology and structure based servers, SIFT, PANTHER, SNPs&GO, PhD-SNP, SNAP and PolyPhen, depicted that 19% emerged common between all the mentioned programs. SNPeffect and HOPE showed three substitutions (C31F, Q310P and S437Y) in-silico as deleterious and functionally important. In-vitro activity assays showed C31F and S437Y variants of PKM2 with reduced activity, while the Q310P variant was catalytically inactive. The allosteric activation due to binding of fructose 1-6 bisphosphate (FBP) was compromised in the case of the S437Y nsSNP variant protein. This was corroborated through a molecular dynamics (MD) simulation study, which was also carried out for the other two variant proteins. The 5 intronic SNPs of PKM2, associated with sporadic breast cancer in a case-control study, when subjected to different computational analyses, indicated that 3 SNPs (rs2856929, rs8192381 and rs8192431) could generate an alternative transcript by influencing splicing factor binding to PKM2. We propose that these potentially functional and important variations, both within exons and introns, could have a bearing on cancer metabolism, since PKM2 has been implicated in cancer in the recent past.
Kamitsuji, Shigeo; Matsuda, Takashi; Nishimura, Koichi; Endo, Seiko; Wada, Chisa; Watanabe, Kenji; Hasegawa, Koichi; Hishigaki, Haretsugu; Masuda, Masatoshi; Kuwahara, Yusuke; Tsuritani, Katsuki; Sugiura, Kenkichi; Kubota, Tomoko; Miyoshi, Shinji; Okada, Kinya; Nakazono, Kazuyuki; Sugaya, Yuki; Yang, Woosung; Sawamoto, Taiji; Uchida, Wataru; Shinagawa, Akira; Fujiwara, Tsutomu; Yamada, Hisaharu; Suematsu, Koji; Tsutsui, Naohisa; Kamatani, Naoyuki; Liou, Shyh-Yuh
The Japan Pharmacogenomics Data Science Consortium (JPDSC) has assembled a database for conducting pharmacogenomics (PGx) studies in Japanese subjects. The database contains the genotypes of 2.5 million single-nucleotide polymorphisms (SNPs) and 5 human leukocyte antigen loci from 2994 Japanese healthy volunteers, as well as 121 kinds of clinical information, including self-reports, physiological data, hematological data and biochemical data. In this article, the reliability of our data was evaluated by principal component analysis (PCA) and association analysis for hematological and biochemical traits by using genome-wide SNP data. PCA of the SNPs showed that all the samples were collected from the Japanese population and that the samples were separated into two major clusters by birthplace, Okinawa and other than Okinawa, as had been previously reported. Among 87 SNPs that have been reported to be associated with 18 hematological and biochemical traits in genome-wide association studies (GWAS), the associations of 56 SNPs were replicated using our database. Statistical power simulations showed that the sample size of the JPDSC control database is large enough to detect genetic markers having a relatively strong association even when the case sample size is small. The JPDSC database will be useful as control data for conducting PGx studies to explore genetic markers to improve the safety and efficacy of drugs either during clinical development or in post-marketing.
Yanan Hu; Qianqian Zhu; Maozai Tian
In this study, we consider the nonparametric quantile regression model with the covariates Missing at Random (MAR). Multiple imputation is becoming an increasingly popular approach for analyzing missing data, which combined with quantile regression is not well-developed. We propose an effective and accurate two-stage multiple imputation method for the model based on the quantile regression, which consists of initial imputation in the first stage and multiple imputation in the second stage. Th...
Morris, Tim P; White, Ian R; Royston, Patrick; Seaman, Shaun R; Wood, Angela M
We are concerned with multiple imputation of the ratio of two variables, which is to be used as a covariate in a regression analysis. If the numerator and denominator are not missing simultaneously, it seems sensible to make use of the observed variable in the imputation model. One such strategy is to impute missing values for the numerator and denominator, or the log-transformed numerator and denominator, and then calculate the ratio of interest; we call this 'passive' imputation. Alternatively, missing ratio values might be imputed directly, with or without the numerator and/or the denominator in the imputation model; we call this 'active' imputation. In two motivating datasets, one involving body mass index as a covariate and the other involving the ratio of total to high-density lipoprotein cholesterol, we assess the sensitivity of results to the choice of imputation model and, as an alternative, explore fully Bayesian joint models for the outcome and incomplete ratio. Fully Bayesian approaches using Winbugs were unusable in both datasets because of computational problems. In our first dataset, multiple imputation results are similar regardless of the imputation model; in the second, results are sensitive to the choice of imputation model. Sensitivity depends strongly on the coefficient of variation of the ratio's denominator. A simulation study demonstrates that passive imputation without transformation is risky because it can lead to downward bias when the coefficient of variation of the ratio's denominator is larger than about 0.1. Active imputation or passive imputation after log-transformation is preferable. © 2013 The Authors. Statistics in Medicine published by John Wiley & Sons, Ltd.
Mikhchi, Abbas; Honarvar, Mahmood; Kashan, Nasser Emam Jomeh; Aminafshar, Mehdi
Genotype imputation is an important tool for predicting unknown genotypes for both unrelated individuals and parent-offspring trios. Several imputation methods are available; they either employ universal machine learning methods or deploy algorithms dedicated to inferring missing genotypes. In this research, the performance of eight machine learning methods (Support Vector Machine, K-Nearest Neighbors, Extreme Learning Machine, Radial Basis Function, Random Forest, AdaBoost, LogitBoost, and TotalBoost) was compared in terms of imputation accuracy, computation time and the factors affecting imputation accuracy. The methods were applied to real and simulated datasets to impute the un-typed SNPs in parent-offspring trios. The results show that imputation of parent-offspring trios can be accurate. Random Forest and Support Vector Machine were more accurate than the other machine learning methods. TotalBoost performed slightly worse than the other methods. Running times differed between methods. The Extreme Learning Machine (ELM) was always the fastest algorithm. As the sample size increases, the Radial Basis Function (RBF) method requires a long imputation time. The tested methods can be an alternative for imputation of un-typed SNPs when the missing rate is low. However, it is recommended that other machine learning methods be used for imputation. Copyright © 2016 Elsevier Ltd. All rights reserved.
Zhao, Yize; Long, Qi
Missing data are frequently encountered in biomedical, epidemiologic and social research. It is well known that a naive analysis without adequate handling of missing data may lead to bias and/or loss of efficiency. Partly due to its ease of use, multiple imputation has become increasingly popular in practice for handling missing data. However, it is unclear what is the best strategy to conduct multiple imputation in the presence of high-dimensional data. To answer this question, we investigate several approaches of using regularized regression and Bayesian lasso regression to impute missing values in the presence of high-dimensional data. We compare the performance of these methods through numerical studies, in which we also evaluate the impact of the dimension of the data, the size of the true active set for imputation, and the strength of correlation. Our numerical studies show that in the presence of high-dimensional data the standard multiple imputation approach performs poorly and the imputation approach using Bayesian lasso regression achieves, in most cases, better performance than the other imputation methods including the standard imputation approach using the correctly specified imputation model. Our results suggest that Bayesian lasso regression and its extensions are better suited for multiple imputation in the presence of high-dimensional data than the other regression methods. © The Author(s) 2013.
Nguyen, Cattram D; Lee, Katherine J; Carlin, John B
Multiple imputation is gaining popularity as a strategy for handling missing data, but there is a scarcity of tools for checking imputation models, a critical step in model fitting. Posterior predictive checking (PPC) has been recommended as an imputation diagnostic. PPC involves simulating "replicated" data from the posterior predictive distribution of the model under scrutiny. Model fit is assessed by examining whether the analysis from the observed data appears typical of results obtained from the replicates produced by the model. A proposed diagnostic measure is the posterior predictive "p-value", an extreme value of which (i.e., a value close to 0 or 1) suggests a misfit between the model and the data. The aim of this study was to evaluate the performance of the posterior predictive p-value as an imputation diagnostic. Using simulation methods, we deliberately misspecified imputation models to determine whether posterior predictive p-values were effective in identifying these problems. When estimating the regression parameter of interest, we found that more extreme p-values were associated with poorer imputation model performance, although the results highlighted that traditional thresholds for classical p-values do not apply in this context. A shortcoming of the PPC method was its reduced ability to detect misspecified models with increasing amounts of missing data. Despite the limitations of posterior predictive p-values, they appear to have a valuable place in the imputer's toolkit. In addition to automated checking using p-values, we recommend imputers perform graphical checks and examine other summaries of the test quantity distribution. © 2015 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
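The posterior predictive p-value described in the abstract above can be illustrated with a minimal Python sketch. It assumes one common definition: the proportion of posterior draws in which the test quantity computed on replicated data is at least as large as the same quantity computed on the completed data. The function name and the example numbers are hypothetical and are not taken from the study.

```python
def posterior_predictive_p(completed_stats, replicated_stats):
    """Share of paired draws where the replicated test quantity is at least
    as large as the completed-data test quantity (assumed definition)."""
    if not completed_stats or len(completed_stats) != len(replicated_stats):
        raise ValueError("need two non-empty sequences of equal length")
    hits = sum(rep >= comp for comp, rep in zip(completed_stats, replicated_stats))
    return hits / len(completed_stats)


# Hypothetical test quantities (e.g. a regression coefficient) from four draws.
completed = [1.0, 1.1, 0.9, 1.2]
replicated = [1.3, 1.0, 1.5, 1.4]
ppp = posterior_predictive_p(completed, replicated)
```

Under this definition, a value near 0 or 1 would flag a possible misfit, although the abstract cautions that conventional p-value thresholds do not carry over to this setting.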
Jason H Karnes
Imputation of human leukocyte antigen (HLA) alleles from SNP-level data is attractive due to the importance of HLA alleles in human disease, the widespread availability of genome-wide association study (GWAS) data, and the expertise required for HLA sequencing. However, comprehensive evaluations of HLA imputation programs are limited. We compared HLA imputation results of HIBAG, SNP2HLA, and HLA*IMP:02 to sequenced HLA alleles in 3,265 samples from BioVU, a de-identified electronic health record database coupled to a DNA biorepository. We performed four-digit HLA sequencing for HLA-A, -B, -C, -DRB1, -DPB1, and -DQB1 using long-read 454 FLX sequencing. All samples were genotyped using both the Illumina HumanExome BeadChip platform and a GWAS platform. Call rates and concordance rates were compared by platform, frequency of allele, and race/ethnicity. Overall concordance rates were similar between programs in European Americans (EAs) (0.975 [SNP2HLA]; 0.939 [HLA*IMP:02]; 0.976 [HIBAG]). SNP2HLA provided a significant advantage in terms of call rate and the number of alleles imputed. Concordance rates were lower overall for African Americans (AAs). These observations were consistent when accuracy was compared across HLA loci. All imputation programs performed similarly for low frequency HLA alleles. Higher concordance rates were observed when HLA alleles were imputed from GWAS platforms versus the HumanExome BeadChip, suggesting that high genomic coverage is preferred as input for HLA allelic imputation. These findings provide guidance on the best use of HLA imputation methods and elucidate their limitations.
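As a rough illustration of the two metrics compared in the abstract above, the following Python sketch computes a call rate and a concordance rate from paired imputed and sequenced alleles. The definitions are assumptions, since the abstract does not spell them out: call rate is taken as the share of alleles for which the program returned a call, and concordance as the share of called alleles that match the sequenced allele. The function name and allele values are hypothetical.

```python
def call_and_concordance(pairs):
    """pairs: list of (imputed_allele_or_None, sequenced_allele).
    Returns (call_rate, concordance_rate) under the assumed definitions."""
    total = len(pairs)
    called = [(imp, seq) for imp, seq in pairs if imp is not None]
    call_rate = len(called) / total if total else float("nan")
    if called:
        concordance = sum(imp == seq for imp, seq in called) / len(called)
    else:
        concordance = float("nan")
    return call_rate, concordance


# Hypothetical four-digit HLA-A calls; None marks an allele with no call.
pairs = [
    ("A*02:01", "A*02:01"),
    ("A*01:01", "A*01:01"),
    ("A*03:01", "A*11:01"),  # discordant call
    (None, "A*24:02"),       # no call made
]
call_rate, concordance = call_and_concordance(pairs)
```

Reporting the two numbers together matters because a program can raise its concordance simply by declining to call difficult alleles, which lowers its call rate.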
Golabpour, Amin; Etminani, Kobra; Doosti, Hassan; Miri, Hamid Heidarian; Ghanbari, Reza
Missing values in data are found in a large number of studies in the field of medical sciences, especially longitudinal ones, in which repeated measurements are taken from each person during the study. In this regard, several statistical endeavors have been performed on the concepts, issues, and theoretical methods during the past few decades. Herein, we focused on the missing data related to patients excluded from longitudinal studies. To this end, two statistical parameters of similarity and correlation coefficient were employed. In addition, metaheuristic algorithms were applied to achieve an optimal solution. The selected metaheuristic algorithm, which has a great search functionality, was the Cuckoo search algorithm. Profiles of subjects with cervical dystonia (CD) were used to evaluate the proposed model after applying missingness. It was concluded that the algorithm used in this study had a higher accuracy (98.48%), compared with similar approaches. Concomitant use of similar parameters and correlation coefficients led to a significant increase in accuracy of missing data imputation.
Statistical imputation of classical HLA alleles in case-control studies has become established as a valuable tool for identifying and fine-mapping signals of disease association in the MHC. Imputation into diverse populations has, however, remained challenging, mainly because of the additional haplotypic heterogeneity introduced by combining reference panels of different sources. We present an HLA type imputation model, HLA*IMP:02, designed to operate on a multi-population reference panel. HLA*IMP:02 is based on a graphical representation of haplotype structure. We present a probabilistic algorithm to build such models for the HLA region, accommodating genotyping error, haplotypic heterogeneity and the need for maximum accuracy at the HLA loci, generalizing the work of Browning and Browning (2007) and Ron et al. (1998). HLA*IMP:02 achieves an average 4-digit imputation accuracy on diverse European panels of 97% (call rate 97%). On non-European samples, 2-digit performance is over 90% for most loci and ethnicities where data are available. HLA*IMP:02 supports imputation of HLA-DPB1 and HLA-DRB3-5, is highly tolerant of missing data in the imputation panel and works on standard genotype data from popular genotyping chips. It is publicly available in source code and as a user-friendly web service framework.
Ølykke, Grith Skovgaard
In this article, the issue of imputability to the State of public undertakings’ decision-making is analysed and discussed in the context of the DSBFirst case. DSBFirst is owned by the independent public undertaking DSB and the private undertaking FirstGroup plc and won the contracts in the 2008... exercised by the State, imputability to the State, and the State’s fulfilment of the Market Economy Investor Principle. Furthermore, it is examined whether, in the absence of imputability, public undertakings’ market behaviour is subject to the Market Economy Investor Principle, and it is concluded that this is not the case. Lastly, it is discussed whether other legal instruments, namely competition law, public procurement law, or the Transparency Directive, regulate public undertakings’ market behaviour. It is found that those rules are not sufficient to mend the gap created by the imputability requirement. Legal...
Lee, S Hong; DeCandia, Teresa R; Ripke, Stephan
Schizophrenia is a complex disorder caused by both genetic and environmental factors. Using 9,087 affected individuals, 12,171 controls and 915,354 imputed SNPs from the Schizophrenia Psychiatric Genome-Wide Association Study (GWAS) Consortium (PGC-SCZ), we estimate that 23% (s.e. = 1%) of variation in liability to schizophrenia is captured by SNPs. We show that a substantial proportion of this variation must be the result of common causal variants, that the variance explained by each chromosome is linearly related to its length (r = 0.89, P = 2.6 × 10^-8), that the genetic basis of schizophrenia is the same in males and females, and that a disproportionate proportion of variation is attributable to a set of 2,725 genes expressed in the central nervous system (CNS; P = 7.6 × 10^-8). These results are consistent with a polygenic genetic architecture and imply more individual SNP...
Hasan, Haliza; Ahmad, Sanizah; Osman, Balkish Mohd; Sapri, Shamsiah; Othman, Nadirah
In regression analysis, missing covariate data are a common problem. Many researchers use ad hoc methods to overcome this problem because they are easy to implement. However, these methods require assumptions about the data that rarely hold in practice. Model-based methods such as Maximum Likelihood (ML) using the expectation maximization (EM) algorithm and Multiple Imputation (MI) are more promising when dealing with difficulties caused by missing data. Even so, inappropriate methods of missing value imputation can lead to serious bias that severely affects the parameter estimates. The main objective of this study is to provide a better understanding of missing data concepts that can help researchers select appropriate missing data imputation methods. A simulation study was performed to assess the effects of different missing data techniques on the performance of a regression model. The covariate data were generated using an underlying multivariate normal distribution and the dependent variable was generated as a combination of explanatory variables. Missing values in the covariates were simulated using a missing at random (MAR) mechanism. Four levels of missingness (10%, 20%, 30% and 40%) were imposed. ML and MI techniques available within SAS software were investigated. A linear regression model was fitted and two model performance measures, MSE and R-squared, were obtained. Results of the analysis showed that MI is superior in handling missing data, with the highest R-squared and lowest MSE, when the percentage of missingness is less than 30%. Neither method is able to handle levels of missingness larger than 30%.
Susianto, Y.; Notodiputro, K. A.; Kurnia, A.; Wijayanto, H.
Missing values in repeated measurements have attracted concern from researchers in the last few years. For many years, the standard statistical methods for repeated measurements have been developed assuming that the data were complete. The standard statistical methods cannot produce good estimates if the data are substantially affected by missing values. To overcome this problem, imputation methods can be used. This paper discusses three imputation methods, namely the Yates method, the expectation-maximization (EM) algorithm, and the Markov Chain Monte Carlo (MCMC) method. These methods were used to estimate the missing values of per-capita expenditure data at sub-district level in Central Java. The performance of these imputation methods is evaluated by comparing the mean square error (MSE) and mean absolute error (MAE) of the resulting estimates using linear mixed models. It is shown that the MSE and MAE produced by the Yates method are lower than the MSE and MAE resulting from both the EM algorithm and the MCMC method. Therefore, the Yates method is recommended for imputing the missing values of per-capita expenditure at sub-district level.
Mbougua, Jules Brice Tchatchueng; Laurent, Christian; Ndoye, Ibra; Delaporte, Eric; Gwet, Henri; Molinari, Nicolas
Multiple imputation is commonly used to impute missing covariates in the Cox semiparametric regression setting. It fills in each missing value with plausible values via a Gibbs sampling procedure, specifying an imputation model for each incomplete variable. This imputation method is implemented in several software packages that offer imputation models steered by the shape of the variable to be imputed, but all these imputation models assume that covariate effects are linear. However, this assumption often does not hold in practice, as covariates can have nonlinear effects. Such a linearity assumption can lead to misleading conclusions because the imputation model should be constructed to reflect the true distributional relationship between the missing values and the observed values. To estimate nonlinear effects of continuous time-invariant covariates in the imputation model, we propose a method based on B-spline functions. To assess the performance of this method, we conducted a simulation study in which we compared multiple imputation using a Bayesian splines imputation model with multiple imputation using a Bayesian linear imputation model in a survival analysis setting. We evaluated the proposed method on the motivating data set, collected from HIV-infected patients enrolled in an observational cohort study in Senegal, which contains several incomplete variables. We found that our method performs well in estimating hazard ratios compared with the linear imputation methods when data are missing completely at random or missing at random. Copyright © 2013 John Wiley & Sons, Ltd.
Cornish, R P; Macleod, J; Carpenter, J R; Tilling, K
When an outcome variable is missing not at random (MNAR: probability of missingness depends on outcome values), estimates of the effect of an exposure on this outcome are often biased. We investigated the extent of this bias and examined whether the bias can be reduced through incorporating proxy outcomes obtained through linkage to administrative data as auxiliary variables in multiple imputation (MI). Using data from the Avon Longitudinal Study of Parents and Children (ALSPAC) we estimated the association between breastfeeding and IQ (continuous outcome), incorporating linked attainment data (proxies for IQ) as auxiliary variables in MI models. Simulation studies explored the impact of varying the proportion of missing data (from 20 to 80%), the correlation between the outcome and its proxy (0.1-0.9), the strength of the missing data mechanism, and having a proxy variable that was incomplete. Incorporating a linked proxy for the missing outcome as an auxiliary variable reduced bias and increased efficiency in all scenarios, even when 80% of the outcome was missing. Using an incomplete proxy was similarly beneficial. High correlations (> 0.5) between the outcome and its proxy substantially reduced the missing information. Consistent with this, ALSPAC analysis showed inclusion of a proxy reduced bias and improved efficiency. Gains with additional proxies were modest. In longitudinal studies with loss to follow-up, incorporating proxies for this study outcome obtained via linkage to external sources of data as auxiliary variables in MI models can give practically important bias reduction and efficiency gains when the study outcome is MNAR.
Breen, Vivienne; Kasabov, Nikola; Kamat, Ashish M; Jacobson, Elsie; Suttie, James M; O'Sullivan, Paul J; Kavalieris, Laimonis; Darling, David G
Comparing the relative utility of diagnostic tests is challenging when available datasets are small, partial or incomplete. The analytical leverage associated with a large sample size can be gained by integrating several small datasets to enable effective and accurate across-dataset comparisons. Accordingly, we propose a methodology for a holistic comparative analysis and ranking of cancer diagnostic tests through dataset integration and imputation of missing values, using urothelial carcinoma (UC) as a case study. Five datasets comprising samples from 939 subjects, including 89 with UC, where up to four diagnostic tests (cytology, NMP22®, UroVysion® Fluorescence In-Situ Hybridization (FISH) and Cxbladder Detect) were integrated into a single dataset containing all measured records and missing values. The tests were firstly ranked using three criteria: sensitivity, specificity and a standard variable (feature) ranking method popularly known as signal-to-noise ratio (SNR) index derived from the mean values for all subjects clinically known to have UC versus healthy subjects. Secondly, step-wise unsupervised and supervised imputation (the latter accounting for the 'clinical truth' as determined by cystoscopy) was performed using personalized modelling, k-nearest-neighbour methods, multiple logistic regression and multilayer perceptron neural networks. All imputation models were cross-validated by comparing their post-imputation predictive accuracy for UC with their pre-imputation accuracy. Finally, the post-imputation tests were re-ranked using the same three criteria. In both measured and imputed data sets, Cxbladder Detect ranked higher for sensitivity, and urine cytology a higher specificity, when compared with other UC tests. Cxbladder Detect consistently ranked higher than FISH and all other tests when SNR analyses were performed on measured, unsupervised and supervised imputed datasets. Supervised imputation resulted in a smaller cross-validation error
Jennifer L Bolton
We examined whether a panel of SNPs, systematically selected from genome-wide association studies (GWAS), could improve risk prediction of coronary heart disease (CHD), over and above conventional risk factors. These SNPs have already demonstrated reproducible associations with CHD; here we examined their use in long-term risk prediction. SNPs identified from meta-analyses of GWAS of CHD were tested in 840 men and women aged 55-75 from the Edinburgh Artery Study, a prospective, population-based study with 15 years of follow-up. Cox proportional hazards models were used to evaluate the addition of SNPs to conventional risk factors in prediction of CHD risk. CHD was classified as myocardial infarction (MI), coronary intervention (angioplasty, or coronary artery bypass surgery), angina and/or unspecified ischaemic heart disease as a cause of death; additional analyses were limited to MI or coronary intervention. Model performance was assessed by changes in discrimination and net reclassification improvement (NRI). There were significant improvements with addition of 27 SNPs to conventional risk factors for prediction of CHD (NRI of 54%, P<0.001; C-index 0.671 to 0.740, P = 0.001), as well as MI or coronary intervention (NRI of 44%, P<0.001; C-index 0.717 to 0.750, P = 0.256). ROC curves showed that addition of SNPs better improved discrimination when the sensitivity of conventional risk factors was low for prediction of MI or coronary intervention. There was significant improvement in risk prediction of CHD over 15 years when SNPs identified from GWAS were added to conventional risk factors. This effect may be particularly useful for identifying individuals with a low prognostic index who are in fact at greater risk of disease than indicated by conventional risk factors alone.
Asendorpf, Jens B.; van de Schoot, Rens; Denissen, Jaap J. A.; Hutteman, Roos
Most longitudinal studies are plagued by drop-out related to variables at earlier assessments (systematic attrition). Although systematic attrition is often analysed in longitudinal studies, surprisingly few researchers attempt to reduce biases due to systematic attrition, even though this is possible and nowadays technically easy. This is…
Oliveira Júnior, Gerson A; Chud, Tatiane C S; Ventura, Ricardo V; Garrick, Dorian J; Cole, John B; Munari, Danísio P; Ferraz, José B S; Mullart, Erik; DeNise, Sue; Smith, Shannon; da Silva, Marcos Vinícius G B
The objective of this study was to investigate different strategies for genotype imputation in a population of crossbred Girolando (Gyr × Holstein) dairy cattle. The data set consisted of 478 Girolando, 583 Gyr, and 1,198 Holstein sires genotyped at high density with the Illumina BovineHD (Illumina, San Diego, CA) panel, which includes ∼777K markers. The accuracy of imputation from low (20K) and medium densities (50K and 70K) to the HD panel density and from low to 50K density was investigated. Seven scenarios using different reference populations (RPop), considering Girolando, Gyr, and Holstein breeds separately or combinations of animals of these breeds, were tested for imputing genotypes of 166 randomly chosen Girolando animals. The population genotype imputation was performed using FImpute. Imputation accuracy was measured as the correlation between observed and imputed genotypes (CORR) and also as the proportion of genotypes that were imputed correctly (CR). This is the first paper on imputation accuracy in a Girolando population. The sample-specific imputation accuracies ranged from 0.38 to 0.97 (CORR) and from 0.49 to 0.96 (CR) when imputing from low and medium densities to HD, and from 0.41 to 0.95 (CORR) and from 0.50 to 0.94 (CR) for imputation from 20K to 50K. The CORR anim exceeded 0.96 (for 50K and 70K panels) when only Girolando animals were included in RPop (S1). We found smaller CORR anim when Gyr (S2) was used instead of Holstein (S3) as RPop. The same behavior was observed between S4 (Gyr + Girolando) and S5 (Holstein + Girolando) because the target animals were more related to the Holstein population than to the Gyr population. The highest imputation accuracies were observed for scenarios including Girolando animals in the reference population, whereas using only Gyr animals resulted in low imputation accuracies, suggesting that the haplotypes segregating in the Girolando population had a greater effect on accuracy than the purebred haplotypes. All
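The two accuracy measures named in the abstract above, CORR and CR, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: genotypes are coded as allele counts 0, 1 or 2, CORR is taken to be the Pearson correlation, and the function name and example data are hypothetical rather than drawn from the study.

```python
from math import sqrt


def imputation_accuracy(observed, imputed):
    """Return (CORR, CR) for two equal-length genotype vectors coded 0/1/2.
    CORR: Pearson correlation between observed and imputed genotypes.
    CR:   proportion of genotypes imputed correctly."""
    n = len(observed)
    if n == 0 or n != len(imputed):
        raise ValueError("need two non-empty vectors of equal length")
    cr = sum(o == i for o, i in zip(observed, imputed)) / n
    mean_o = sum(observed) / n
    mean_i = sum(imputed) / n
    cov = sum((o - mean_o) * (i - mean_i) for o, i in zip(observed, imputed))
    ss_o = sum((o - mean_o) ** 2 for o in observed)
    ss_i = sum((i - mean_i) ** 2 for i in imputed)
    # Correlation is undefined when either vector has no variation.
    corr = cov / sqrt(ss_o * ss_i) if ss_o > 0 and ss_i > 0 else float("nan")
    return corr, cr


# Hypothetical genotypes for one animal at six markers; one marker is wrong.
observed = [0, 1, 2, 1, 0, 2]
imputed = [0, 1, 2, 2, 0, 2]
corr, cr = imputation_accuracy(observed, imputed)
```

The two measures can disagree: CR counts every error equally, whereas CORR penalizes an error of two allele counts more than an error of one and is sensitive to allele frequency.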
Singh, Preety K; Mistry, Kinnari N; Chiramana, Haritha; Rank, Dharamshi N; Joshi, Chaitanya G
The non-homologous end joining (NHEJ) pathway has a pivotal role in the repair of double-strand DNA breaks that may lead to carcinogenesis. XRCC4 is one of the essential proteins of this pathway, and single-nucleotide polymorphisms (SNPs) of this gene are reported to be associated with cancer risk. In our study, we first used computational approaches to predict the damaging variants of the XRCC4 gene. The tools predicted the nsSNP rs79561451 (S110P) to be the most deleterious SNP. Along with this SNP, we analysed two other SNPs (rs3734091 and rs6869366) to study their association with breast cancer in a population of West India. Variant rs3734091 was found to be significantly associated with breast cancer, while variant rs6869366 did not show any association. These SNPs may influence the susceptibility of individuals to breast cancer in this population. Copyright © 2018 Elsevier B.V. All rights reserved.
O'Keeffe, Aidan G; Farewell, Daniel M; Tom, Brian D M; Farewell, Vernon T
In longitudinal randomised trials and observational studies within a medical context, a composite outcome, which is a function of several individual patient-specific outcomes, may be felt to best represent the outcome of interest. As in other contexts, missing data on patient outcome, due to patient drop-out or for other reasons, may pose a problem. Multiple imputation is a widely used method for handling missing data, but its use for composite outcomes has seldom been discussed. Whilst standard multiple imputation methodology can be used directly for the composite outcome, the distribution of a composite outcome may be of a complicated form and perhaps not amenable to statistical modelling. We compare direct multiple imputation of a composite outcome with separate imputation of the components of a composite outcome. We consider two imputation approaches. One approach involves modelling each component of a composite outcome using standard likelihood-based models. The other approach is to use linear increments methods. A linear increments approach can provide an appealing alternative as assumptions concerning both the missingness structure within the data and the imputation models are different from the standard likelihood-based approach. We compare both approaches using simulation studies and data from a randomised trial on early rheumatoid arthritis patients. Results suggest that both approaches are comparable and that for each, separate imputation offers some improvement on the direct imputation of a composite outcome.
Boscoe Francis P
Background To reduce the number of non-geocoded cases, researchers and organizations sometimes include cases geocoded to postal code centroids along with cases geocoded with the greater precision of a full street address. Some analysts then use the postal code to assign information to the cases from finer-level geographies such as a census tract. Assignment is commonly completed using either a postal centroid or a geographical imputation method, which assigns a location by using both the demographic characteristics of the case and the population characteristics of the postal delivery area. To date no systematic evaluation of geographical imputation methods ("geo-imputation") has been completed. The objective of this study was to determine the accuracy of census tract assignment using geo-imputation. Methods Using a large dataset of breast, prostate and colorectal cancer cases reported to the New Jersey Cancer Registry, we determined how often cases were assigned to the correct census tract using alternate strategies of demographic-based geo-imputation, and using assignments obtained from postal code centroids. Assignment accuracy was measured by comparing the tract assigned with the tract originally identified from the full street address. Results Assigning cases to census tracts using the race/ethnicity population distribution within a postal code resulted in more correctly assigned cases than when using postal code centroids. The addition of age characteristics increased the match rates even further. Match rates were highly dependent on both the geographic distribution of race/ethnicity groups and population density. Conclusion Geo-imputation appears to offer some advantages and no serious drawbacks as compared with the alternative of assigning cases to census tracts based on postal code centroids. For a specific analysis, researchers will still need to consider the potential impact of geocoding quality on their results and evaluate
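The demographic-based assignment described in the abstract above can be sketched in Python. The sketch rests on an assumption the abstract does not confirm: a case is assigned to a census tract at random, with probability proportional to the number of residents of the case's race/ethnicity group living in each tract of the postal code. The function name, tract identifiers and population counts are all hypothetical.

```python
import random


def geo_impute(case_group, tract_populations, rng):
    """Pick a census tract for a case known only by postal code.
    tract_populations: {tract_id: {group: resident_count}} for the postal code.
    Assumed rule: sample a tract with probability proportional to the number
    of residents in the case's demographic group."""
    tracts = sorted(tract_populations)
    weights = [tract_populations[t].get(case_group, 0) for t in tracts]
    if sum(weights) == 0:
        # No residents of this group recorded anywhere: fall back to uniform.
        return rng.choice(tracts)
    return rng.choices(tracts, weights=weights, k=1)[0]


# Hypothetical postal code covering two tracts with very different make-up.
populations = {
    "tract_A": {"group_1": 0, "group_2": 50},
    "tract_B": {"group_1": 100, "group_2": 0},
}
rng = random.Random(0)  # fixed seed so the sketch is reproducible
assigned = [geo_impute("group_2", populations, rng) for _ in range(20)]
```

In this toy postal code every `group_2` case lands in `tract_A`, which mirrors the abstract's observation that match rates depend strongly on how race/ethnicity groups are distributed geographically.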
Morisot, Adeline; Bessaoud, Faïza; Landais, Paul; Rébillard, Xavier; Trétarre, Brigitte; Daurès, Jean-Pierre
Background Methods for estimating survival rates are diverse, and the choice of the appropriate method depends on the context. Given the increasing interest in multiple imputation methods, we explored the value of a multiple imputation approach in the estimation of cause-specific survival when only a subset of causes of death was observed. Methods Using the European Randomized Study of Screening for Prostate Cancer (ERSPC), 20 multiply imputed datasets were created and analyzed with a Multivariate Imput...
Multiple imputation (MI) is an advanced technique for handling missing values. It is superior to single imputation in that it takes into account the uncertainty in missing value imputation. However, MI is underutilized in the medical literature due to lack of familiarity and computational challenges. The article provides a step-by-step approach to performing MI using the R multivariate imputation by chained equations (MICE) package. The procedure first creates m complete datasets by calling the mice() function. Statistical analyses such as univariate analysis and regression modelling can then be performed within each dataset by calling the with() function, which sets the environment for statistical analysis. Lastly, the results obtained from each analysis are combined using the pool() function.
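The final pooling step of the impute, analyse, pool workflow described above is conventionally carried out with Rubin's rules. The following Python sketch shows those rules on their own; it is not a port of the R functions, and the function name `pool_rubin` and the example numbers are hypothetical. Each of the m analyses is assumed to supply one point estimate and its squared standard error.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Combine m point estimates and their squared standard errors
    with Rubin's rules. Returns (pooled_estimate, total_variance)."""
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations and matching lengths")
    q_bar = mean(estimates)        # pooled point estimate
    u_bar = mean(variances)        # average within-imputation variance
    b = variance(estimates)        # between-imputation variance (m - 1 denominator)
    total = u_bar + (1 + 1 / m) * b
    return q_bar, total


# Hypothetical regression coefficient estimated in m = 3 imputed datasets.
estimates = [1.0, 1.2, 0.8]
variances = [0.04, 0.05, 0.03]
pooled, total_var = pool_rubin(estimates, variances)
```

The between-imputation term is what distinguishes MI from single imputation: it inflates the pooled variance to reflect uncertainty about the missing values themselves.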
Sullivan, Thomas R; Lee, Katherine J; Ryan, Philip; Salter, Amy B
Multiple imputation is a popular approach to handling missing data in medical research, yet little is known about its applicability for estimating the relative risk. Standard methods for imputing incomplete binary outcomes involve logistic regression or an assumption of multivariate normality, whereas relative risks are typically estimated using log binomial models. It is unclear whether misspecification of the imputation model in this setting could lead to biased parameter estimates. Using simulated data, we evaluated the performance of multiple imputation for handling missing data prior to estimating adjusted relative risks from a correctly specified multivariable log binomial model. We considered an arbitrary pattern of missing data in both outcome and exposure variables, with missing data induced under missing at random mechanisms. Focusing on standard model-based methods of multiple imputation, missing data were imputed using multivariate normal imputation or fully conditional specification with a logistic imputation model for the outcome. Multivariate normal imputation performed poorly in the simulation study, consistently producing estimates of the relative risk that were biased towards the null. Despite outperforming multivariate normal imputation, fully conditional specification also produced somewhat biased estimates, with greater bias observed for higher outcome prevalences and larger relative risks. Deleting imputed outcomes from analysis datasets did not improve the performance of fully conditional specification. Both multivariate normal imputation and fully conditional specification produced biased estimates of the relative risk, presumably since both use a misspecified imputation model. Based on simulation results, we recommend researchers use fully conditional specification rather than multivariate normal imputation and retain imputed outcomes in the analysis when estimating relative risks. However fully conditional specification is not without its
Tapiador Sanjuán, M J
The validity, efficacy and responsibility of acts depend on the intelligence and will of the acting subject; therefore, when these are reduced or debilitated, the acts may be declared non-valid and the author not responsible for them. Some neurological pathologies may generate permanent physical and/or psychic deficiencies, which prevent subjects from acting on their own. For these cases, the law establishes the state of incapacity, in order to protect the disabled and complete the reduced ability, guaranteeing their rights and security. The disabled state will be determined by a legal sentence, which states the lack of ability to manage. In that sentence the extension and limits of the disability will be determined; the disability level will be proportional to the degree of insight. Similarly, a subject suffering a pathological condition that invalidates his/her will and intelligence will be considered non-responsible and not imputable, since there is no capacity for culpability. The Penal Code establishes the criteria that will determine the possibility of imputability or its absence, as well as modifying circumstances.
Doidge, James C
Population-based cohort studies are invaluable to health research because of the breadth of data collection over time, and the representativeness of their samples. However, they are especially prone to missing data, which can compromise the validity of analyses when data are not missing at random. Having many waves of data collection presents opportunity for participants' responsiveness to be observed over time, which may be informative about missing data mechanisms and thus useful as an auxiliary variable. Modern approaches to handling missing data such as multiple imputation and maximum likelihood can be difficult to implement with the large numbers of auxiliary variables and large amounts of non-monotone missing data that occur in cohort studies. Inverse probability-weighting can be easier to implement but conventional wisdom has stated that it cannot be applied to non-monotone missing data. This paper describes two methods of applying inverse probability-weighting to non-monotone missing data, and explores the potential value of including measures of responsiveness in either inverse probability-weighting or multiple imputation. Simulation studies are used to compare methods and demonstrate that responsiveness in longitudinal studies can be used to mitigate bias induced by missing data, even when data are not missing at random.
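As a minimal sketch of the basic idea behind inverse probability-weighting, the Python below computes a weighted mean in which each observed value is weighted by the inverse of its probability of being observed. It does not reproduce either of the paper's two methods for non-monotone missing data. The function name ipw_mean is hypothetical, and the response probabilities are assumed to be supplied by the caller (in practice they would be estimated, for example from a model that could include measures of responsiveness).

```python
def ipw_mean(values, observed, response_prob):
    """Inverse-probability-weighted mean of the observed values (hypothetical helper).

    values:        outcomes (entries for unobserved cases are ignored)
    observed:      booleans, True where the outcome was observed
    response_prob: assumed probability that each case is observed
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for y, r, p in zip(values, observed, response_prob):
        if not r:
            continue  # unobserved cases contribute nothing directly
        if not 0.0 < p <= 1.0:
            raise ValueError("response probabilities must lie in (0, 1]")
        w = 1.0 / p  # observed cases stand in for similar missing cases
        weighted_sum += w * y
        weight_total += w
    if weight_total == 0.0:
        raise ValueError("no observed cases")
    return weighted_sum / weight_total


if __name__ == "__main__":
    print(ipw_mean([10.0, 20.0, None, 40.0],
                   [True, True, False, True],
                   [1.0, 0.5, 0.5, 0.5]))
```

Normalising by the sum of the weights, rather than by the sample size, keeps the estimate within the range of the observed values.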
Many young people experiment with cannabis, yet only a subgroup progress to dependence, suggesting individual differences that could relate to factors such as genetics and behavioral traits. Dopamine receptor D2 (DRD2) and proenkephalin (PENK) genes have been implicated in animal studies with cannabis exposure. Whether polymorphisms of these genes are associated with cannabis dependence and related behavioral traits is unknown. Healthy young adults (18-27 years) with cannabis dependence and without a dependence diagnosis were studied (N = 50/group) in relation to a priori-determined single nucleotide polymorphisms (SNPs) of the DRD2 and PENK genes. Negative affect, Impulsive Risk Taking and Neuroticism-Anxiety temperamental traits, positive and negative reward-learning performance and stop-signal reaction times were examined. The findings replicated the known association between the rs6277 DRD2 SNP and decisions associated with negative reinforcement outcomes. Moreover, PENK variants (rs2576573 and rs2609997) were significantly related to Neuroticism and cannabis dependence. Cigarette smoking is common in cannabis users, but it was not associated with PENK SNPs, as also validated in another cohort (N = 247 smokers, N = 312 non-smokers). Neuroticism mediated (15.3%-19.5%) the genetic risk to cannabis dependence and interacted with risk SNPs, resulting in a 9-fold increased risk for cannabis dependence. Molecular characterization of the postmortem human brain in a different population revealed an association between PENK SNPs and PENK mRNA expression in the central amygdala nucleus, emphasizing the functional relevance of the SNPs in a brain region strongly linked to negative affect. Overall, the findings suggest an important role for Neuroticism as an endophenotype linking PENK polymorphisms to cannabis-dependence vulnerability, synergistically amplifying the apparent genetic risk.
The effects of reference population size and the availability of information from genotyped ancestors on the accuracy of imputation of single nucleotide polymorphisms (SNPs) were investigated for Mexican Holstein cattle. Three scenarios for reference population size were examined: (1) a local popula...
Frazer, Kelly A; Ballinger, Dennis G; Cox, David R; Hinds, David A; Stuve, Laura L; Gibbs, Richard A; Belmont, John W; Boudreau, Andrew; Hardenbol, Paul; Leal, Suzanne M; Pasternak, Shiran; Wheeler, David A; Willis, Thomas D; Yu, Fuli; Yang, Huanming; Zeng, Changqing; Gao, Yang; Hu, Haoran; Hu, Weitao; Li, Chaohua; Lin, Wei; Liu, Siqi; Pan, Hao; Tang, Xiaoli; Wang, Jian; Wang, Wei; Yu, Jun; Zhang, Bo; Zhang, Qingrun; Zhao, Hongbin; Zhao, Hui; Zhou, Jun; Gabriel, Stacey B; Barry, Rachel; Blumenstiel, Brendan; Camargo, Amy; Defelice, Matthew; Faggart, Maura; Goyette, Mary; Gupta, Supriya; Moore, Jamie; Nguyen, Huy; Onofrio, Robert C; Parkin, Melissa; Roy, Jessica; Stahl, Erich; Winchester, Ellen; Ziaugra, Liuda; Altshuler, David; Shen, Yan; Yao, Zhijian; Huang, Wei; Chu, Xun; He, Yungang; Jin, Li; Liu, Yangfan; Shen, Yayun; Sun, Weiwei; Wang, Haifeng; Wang, Yi; Wang, Ying; Xiong, Xiaoyan; Xu, Liang; Waye, Mary M Y; Tsui, Stephen K W; Xue, Hong; Wong, J Tze-Fei; Galver, Luana M; Fan, Jian-Bing; Gunderson, Kevin; Murray, Sarah S; Oliphant, Arnold R; Chee, Mark S; Montpetit, Alexandre; Chagnon, Fanny; Ferretti, Vincent; Leboeuf, Martin; Olivier, Jean-François; Phillips, Michael S; Roumy, Stéphanie; Sallée, Clémentine; Verner, Andrei; Hudson, Thomas J; Kwok, Pui-Yan; Cai, Dongmei; Koboldt, Daniel C; Miller, Raymond D; Pawlikowska, Ludmila; Taillon-Miller, Patricia; Xiao, Ming; Tsui, Lap-Chee; Mak, William; Song, You Qiang; Tam, Paul K H; Nakamura, Yusuke; Kawaguchi, Takahisa; Kitamoto, Takuya; Morizono, Takashi; Nagashima, Atsushi; Ohnishi, Yozo; Sekine, Akihiro; Tanaka, Toshihiro; Tsunoda, Tatsuhiko; Deloukas, Panos; Bird, Christine P; Delgado, Marcos; Dermitzakis, Emmanouil T; Gwilliam, Rhian; Hunt, Sarah; Morrison, Jonathan; Powell, Don; Stranger, Barbara E; Whittaker, Pamela; Bentley, David R; Daly, Mark J; de Bakker, Paul I W; Barrett, Jeff; Chretien, Yves R; Maller, Julian; McCarroll, Steve; Patterson, Nick; Pe'er, Itsik; Price, Alkes; Purcell, Shaun; Richter, 
Daniel J; Sabeti, Pardis; Saxena, Richa; Schaffner, Stephen F; Sham, Pak C; Varilly, Patrick; Altshuler, David; Stein, Lincoln D; Krishnan, Lalitha; Smith, Albert Vernon; Tello-Ruiz, Marcela K; Thorisson, Gudmundur A; Chakravarti, Aravinda; Chen, Peter E; Cutler, David J; Kashuk, Carl S; Lin, Shin; Abecasis, Gonçalo R; Guan, Weihua; Li, Yun; Munro, Heather M; Qin, Zhaohui Steve; Thomas, Daryl J; McVean, Gilean; Auton, Adam; Bottolo, Leonardo; Cardin, Niall; Eyheramendy, Susana; Freeman, Colin; Marchini, Jonathan; Myers, Simon; Spencer, Chris; Stephens, Matthew; Donnelly, Peter; Cardon, Lon R; Clarke, Geraldine; Evans, David M; Morris, Andrew P; Weir, Bruce S; Tsunoda, Tatsuhiko; Mullikin, James C; Sherry, Stephen T; Feolo, Michael; Skol, Andrew; Zhang, Houcan; Zeng, Changqing; Zhao, Hui; Matsuda, Ichiro; Fukushima, Yoshimitsu; Macer, Darryl R; Suda, Eiko; Rotimi, Charles N; Adebamowo, Clement A; Ajayi, Ike; Aniagwu, Toyin; Marshall, Patricia A; Nkwodimmah, Chibuzor; Royal, Charmaine D M; Leppert, Mark F; Dixon, Missy; Peiffer, Andy; Qiu, Renzong; Kent, Alastair; Kato, Kazuto; Niikawa, Norio; Adewole, Isaac F; Knoppers, Bartha M; Foster, Morris W; Clayton, Ellen Wright; Watkin, Jessica; Gibbs, Richard A; Belmont, John W; Muzny, Donna; Nazareth, Lynne; Sodergren, Erica; Weinstock, George M; Wheeler, David A; Yakub, Imtaz; Gabriel, Stacey B; Onofrio, Robert C; Richter, Daniel J; Ziaugra, Liuda; Birren, Bruce W; Daly, Mark J; Altshuler, David; Wilson, Richard K; Fulton, Lucinda L; Rogers, Jane; Burton, John; Carter, Nigel P; Clee, Christopher M; Griffiths, Mark; Jones, Matthew C; McLay, Kirsten; Plumb, Robert W; Ross, Mark T; Sims, Sarah K; Willey, David L; Chen, Zhu; Han, Hua; Kang, Le; Godbout, Martin; Wallenburg, John C; L'Archevêque, Paul; Bellemare, Guy; Saeki, Koji; Wang, Hongguang; An, Daochang; Fu, Hongbo; Li, Qing; Wang, Zhen; Wang, Renwu; Holden, Arthur L; Brooks, Lisa D; McEwen, Jean E; Guyer, Mark S; Wang, Vivian Ota; Peterson, Jane L; Shi, Michael; 
Spiegel, Jack; Sung, Lawrence M; Zacharia, Lynn F; Collins, Francis S; Kennedy, Karen; Jamieson, Ruth; Stewart, John
We describe the Phase II HapMap, which characterizes over 3.1 million human single nucleotide polymorphisms (SNPs) genotyped in 270 individuals from four geographically diverse populations and includes 25-35% of common SNP variation in the populations surveyed. The map is estimated to capture untyped common variation with an average maximum r² of between 0.9 and 0.96 depending on population. We demonstrate that the current generation of commercial genome-wide genotyping products captures common Phase II SNPs with an average maximum r² of up to 0.8 in African and up to 0.95 in non-African populations, and that potential gains in power in association studies can be obtained through imputation. These data also reveal novel aspects of the structure of linkage disequilibrium. We show that 10-30% of pairs of individuals within a population share at least one region of extended genetic identity arising from recent ancestry and that up to 1% of all common variants are untaggable, primarily because they lie within recombination hotspots. We show that recombination rates vary systematically around genes and between genes of different function. Finally, we demonstrate increased differentiation at non-synonymous, compared to synonymous, SNPs, resulting from systematic differences in the strength or efficacy of natural selection between populations.
Rawlings, Andreea Monica; Sang, Yingying; Sharrett, Albert Richey; Coresh, Josef; Griswold, Michael; Kucharska-Newton, Anna Maria; Palta, Priya; Wruck, Lisa Miller; Gross, Alden Lawrence; Deal, Jennifer Anne; Power, Melinda Carolyn; Bandeen-Roche, Karen Jean
Longitudinal studies of cognitive performance are sensitive to dropout, as participants experiencing cognitive deficits are less likely to attend study visits, which may bias estimated associations between exposures of interest and cognitive decline. Multiple imputation is a powerful tool for handling missing data; however, its use for missing cognitive outcome measures in longitudinal analyses remains limited. We use multiple imputation by chained equations (MICE) to impute cognitive performance scores of participants who did not attend the 2011-2013 exam of the Atherosclerosis Risk in Communities Study. We examined the validity of imputed scores using observed and simulated data under varying assumptions. We examined differences in the estimated association between diabetes at baseline and 20-year cognitive decline with and without imputed values. Lastly, we discuss how different analytic methods (mixed models and models fit using generalized estimating equations) and the choice of for whom to impute result in different estimands. Validation using observed data showed MICE produced unbiased imputations. Simulations showed a substantial reduction in the bias of the 20-year association between diabetes and cognitive decline comparing MICE (3-4% bias) to analyses of available data only (16-23% bias) in a construct where missingness was strongly informative but realistic. Associations between diabetes and 20-year cognitive decline were substantially stronger with MICE than in available-case analyses. Our study suggests that when informative data are available for non-examined participants, MICE can be an effective tool for imputing cognitive performance and improving assessment of cognitive decline, though careful thought should be given to the target imputation population and the analytic model chosen, as they may yield different estimands.
This dissertation focuses on finding plausible imputations when there is some restriction posed on the imputation model. In these restrictive situations, current imputation methodology does not lead to satisfactory imputations. The restrictions, and the resulting missing data problems are real-life
DiazOrdaz, K; Kenward, M G; Gomes, M; Grieve, R
Missing observations are common in cluster randomised trials. The problem is exacerbated when modelling bivariate outcomes jointly, as the proportion of complete cases is often considerably smaller than the proportion having either of the outcomes fully observed. Approaches taken to handling such missing data include the following: complete case analysis, single-level multiple imputation that ignores the clustering, multiple imputation with a fixed effect for each cluster and multilevel multiple imputation. We contrasted the alternative approaches to handling missing data in a cost-effectiveness analysis that uses data from a cluster randomised trial to evaluate an exercise intervention for care home residents. We then conducted a simulation study to assess the performance of these approaches on bivariate continuous outcomes, in terms of confidence interval coverage and empirical bias in the estimated treatment effects. Missing-at-random clustered data scenarios were simulated following a full-factorial design. Across all the missing data mechanisms considered, the multiple imputation methods provided estimators with negligible bias, while complete case analysis resulted in biased treatment effect estimates in scenarios where the randomised treatment arm was associated with missingness. Confidence interval coverage was generally in excess of nominal levels (up to 99.8%) following fixed-effects multiple imputation and too low following single-level multiple imputation. Multilevel multiple imputation led to coverage levels of approximately 95% throughout. © 2016 The Authors. Statistics in Medicine Published by John Wiley & Sons Ltd.
Abstract Background Differences in the genetic architecture of inflammatory bowel disease between different European countries and ethnicities have previously been reported. In the present study, we wanted to assess the role of 11 newly identified UC risk variants, derived from a recent European UC genome wide association study (GWAS) (Franke et al., 2010), for 1) association with UC in the Nordic countries, 2) for population heterogeneity between the Nordic countries and the rest of Europe, and, 3) eventually, to drive some of the previous findings towards overall genome-wide significance. Methods Eleven SNPs were replicated in a Danish sample consisting of 560 UC patients and 796 controls and nine missing SNPs of the German GWAS study were successfully genotyped in the Baltic sample comprising 441 UC cases and 1156 controls. The independent replication data was then jointly analysed with the original data and systematic comparisons of the findings between ethnicities were made. Pearson's χ², Breslow-Day (BD) and Cochran-Mantel-Haenszel (CMH) tests were used for association analyses and heterogeneity testing. Results The rs5771069 (IL17REL) SNP was not associated with UC in the Danish panel. The rs5771069 (IL17REL) SNP was significantly associated with UC in the combined Baltic, Danish and Norwegian UC study sample driven by the Norwegian panel (OR = 0.89, 95% CI: 0.79-0.98, P = 0.02). No association was found between rs7809799 (SMURF1/KPNA7) and UC (OR = 1.20, 95% CI: 0.95-1.52, P = 0.10) or between UC and all other remaining SNPs. We had 94% chance of detecting an association for rs7809799 (SMURF1/KPNA7) in the combined replication sample, whereas the power was 55% or lower for the remaining SNPs. Statistically significant PBD was found for OR heterogeneity between the combined Baltic, Danish, and Norwegian panel versus the combined German, British, Belgian, and Greek panel (rs7520292 (P = 0.001), rs12518307 (P = 0.007), and rs2395609 (TCP11) (P = 0
Lee, Katherine J; Carlin, John B
Multiple imputation (MI) is becoming increasingly popular for handling missing data. Standard approaches for MI assume normality for continuous variables (conditionally on the other variables in the imputation model). However, it is unclear how to impute non-normally distributed continuous variables. Using simulation and a case study, we compared various transformations applied prior to imputation, including a novel non-parametric transformation, to imputation on the raw scale and using predictive mean matching (PMM) when imputing non-normal data. We generated data from a range of non-normal distributions, and set 50% to missing completely at random or missing at random. We then imputed missing values on the raw scale, following a zero-skewness log, Box-Cox or non-parametric transformation and using PMM with both type 1 and 2 matching. We compared inferences regarding the marginal mean of the incomplete variable and the association with a fully observed outcome. We also compared results from these approaches in the analysis of depression and anxiety symptoms in parents of very preterm compared with term-born infants. The results provide novel empirical evidence that the decision regarding how to impute a non-normal variable should be based on the nature of the relationship between the variables of interest. If the relationship is linear in the untransformed scale, transformation can introduce bias irrespective of the transformation used. However, if the relationship is non-linear, it may be important to transform the variable to accurately capture this relationship. A useful alternative is to impute the variable using PMM with type 1 matching. Copyright © 2016 John Wiley & Sons, Ltd.
Jose Antonio Lopez
Several imputation approaches using a large sample and different levels of censoring are compared and contrasted following a multiple imputation methodology. The study not only discusses these imputation approaches, but also quantifies differences in price variability before and after price imputation, evaluates the performance of each method, and estimates and compares parameters and elasticities from a complete demand system. The study’s findings reveal that small variability among the mean prices from the various imputation approaches may result in relatively larger variability among the underlying parameter estimates of interest and the ultimately desired measures. This suggests that selection bias may be avoided or reduced by validating the imputation approaches and choosing the imputation method based on an analysis of the ultimately desired measures.
Morris, Tim P; White, Ian R; Royston, Patrick
Multiple imputation is a commonly used method for handling incomplete covariates as it can provide valid inference when data are missing at random. This depends on being able to correctly specify the parametric model used to impute missing values, which may be difficult in many realistic settings. Imputation by predictive mean matching (PMM) borrows an observed value from a donor with a similar predictive mean; imputation by local residual draws (LRD) instead borrows the donor's residual. Both methods relax some assumptions of parametric imputation, promising greater robustness when the imputation model is misspecified. We review development of PMM and LRD and outline the various forms available, and aim to clarify some choices about how and when they should be used. We compare performance to fully parametric imputation in simulation studies, first when the imputation model is correctly specified and then when it is misspecified. In using PMM or LRD we strongly caution against using a single donor, the default value in some implementations, and instead advocate sampling from a pool of around 10 donors. We also clarify which matching metric is best. Among the current MI software there are several poor implementations. PMM and LRD may have a role for imputing covariates (i) which are not strongly associated with outcome, and (ii) when the imputation model is thought to be slightly but not grossly misspecified. Researchers should spend efforts on specifying the imputation model correctly, rather than expecting predictive mean matching or local residual draws to do the work.
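The donor-pool idea can be sketched as follows. This is a simplified illustration, not the implementation of any MI package and not one of the specific PMM variants the review compares: the predictive means are assumed to be computed elsewhere and passed in, the matching metric is plain absolute distance, and the names pmm_impute, donor_preds and donor_values are hypothetical. The default pool size of 10 follows the abstract's recommendation of around 10 donors.

```python
import random


def pmm_impute(pred_missing, donor_preds, donor_values, k=10, rng=None):
    """Impute one value by predictive mean matching (hypothetical helper).

    pred_missing: predictive mean for the case with the missing value
    donor_preds:  predictive means for cases with observed values
    donor_values: the corresponding observed values
    k:            size of the donor pool to sample from
    """
    if not donor_values or len(donor_preds) != len(donor_values):
        raise ValueError("donor predictions and values must be non-empty and aligned")
    if k < 1:
        raise ValueError("k must be at least 1")
    if rng is None:
        rng = random.Random()
    # Rank donors by how close their predictive mean is to the target's.
    ranked = sorted(range(len(donor_preds)),
                    key=lambda i: abs(donor_preds[i] - pred_missing))
    pool = ranked[:k]                      # the k nearest donors
    return donor_values[rng.choice(pool)]  # borrow one donor's observed value


if __name__ == "__main__":
    print(pmm_impute(2.1, [1.0, 2.0, 3.0, 10.0], [11, 12, 13, 20],
                     k=2, rng=random.Random(0)))
```

Setting k=1 reduces this to the single-donor behaviour the abstract cautions against; sampling from a larger pool restores variability between imputed datasets.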
Kerns, Sarah L.; Ostrer, Harry; Stock, Richard; Li, William; Moore, Julian; Pearlman, Alexander; Campbell, Christopher; Shao Yongzhao; Stone, Nelson; Kusnetz, Lynda; Rosenstein, Barry S.
Purpose: To identify single nucleotide polymorphisms (SNPs) associated with erectile dysfunction (ED) among African-American prostate cancer patients treated with external beam radiation therapy. Methods and Materials: A cohort of African-American prostate cancer patients treated with external beam radiation therapy was observed for the development of ED by use of the five-item Sexual Health Inventory for Men (SHIM) questionnaire. Final analysis included 27 cases (post-treatment SHIM score ≤7) and 52 control subjects (post-treatment SHIM score ≥16). A genome-wide association study was performed using approximately 909,000 SNPs genotyped on Affymetrix 6.0 arrays (Affymetrix, Santa Clara, CA). Results: We identified SNP rs2268363, located in the follicle-stimulating hormone receptor (FSHR) gene, as significantly associated with ED after correcting for multiple comparisons (unadjusted p = 5.46 × 10⁻⁸, Bonferroni p = 0.028). We identified four additional SNPs that tended toward a significant association with an unadjusted p value <10⁻⁶. Inference of population substructure showed that cases had a higher proportion of African ancestry than control subjects (77% vs. 60%, p = 0.005). A multivariate logistic regression model that incorporated estimated ancestry and four of the top-ranked SNPs was a more accurate classifier of ED than a model that included only clinical variables. Conclusions: To our knowledge, this is the first genome-wide association study to identify SNPs associated with adverse effects resulting from radiotherapy. It is important to note that the SNP that proved to be significantly associated with ED is located within a gene whose encoded product plays a role in male gonad development and function. Another key finding of this project is that the four SNPs most strongly associated with ED were specific to persons of African ancestry and would therefore not have been identified had a cohort of European ancestry been screened. This study demonstrates
Maggie C Y Ng
Genome-wide association studies (GWAS) have identified >300 loci associated with measures of adiposity including body mass index (BMI) and waist-to-hip ratio (adjusted for BMI, WHRadjBMI), but few have been identified through screening of the African ancestry genomes. We performed large scale meta-analyses and replications in up to 52,895 individuals for BMI and up to 23,095 individuals for WHRadjBMI from the African Ancestry Anthropometry Genetics Consortium (AAAGC) using 1000 Genomes phase 1 imputed GWAS to improve coverage of both common and low frequency variants in the low linkage disequilibrium African ancestry genomes. In the sex-combined analyses, we identified one novel locus (TCF7L2/HABP2) for WHRadjBMI and eight previously established loci at P < 5×10⁻⁸: seven for BMI, and one for WHRadjBMI in African ancestry individuals. An additional novel locus (SPRYD7/DLEU2) was identified for WHRadjBMI when combined with European GWAS. In the sex-stratified analyses, we identified three novel loci for BMI (INTS10/LPL and MLC1 in men, IRX4/IRX2 in women) and four for WHRadjBMI (SSX2IP, CASC8, PDE3B and ZDHHC1/HSD11B2 in women) in individuals of African ancestry or both African and European ancestry. For four of the novel variants, the minor allele frequency was low (<5%). In the trans-ethnic fine mapping of 47 BMI loci and 27 WHRadjBMI loci that were locus-wide significant (P < 0.05 adjusted for effective number of variants per locus) from the African ancestry sex-combined and sex-stratified analyses, 26 BMI loci and 17 WHRadjBMI loci contained ≤ 20 variants in the credible sets that jointly account for 99% posterior probability of driving the associations. The lead variants in 13 of these loci had a high probability of being causal. As compared to our previous HapMap imputed GWAS for BMI and WHRadjBMI including up to 71,412 and 27,350 African ancestry individuals, respectively, our results suggest that 1000 Genomes imputation showed modest improvement
Ng, Maggie C Y; Graff, Mariaelisa; Lu, Yingchang; Justice, Anne E; Mudgal, Poorva; Liu, Ching-Ti; Young, Kristin; Yanek, Lisa R; Feitosa, Mary F; Wojczynski, Mary K; Rand, Kristin; Brody, Jennifer A; Cade, Brian E; Dimitrov, Latchezar; Duan, Qing; Guo, Xiuqing; Lange, Leslie A; Nalls, Michael A; Okut, Hayrettin; Tajuddin, Salman M; Tayo, Bamidele O; Vedantam, Sailaja; Bradfield, Jonathan P; Chen, Guanjie; Chen, Wei-Min; Chesi, Alessandra; Irvin, Marguerite R; Padhukasahasram, Badri; Smith, Jennifer A; Zheng, Wei; Allison, Matthew A; Ambrosone, Christine B; Bandera, Elisa V; Bartz, Traci M; Berndt, Sonja I; Bernstein, Leslie; Blot, William J; Bottinger, Erwin P; Carpten, John; Chanock, Stephen J; Chen, Yii-Der Ida; Conti, David V; Cooper, Richard S; Fornage, Myriam; Freedman, Barry I; Garcia, Melissa; Goodman, Phyllis J; Hsu, Yu-Han H; Hu, Jennifer; Huff, Chad D; Ingles, Sue A; John, Esther M; Kittles, Rick; Klein, Eric; Li, Jin; McKnight, Barbara; Nayak, Uma; Nemesure, Barbara; Ogunniyi, Adesola; Olshan, Andrew; Press, Michael F; Rohde, Rebecca; Rybicki, Benjamin A; Salako, Babatunde; Sanderson, Maureen; Shao, Yaming; Siscovick, David S; Stanford, Janet L; Stevens, Victoria L; Stram, Alex; Strom, Sara S; Vaidya, Dhananjay; Witte, John S; Yao, Jie; Zhu, Xiaofeng; Ziegler, Regina G; Zonderman, Alan B; Adeyemo, Adebowale; Ambs, Stefan; Cushman, Mary; Faul, Jessica D; Hakonarson, Hakon; Levin, Albert M; Nathanson, Katherine L; Ware, Erin B; Weir, David R; Zhao, Wei; Zhi, Degui; Arnett, Donna K; Grant, Struan F A; Kardia, Sharon L R; Oloapde, Olufunmilayo I; Rao, D C; Rotimi, Charles N; Sale, Michele M; Williams, L Keoki; Zemel, Babette S; Becker, Diane M; Borecki, Ingrid B; Evans, Michele K; Harris, Tamara B; Hirschhorn, Joel N; Li, Yun; Patel, Sanjay R; Psaty, Bruce M; Rotter, Jerome I; Wilson, James G; Bowden, Donald W; Cupples, L Adrienne; Haiman, Christopher A; Loos, Ruth J F; North, Kari E
Genome-wide association studies (GWAS) have identified >300 loci associated with measures of adiposity including body mass index (BMI) and waist-to-hip ratio (adjusted for BMI, WHRadjBMI), but few have been identified through screening of the African ancestry genomes. We performed large scale meta-analyses and replications in up to 52,895 individuals for BMI and up to 23,095 individuals for WHRadjBMI from the African Ancestry Anthropometry Genetics Consortium (AAAGC) using 1000 Genomes phase 1 imputed GWAS to improve coverage of both common and low frequency variants in the low linkage disequilibrium African ancestry genomes. In the sex-combined analyses, we identified one novel locus (TCF7L2/HABP2) for WHRadjBMI and eight previously established loci at P < 5×10⁻⁸: seven for BMI, and one for WHRadjBMI in African ancestry individuals. An additional novel locus (SPRYD7/DLEU2) was identified for WHRadjBMI when combined with European GWAS. In the sex-stratified analyses, we identified three novel loci for BMI (INTS10/LPL and MLC1 in men, IRX4/IRX2 in women) and four for WHRadjBMI (SSX2IP, CASC8, PDE3B and ZDHHC1/HSD11B2 in women) in individuals of African ancestry or both African and European ancestry. For four of the novel variants, the minor allele frequency was low (<5%). In the trans-ethnic fine mapping of 47 BMI loci and 27 WHRadjBMI loci that were locus-wide significant (P < 0.05 adjusted for effective number of variants per locus) from the African ancestry sex-combined and sex-stratified analyses, 26 BMI loci and 17 WHRadjBMI loci contained ≤ 20 variants in the credible sets that jointly account for 99% posterior probability of driving the associations. The lead variants in 13 of these loci had a high probability of being causal. As compared to our previous HapMap imputed GWAS for BMI and WHRadjBMI including up to 71,412 and 27,350 African ancestry individuals, respectively, our results suggest that 1000 Genomes imputation showed modest improvement in identifying GWAS loci including low frequency variants. Trans-ethnic meta-analyses further improved fine mapping of putative causal variants in loci shared between the African and European ancestry populations.
Wang, Jiu-Yao; Liou, Ya-Huei; Wu, Ying-Jye; Hsiao, Ya-Hsin; Wu, Lawrence Shih-Hsin
Asthma is one of the most common chronic diseases in children. It is caused by complex interactions between various genetic factors and exposures to environmental allergens and irritants. Because of the heterogeneity of the disease and the genetic and cultural differences among different populations, a proper association study and genetic testing for asthma and susceptibility genes is difficult to perform. We assessed 13 single-nucleotide polymorphisms (SNPs) in seven well-known asthma susceptibility genes and looked for association with pediatric asthma using 449 asthmatic subjects and 512 non-asthma subjects in a Taiwanese population. CD14-159 C/T and MS4A2 Glu237Gly were found to differ in genotype/allele frequencies between the control group and asthma patients. Moreover, the genotype synergistic analysis showed that the co-contribution of two functional SNPs was riskier or more protective from asthma attack. Our study provided a genotype synergistic method for studying gene-gene interaction on a polymorphism basis and genetic testing using multiple polymorphisms.
Full Text Available DNA sequence variation within human leukocyte antigen (HLA genes mediate susceptibility to a wide range of human diseases. The complex genetic structure of the major histocompatibility complex (MHC makes it difficult, however, to collect genotyping data in large cohorts. Long-range linkage disequilibrium between HLA loci and SNP markers across the major histocompatibility complex (MHC region offers an alternative approach through imputation to interrogate HLA variation in existing GWAS data sets. Here we describe a computational strategy, SNP2HLA, to impute classical alleles and amino acid polymorphisms at class I (HLA-A, -B, -C and class II (-DPA1, -DPB1, -DQA1, -DQB1, and -DRB1 loci. To characterize performance of SNP2HLA, we constructed two European ancestry reference panels, one based on data collected in HapMap-CEPH pedigrees (90 individuals and another based on data collected by the Type 1 Diabetes Genetics Consortium (T1DGC, 5,225 individuals. We imputed HLA alleles in an independent data set from the British 1958 Birth Cohort (N = 918 with gold standard four-digit HLA types and SNPs genotyped using the Affymetrix GeneChip 500 K and Illumina Immunochip microarrays. We demonstrate that the sample size of the reference panel, rather than SNP density of the genotyping platform, is critical to achieve high imputation accuracy. Using the larger T1DGC reference panel, the average accuracy at four-digit resolution is 94.7% using the low-density Affymetrix GeneChip 500 K, and 96.7% using the high-density Illumina Immunochip. For amino acid polymorphisms within HLA genes, we achieve 98.6% and 99.3% accuracy using the Affymetrix GeneChip 500 K and Illumina Immunochip, respectively. Finally, we demonstrate how imputation and association testing at amino acid resolution can facilitate fine-mapping of primary MHC association signals, giving a specific example from type 1 diabetes.
Liu, Jason Z; Tozzi, Federica; Waterworth, Dawn M; Pillai, Sreekumar G; Muglia, Pierandrea; Middleton, Lefkos; Berrettini, Wade; Knouff, Christopher W; Yuan, Xin; Waeber, Gérard; Vollenweider, Peter; Preisig, Martin; Wareham, Nicholas J; Zhao, Jing Hua; Loos, Ruth J F; Barroso, Inês; Khaw, Kay-Tee; Grundy, Scott; Barter, Philip; Mahley, Robert; Kesaniemi, Antero; McPherson, Ruth; Vincent, John B; Strauss, John; Kennedy, James L; Farmer, Anne; McGuffin, Peter; Day, Richard; Matthews, Keith; Bakke, Per; Gulsvik, Amund; Lucae, Susanne; Ising, Marcus; Brueckl, Tanja; Horstmann, Sonja; Wichmann, H-Erich; Rawal, Rajesh; Dahmen, Norbert; Lamina, Claudia; Polasek, Ozren; Zgaga, Lina; Huffman, Jennifer; Campbell, Susan; Kooner, Jaspal; Chambers, John C; Burnett, Mary Susan; Devaney, Joseph M; Pichard, Augusto D; Kent, Kenneth M; Satler, Lowell; Lindsay, Joseph M; Waksman, Ron; Epstein, Stephen; Wilson, James F; Wild, Sarah H; Campbell, Harry; Vitart, Veronique; Reilly, Muredach P; Li, Mingyao; Qu, Liming; Wilensky, Robert; Matthai, William; Hakonarson, Hakon H; Rader, Daniel J; Franke, Andre; Wittig, Michael; Schäfer, Arne; Uda, Manuela; Terracciano, Antonio; Xiao, Xiangjun; Busonero, Fabio; Scheet, Paul; Schlessinger, David; St Clair, David; Rujescu, Dan; Abecasis, Gonçalo R; Grabe, Hans Jörgen; Teumer, Alexander; Völzke, Henry; Petersmann, Astrid; John, Ulrich; Rudan, Igor; Hayward, Caroline; Wright, Alan F; Kolcic, Ivana; Wright, Benjamin J; Thompson, John R; Balmforth, Anthony J; Hall, Alistair S; Samani, Nilesh J; Anderson, Carl A; Ahmad, Tariq; Mathew, Christopher G; Parkes, Miles; Satsangi, Jack; Caulfield, Mark; Munroe, Patricia B; Farrall, Martin; Dominiczak, Anna; Worthington, Jane; Thomson, Wendy; Eyre, Steve; Barton, Anne; Mooser, Vincent; Francks, Clyde; Marchini, Jonathan
Smoking is a leading global cause of disease and mortality. We established the Oxford-GlaxoSmithKline study (Ox-GSK) to perform a genome-wide meta-analysis of SNP association with smoking-related behavioral traits. Our final data set included 41,150 individuals drawn from 20 disease, population and control cohorts. Our analysis confirmed an effect on smoking quantity at a locus on 15q25 (P = 9.45 × 10^-19) that includes CHRNA5, CHRNA3 and CHRNB4, three genes encoding neuronal nicotinic acetylcholine receptor subunits. We used data from the 1000 Genomes project to investigate the region using imputation, which allowed for analysis of virtually all common SNPs in the region and offered a fivefold increase in marker density over HapMap2 (ref. 2) as an imputation reference panel. Our fine-mapping approach identified a SNP showing the highest significance, rs55853698, located within the promoter region of CHRNA5. Conditional analysis also identified a secondary locus (rs6495308) in CHRNA3.
Barnett, Gillian C.; Coles, Charlotte E.; Burnet, Neil G.; Pharoah, Paul D.P.; Wilkinson, Jennifer; West, Catharine M.L.; Elliott, Rebecca M.; Baynes, Caroline; Dunning, Alison M.
Background and purpose: Several small studies have reported associations between TGFB1 single nucleotide polymorphisms (SNPs), considered to increase secretion of TGF-β1, and greater than 3-fold increases in incidence of fibrosis - an indicator of late toxicity after radiotherapy in breast cancer patients. Materials and methods: Two SNPs in TGFB1, C-509T (rs1800469) and L10P (rs1800470), were genotyped in 778 breast cancer patients who had received radiotherapy to the breast. Late radiotherapy toxicity was assessed two years after radiotherapy using a validated photographic technique, clinical assessment and patient questionnaires. Results: On photographic assessment, 210 (27%) patients showed some degree of breast shrinkage, whilst 45 (6%) patients showed marked breast shrinkage. There was no significant association of genotype at either of the TGFB1 SNPs with any measure of late radiation toxicity. Conclusion: This adequately powered trial failed to confirm previously reported increases in fibrosis with TGFB1 genotype - any increase greater than 1.36 can be excluded with 95% confidence. Similar frequent failures to replicate associations with candidate genes have been resolved using genome-wide association scans: this methodology detects common, low risk alleles but requires even larger patient numbers for adequate statistical power.
Time series data are common in medical research. Many laboratory variables or study endpoints can be measured repeatedly over time. Multiple imputation (MI) that ignores the time trend of a variable may produce unreliable imputed values. The article illustrates how to perform MI using the Amelia package in a clinical scenario. The Amelia package is powerful in that it allows MI for time series data. External information on the variable of interest can also be incorporated by using the prior or bound arguments. Such information may be based on previously published observations, academic consensus, and personal experience. Diagnostics of the imputation model can be performed by examining the distributions of imputed and observed values, or by using the over-imputation technique.
The National Ambulatory Medical Care Survey collects data on office-based physician care from a nationally representative, multistage sampling scheme where the ultimate unit of analysis is a patient-doctor encounter. Patient race, a commonly analyzed demographic, has been subject to a steadily increasing item nonresponse rate. In 1999, race was missing for 17 percent of cases; by 2008, that figure had risen to 33 percent. Over this entire period, single imputation has been the compensation method employed. Recent research at the National Center for Health Statistics evaluated multiply imputing race to better represent the missing-data uncertainty. Given item nonresponse rates of 30 percent or greater, we were surprised to find many estimates’ ratios of multiple-imputation to single-imputation estimated standard errors close to 1. A likely explanation is that the design effects attributable to the complex sample design largely outweigh any increase in variance attributable to missing-data uncertainty.
A practical guide to analysing partially observed data. Collecting, analysing and drawing inferences from data is central to research in the medical and social sciences. Unfortunately, it is rarely possible to collect all the intended data. The literature on inference from the resulting incomplete data is now huge, and continues to grow both as methods are developed for large and complex data structures, and as increasing computer power and suitable software enable researchers to apply these methods. This book focuses on a particular statistical method for analysing and drawing inferences from incomplete data, called Multiple Imputation (MI). MI is attractive because it is both practical and widely applicable. The authors' aim is to clarify the issues raised by missing data, describing the rationale for MI, the relationship between the various imputation models and associated algorithms and its application to increasingly complex data structures. Multiple Imputation and its Application: Discusses the issues ...
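The standard-error comparison described above can be illustrated with Rubin's combining rules. This is a minimal sketch under stated assumptions: the five point estimates and their design-based variances are made-up numbers, not survey results, `pool_rubin` is a hypothetical helper name, and the within-imputation variance is used as a stand-in for what a single-imputation analysis would report.

```python
from math import sqrt
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Combine M completed-data analyses with Rubin's rules."""
    m = len(estimates)
    q_bar = mean(estimates)          # pooled point estimate
    w = mean(variances)              # within-imputation variance
    b = variance(estimates)          # between-imputation variance (divisor M - 1)
    t = w + (1 + 1 / m) * b          # total variance
    return q_bar, w, b, t


# Hypothetical inputs: one estimate and one design-based variance per imputation.
estimates = [0.30, 0.32, 0.31, 0.29, 0.33]
variances = [0.0100, 0.0102, 0.0098, 0.0101, 0.0099]

q_bar, w, b, t = pool_rubin(estimates, variances)

# Ratio of the multiple-imputation SE to the within-imputation SE.
se_ratio = sqrt(t) / sqrt(w)
print(round(q_bar, 4), round(t, 6), round(se_ratio, 4))
```

When the within-imputation variance (here driven by the assumed design-based variances) is large relative to the between-imputation variance, the ratio stays close to 1, which is consistent with the explanation offered in the abstract.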
Schomaker, Michael; Heumann, Christian
Many modern estimators require bootstrapping to calculate confidence intervals because either no analytic standard error is available or the distribution of the parameter of interest is nonsymmetric. It remains however unclear how to obtain valid bootstrap inference when dealing with multiple imputation to address missing data. We present 4 methods that are intuitively appealing, easy to implement, and combine bootstrap estimation with multiple imputation. We show that 3 of the 4 approaches yield valid inference, but that the performance of the methods varies with respect to the number of imputed data sets and the extent of missingness. Simulation studies reveal the behavior of our approaches in finite samples. A topical analysis from HIV treatment research, which determines the optimal timing of antiretroviral treatment initiation in young children, demonstrates the practical implications of the 4 methods in a sophisticated and realistic setting. This analysis suffers from missing data and uses the g-formula for inference, a method for which no standard errors are available. Copyright © 2018 John Wiley & Sons, Ltd.
van Buuren, Stef
Missing data form a problem in every scientific discipline, yet the techniques required to handle them are complicated and often lacking. One of the great ideas in statistical science--multiple imputation--fills gaps in the data with plausible values, the uncertainty of which is coded in the data itself. It also solves other problems, many of which are missing data problems in disguise. Flexible Imputation of Missing Data is supported by many examples using real data taken from the author's vast experience of collaborative research, and presents a practical guide for handling missing data unde
Gedikoglu, Haluk; Parcell, Joseph L.
Missing data is a problem that occurs frequently in survey data. Missing data results in biased estimates and reduced efficiency for regression estimates. The objective of the current study is to analyze the impact of missing-data imputation, using multiple-imputation methods, on regression estimates for agricultural household surveys. The current study also analyzes the impact of multiple-imputation on regression results, when all the variables in the regression have missing observations. Fi...
Vink, G.; Frank, L.E.; Pannekoek, J.; Buuren, S. van
Multiple imputation methods properly account for the uncertainty of missing data. One of those methods for creating multiple imputations is predictive mean matching (PMM), a general purpose method. Little is known about the performance of PMM in imputing non-normal semicontinuous data (skewed data
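As a rough illustration of how predictive mean matching works, the sketch below imputes each missing value by borrowing the observed value of a donor whose predicted mean is close. It is a simplified version under stated assumptions: a single predictor, ordinary least squares without the parameter draw that a proper multiple-imputation implementation would add, and toy data. The names `fit_line` and `pmm_impute` are hypothetical.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope


def pmm_impute(xs, ys, k=3, seed=0):
    """Fill None entries of ys with observed values from the k nearest donors."""
    rng = random.Random(seed)
    observed = [(x, y) for x, y in zip(xs, ys) if y is not None]
    intercept, slope = fit_line([x for x, _ in observed], [y for _, y in observed])
    # Each donor is (predicted mean, observed value).
    donors = [(intercept + slope * x, y) for x, y in observed]
    completed = []
    for x, y in zip(xs, ys):
        if y is not None:
            completed.append(y)
            continue
        target = intercept + slope * x
        nearest = sorted(donors, key=lambda d: abs(d[0] - target))[:k]
        completed.append(rng.choice(nearest)[1])
    return completed


# Toy semicontinuous outcome: a point mass at zero plus a skewed positive part.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0.0, 0.0, 1.2, None, 3.5, 0.0, None, 9.8]
completed = pmm_impute(xs, ys)
print(completed)
```

Because imputed values are always copied from observed cases, they stay within the observed range and can reproduce the spike at zero, which is one reason the method is of interest for semicontinuous data.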
Chen, Zhanghua; Pereira, Mark A; Seielstad, Mark; Koh, Woon-Puay; Tai, E Shyong; Teo, Yik-Ying; Liu, Jianjun; Hsu, Chris; Wang, Renwei; Odegaard, Andrew O; Thyagarajan, Bharat; Koratkar, Revati; Yuan, Jian-Min; Gross, Myron D; Stram, Daniel O
Genome-wide association studies (GWAS) have identified genetic factors in type 2 diabetes (T2D), mostly among individuals of European ancestry. We tested whether previously identified T2D-associated single nucleotide polymorphisms (SNPs) replicate and whether SNPs in regions near known T2D SNPs were associated with T2D within the Singapore Chinese Health Study. 2338 cases and 2339 T2D controls from the Singapore Chinese Health Study were genotyped for 507,509 SNPs. Imputation extended the genotyped SNPs to 7,514,461 with high estimated certainty (r^2 > 0.8). Replication of known index SNP associations in T2D was attempted. Risk scores were computed as the sum of index risk alleles. SNPs in regions ± 100 kb around each index were tested for associations with T2D in conditional fine-mapping analysis. Of 69 index SNPs, 20 were genotyped directly and genotypes at 35 others were well imputed. Among the 55 SNPs with data, disease associations were replicated (at pSingapore is explained by these SNPs. While diabetes risk in Singapore Chinese involves genetic variants, most disease risk remains unexplained. Further genetic work is ongoing in the Singapore Chinese population to identify unique common variants not already seen in earlier studies. However rapid increases in T2D risk have occurred in recent decades in this population, indicating that dynamic environmental influences and possibly gene by environment interactions complicate the genetic architecture of this disease.
We propose a new methodology for multiple imputation when faced with missing data in multi-environmental trials with genotype-by-environment interaction, based on the imputation system developed by Krzanowski that uses the singular value decomposition (SVD) of a matrix. Several different iterative variants are described; differential weights can also be included in each variant to represent the influence of different components of SVD in the imputation process. The methods are compared through a simulation study based on three real data matrices that have values deleted randomly at different percentages, using as measure of overall accuracy a combination of the variance between imputations and their mean square deviations relative to the deleted values. The best results are shown by two of the iterative schemes that use weights belonging to the interval [0.75, 1]. These schemes provide imputations that have higher quality when compared with other multiple imputation methods based on the Krzanowski method.
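The unweighted risk score mentioned above (the sum of index risk alleles) can be sketched as follows. The SNP labels and dosages are hypothetical placeholders, and allowing fractional dosages for imputed SNPs is an assumption of this sketch rather than a detail stated in the abstract.

```python
def risk_score(risk_allele_dosages):
    """Sum risk-allele dosages (0 to 2 per SNP) across index SNPs."""
    for snp, dosage in risk_allele_dosages.items():
        if not 0 <= dosage <= 2:
            raise ValueError(f"dosage for {snp} must lie in [0, 2], got {dosage}")
    return sum(risk_allele_dosages.values())


# Hypothetical individual: three genotyped SNPs and one imputed dosage.
person = {"snp_a": 2, "snp_b": 1, "snp_c": 0, "snp_d": 1.8}
score = risk_score(person)
print(score)
```

An unweighted sum treats every index SNP as contributing equally; a weighted variant would multiply each dosage by an effect size, which is not shown here.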
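To give a sense of how an SVD can drive imputation, the sketch below implements a generic iterative low-rank fill. It is not the Krzanowski scheme or any of the weighted variants described above; it is a simpler, commonly used stand-in that assumes NumPy is available, and `svd_impute` and the toy matrix are hypothetical.

```python
import numpy as np


def svd_impute(x, rank=1, max_iter=500, tol=1e-10):
    """Fill NaN cells by repeatedly projecting onto a rank-`rank` SVD approximation."""
    x = np.asarray(x, dtype=float)
    missing = np.isnan(x)
    filled = x.copy()
    if not missing.any():
        return filled
    # Start from the column means of the observed entries.
    col_means = np.nanmean(x, axis=0)
    filled[missing] = np.take(col_means, np.where(missing)[1])
    for _ in range(max_iter):
        u, s, vt = np.linalg.svd(filled, full_matrices=False)
        low_rank = (u[:, :rank] * s[:rank]) @ vt[:rank, :]
        change = np.max(np.abs(low_rank[missing] - filled[missing]))
        # Only the missing cells are updated; observed cells are never altered.
        filled[missing] = low_rank[missing]
        if change < tol:
            break
    return filled


# Toy genotype-by-environment table with exact rank 1 and one deleted cell.
truth = np.outer([1.0, 2.0, 3.0], [1.0, 2.0, 4.0])
data = truth.copy()
data[0, 0] = np.nan
completed = svd_impute(data, rank=1)
print(completed[0, 0])
```

On this noise-free rank-1 example the deleted cell is recovered almost exactly; with real trial data the choice of rank (and, in the weighted variants, of the weights) governs the quality of the imputations.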
Frischknecht, Mirjam; Pausch, Hubert; Bapst, Beat; Signer-Hasler, Heidi; Flury, Christine; Garrick, Dorian; Stricker, Christian; Fries, Ruedi; Gredler-Grandl, Birgit
Within the last few years a large amount of genomic information has become available in cattle. Densities of genomic information vary from a few thousand variants up to whole genome sequence information. In order to combine genomic information from different sources and infer genotypes for a common set of variants, genotype imputation is required. In this study we evaluated the accuracy of imputation from high density chips to whole genome sequence data in Brown Swiss cattle. Using four popular imputation programs (Beagle, FImpute, Impute2, Minimac) and various compositions of reference panels, the accuracy of the imputed sequence variant genotypes was high and differences between the programs and scenarios were small. We imputed sequence variant genotypes for more than 1600 Brown Swiss bulls and performed genome-wide association studies for milk fat percentage at two stages of lactation. We found one and three quantitative trait loci for early and late lactation fat content, respectively. Known causal variants that were imputed from the sequenced reference panel were among the most significantly associated variants of the genome-wide association study. Our study demonstrates that whole-genome sequence information can be imputed at high accuracy in cattle populations. Using imputed sequence variant genotypes in genome-wide association studies may facilitate causal variant detection.
Harel, Ofer; Chung, Hwan; Miglioretti, Diana
Latent class regression (LCR) is a popular method for analyzing multiple categorical outcomes. While nonresponse to the manifest items is a common complication, inferences of LCR can be evaluated using maximum likelihood, multiple imputation, and two-stage multiple imputation. Under similar missing data assumptions, the estimates and variances from all three procedures are quite close. However, multiple imputation and two-stage multiple imputation can provide additional information: estimates for the rates of missing information. The methodology is illustrated using an example from a study on racial and ethnic disparities in breast cancer severity. © 2013 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
Chen, Wen-Pei; Hung, Che-Lun; Tsai, Suh-Jen Jane; Lin, Yaw-Ling
SNPs are the most abundant form of genetic variation among species, and association studies between complex diseases and SNPs or haplotypes have received great attention. However, these studies are restricted by the cost of genotyping all SNPs; thus, it is necessary to find smaller subsets, or tag SNPs, that represent the rest of the SNPs. Existing tag SNP selection algorithms are notoriously time-consuming. An efficient algorithm for tag SNP selection is presented and applied to the HapMap YRI data. The experimental results show that the proposed algorithm achieves better performance than existing tag SNP selection algorithms; in most cases, it is at least ten times faster than the existing methods. In many cases, when the redundant ratio of the block is high, the proposed algorithm can even be thousands of times faster than previously known methods. Tools and web services for haplotype block analysis, integrated with the Hadoop MapReduce framework, were also developed using the proposed algorithm as the computation kernel.
John A. Kershaw Jr
Background: A novel approach to modelling individual tree growth dynamics is proposed. The approach combines multiple imputation and copula sampling to produce a stochastic individual tree growth and yield projection system. Methods: The Nova Scotia, Canada permanent sample plot network is used as a case study to develop and test the modelling approach. Predictions from this model are compared to predictions from the Acadian variant of the Forest Vegetation Simulator, a widely used statistical individual tree growth and yield model. Results: Diameter and height growth rates were predicted with error rates consistent with those produced using statistical models. Mortality and ingrowth error rates were higher than those observed for diameter and height, but also were within the bounds produced by traditional approaches for predicting these rates. Ingrowth species composition was very poorly predicted. The model was capable of reproducing a wide range of stand dynamic trajectories and in some cases reproduced trajectories that the statistical model was incapable of reproducing. Conclusions: The model has potential to be used as a benchmarking tool for evaluating statistical and process models and may provide a mechanism to separate signal from noise and improve our ability to analyze and learn from large regional datasets that often have underlying flaws in sample design.
Deciphering important genes and pathways from incomplete gene expression data could facilitate a better understanding of cancer. Different imputation methods can be applied to estimate the missing values. In our study, we evaluated various imputation methods for their performance in preserving significant genes and pathways. In the first step, 5% genes are considered in random for two types of ignorable and non-ignorable missingness mechanisms with various missing rates. Next, 10 well-known imputation methods were applied to the complete datasets. The significance analysis of microarrays (SAM) method was applied to detect the significant genes in rectal and lung cancers to showcase the utility of imputation approaches in preserving significant genes. To determine the impact of different imputation methods on the identification of important genes, the chi-squared test was used to compare the proportions of overlaps between significant genes detected from original data and those detected from the imputed datasets. Additionally, the significant genes are tested for their enrichment in important pathways, using the ConsensusPathDB. Our results showed that almost all the significant genes and pathways of the original dataset can be detected in all imputed datasets, indicating that there is no significant difference in the performance of various imputation methods tested. The source code and selected datasets are available on http://profiles.bs.ipm.ir/softwares/imputation_methods/.
Lüdtke, Oliver; Robitzsch, Alexander; Grund, Simon
Multiple imputation is a widely recommended means of addressing the problem of missing data in psychological research. An often-neglected requirement of this approach is that the imputation model used to generate the imputed values must be at least as general as the analysis model. For multilevel designs in which lower level units (e.g., students) are nested within higher level units (e.g., classrooms), this means that the multilevel structure must be taken into account in the imputation model. In the present article, we compare different strategies for multiply imputing incomplete multilevel data using mathematical derivations and computer simulations. We show that ignoring the multilevel structure in the imputation may lead to substantial negative bias in estimates of intraclass correlations as well as biased estimates of regression coefficients in multilevel models. We also demonstrate that an ad hoc strategy that includes dummy indicators in the imputation model to represent the multilevel structure may be problematic under certain conditions (e.g., small groups, low intraclass correlations). Imputation based on a multivariate linear mixed effects model was the only strategy to produce valid inferences under most of the conditions investigated in the simulation study. Data from an educational psychology research project are also used to illustrate the impact of the various multiple imputation strategies. (PsycINFO Database Record (c) 2017 APA, all rights reserved).
Idris, N. R. N.; Abdullah, M. H.; Tolos, S. M.
A common method of handling the problem of missing variances in meta-analysis of continuous response is through imputation. However, the performance of imputation techniques may be influenced by the type of model utilised. In this article, we examine through a simulation study the effects of the techniques of imputation of the missing SDs and type of models used on the overall meta-analysis estimates. The results suggest that imputation should be adopted to estimate the overall effect size, irrespective of the model used. However, the accuracy of the estimates of the corresponding standard error (SE) is influenced by the imputation techniques. For estimates based on the fixed effects model, mean imputation provides better estimates than multiple imputations, while those based on the random effects model responds more robustly to the type of imputation techniques. The results showed that although imputation is good in reducing the bias in point estimates, it is more likely to produce coverage probability which is higher than the nominal value.
Aghdam, Rosa; Baghfalaki, Taban; Khosravi, Pegah; Saberi Ansari, Elnaz
Deciphering important genes and pathways from incomplete gene expression data could facilitate a better understanding of cancer. Different imputation methods can be applied to estimate the missing values. In our study, we evaluated various imputation methods for their performance in preserving significant genes and pathways. In the first step, 5% genes are considered in random for two types of ignorable and non-ignorable missingness mechanisms with various missing rates. Next, 10 well-known imputation methods were applied to the complete datasets. The significance analysis of microarrays (SAM) method was applied to detect the significant genes in rectal and lung cancers to showcase the utility of imputation approaches in preserving significant genes. To determine the impact of different imputation methods on the identification of important genes, the chi-squared test was used to compare the proportions of overlaps between significant genes detected from original data and those detected from the imputed datasets. Additionally, the significant genes are tested for their enrichment in important pathways, using the ConsensusPathDB. Our results showed that almost all the significant genes and pathways of the original dataset can be detected in all imputed datasets, indicating that there is no significant difference in the performance of various imputation methods tested. The source code and selected datasets are available on http://profiles.bs.ipm.ir/softwares/imputation_methods/. Copyright © 2017. Production and hosting by Elsevier B.V.
Kadengye, Damazo T; Cools, Wilfried; Ceulemans, Eva; Van den Noortgate, Wim
Missing data, such as item responses in multilevel data, are ubiquitous in educational research settings. Researchers in the item response theory (IRT) context have shown that ignoring such missing data can create problems in the estimation of the IRT model parameters. Consequently, several imputation methods for dealing with missing item data have been proposed and shown to be effective when applied with traditional IRT models. Additionally, a nonimputation direct likelihood analysis has been shown to be an effective tool for handling missing observations in clustered data settings. This study investigates the performance of six simple imputation methods, which have been found to be useful in other IRT contexts, versus a direct likelihood analysis, in multilevel data from educational settings. Multilevel item response data were simulated on the basis of two empirical data sets, and some of the item scores were deleted, such that they were missing either completely at random or simply at random. An explanatory IRT model was used for modeling the complete, incomplete, and imputed data sets. We showed that direct likelihood analysis of the incomplete data sets produced unbiased parameter estimates that were comparable to those from a complete data analysis. Multiple-imputation approaches of the two-way mean and corrected item mean substitution methods displayed varying degrees of effectiveness in imputing data that in turn could produce unbiased parameter estimates. The simple random imputation, adjusted random imputation, item means substitution, and regression imputation methods seemed to be less effective in imputing missing item scores in multilevel data settings.
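For readers unfamiliar with two-way mean substitution, the sketch below shows the usual textbook form, in which a missing score is replaced by person mean + item mean - overall mean, all computed from observed scores. This is an assumption about the method's standard definition rather than a reproduction of the study's implementation; `two_way_impute` and the toy response matrix are hypothetical.

```python
def two_way_impute(scores):
    """Replace None with person mean + item mean - overall mean of observed scores."""
    def observed_mean(values):
        present = [v for v in values if v is not None]
        return sum(present) / len(present)

    overall = observed_mean([v for row in scores for v in row])
    person_means = [observed_mean(row) for row in scores]
    item_means = [observed_mean(col) for col in zip(*scores)]
    return [
        [v if v is not None else person_means[i] + item_means[j] - overall
         for j, v in enumerate(row)]
        for i, row in enumerate(scores)
    ]


# Toy data: rows are persons, columns are binary items, None marks a missing score.
scores = [
    [1, 1, None],
    [1, 0, 1],
    [0, 0, 1],
]
completed = two_way_impute(scores)
print(completed[0][2])
```

The imputed value here exceeds 1, which shows why such deterministic substitutions for binary items are typically rounded, bounded, or perturbed with random error before analysis.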
Li, Wenzhi; Xu, Wei; Li, Qiling; Ma, Li; Song, Qing
Imputation is a powerful in silico approach for filling in missing values in large datasets. This process requires a reference panel, which is a collection of big data from which the missing information can be extracted and imputed. Haplotype imputation requires ethnicity-matched references; a mismatched reference panel will significantly reduce the quality of imputation. However, currently existing big datasets cover only a small number of ethnicities, and ethnicity-matched references are lacking for many ethnic populations in the world, which has hampered the imputation of haplotypes and its downstream applications. To solve this issue, several approaches have been proposed and explored, including the mixed reference panel, the internal reference panel and the genotype-converted reference panel. This review article describes and compares these approaches. Increasing evidence shows that gene activity and function are not dictated by just one or two genetic elements; instead, cis-interactions of multiple elements dictate gene activity. Cis-interactions require the interacting elements to be on the same chromosome molecule; therefore, haplotype analysis is essential for the investigation of cis-interactions among multiple genetic variants at different loci, and appears to be especially important for studying common diseases. It will be valuable in a wide spectrum of applications, from academic research to clinical diagnosis, prevention, treatment, and the pharmaceutical industry.
The package VIM (Templ, Alfons, Kowarik, and Prantner 2016) is developed to explore and analyze the structure of missing values in data using visualization methods, to impute these missing values with the built-in imputation methods and to verify the imputation process using visualization tools, as well as to produce high-quality graphics for publications. This article focuses on the different imputation techniques available in the package. Four different imputation methods are currently implemented in VIM, namely hot-deck imputation, k-nearest neighbor imputation, regression imputation and iterative robust model-based imputation (Templ, Kowarik, and Filzmoser 2011). All of these methods are implemented in a flexible manner with many options for customization. Furthermore, in this article practical examples are provided to highlight the use of the implemented methods on real-world applications. In addition, the graphical user interface of VIM has been re-implemented from scratch, resulting in the package VIMGUI (Schopfhauser, Templ, Alfons, Kowarik, and Prantner 2016), to enable users without extensive R skills to access these imputation and visualization methods.
De Silva, Anurika Priyanjali; Moreno-Betancur, Margarita; De Livera, Alysha Madhu; Lee, Katherine Jane; Simpson, Julie Anne
Missing data is a common problem in epidemiological studies, and is particularly prominent in longitudinal data, which involve multiple waves of data collection. Traditional multiple imputation (MI) methods (fully conditional specification (FCS) and multivariate normal imputation (MVNI)) treat repeated measurements of the same time-dependent variable as just another 'distinct' variable for imputation and therefore do not make the most of the longitudinal structure of the data. Only a few studies have explored extensions to the standard approaches to account for the temporal structure of longitudinal data. One suggestion is the two-fold fully conditional specification (two-fold FCS) algorithm, which restricts the imputation of a time-dependent variable to time blocks where the imputation model includes measurements taken at the specified and adjacent times. To date, no study has investigated the performance of two-fold FCS and standard MI methods for handling missing data in a time-varying covariate with a non-linear trajectory over time - a commonly encountered scenario in epidemiological studies. We simulated 1000 datasets of 5000 individuals based on the Longitudinal Study of Australian Children (LSAC). Three missing data mechanisms: missing completely at random (MCAR), and a weak and a strong missing at random (MAR) scenarios were used to impose missingness on body mass index (BMI) for age z-scores; a continuous time-varying exposure variable with a non-linear trajectory over time. We evaluated the performance of FCS, MVNI, and two-fold FCS for handling up to 50% of missing data when assessing the association between childhood obesity and sleep problems. The standard two-fold FCS produced slightly more biased and less precise estimates than FCS and MVNI. We observed slight improvements in bias and precision when using a time window width of two for the two-fold FCS algorithm compared to the standard width of one. We recommend the use of FCS or MVNI in a similar
Background: Single nucleotide polymorphisms (SNPs) are an abundant form of genetic variation in the genome of every species and are useful for gene mapping and association studies. Of particular interest are non-synonymous SNPs, which may alter protein function and phenotype. We therefore examined bovine expressed sequences for non-synonymous SNPs and validated and tested selected SNPs for their association with measured traits. Results: Over 500,000 public bovine expressed sequence tagged (EST) sequences were used to search for coding SNPs (cSNPs). A total of 15,353 SNPs were detected in the transcribed sequences studied, of which 6,325 were predicted to be coding SNPs with the remaining 9,028 SNPs presumed to be in untranslated regions. Of the cSNPs detected, 2,868 were predicted to result in a change in the amino acid encoded. In order to determine the actual number of non-synonymous polymorphic SNPs we designed assays for 920 of the putative SNPs. These SNPs were then genotyped through a panel of cattle DNA pools using chip-based MALDI-TOF mass spectrometry. Of the SNPs tested, 29% were found to be polymorphic with a minor allele frequency >10%. A subset of the SNPs was genotyped through animal resources in order to look for association with age of puberty, facial eczema resistance or meat yield. Three SNPs were nominally associated with resistance to the disease facial eczema (P ...). Conclusion: We have identified 15,353 putative SNPs in or close to bovine genes and 2,868 of these SNPs were predicted to be non-synonymous. Approximately 29% of the non-synonymous SNPs were polymorphic and common with a minor allele frequency >10%. Of the SNPs detected in this study, 99% have not been previously reported. These novel SNPs will be useful for association studies or gene mapping.
Tachmazidou, Ioanna; Süveges, Dániel; Min, Josine L
Deep sequence-based imputation can enhance the discovery power of genome-wide association studies by assessing previously unexplored variation across the common- and low-frequency spectra. We applied a hybrid whole-genome sequencing (WGS) and deep imputation approach to examine the broader alleli...
Red blood cell (RBC) traits are routinely measured in clinical practice as important markers of health. Deviations from the physiological ranges are usually a sign of disease, although variation between healthy individuals also occurs, at least partly due to genetic factors. Recent large scale genetic studies identified loci associated with one or more of these traits; further characterization of known loci and identification of new loci is necessary to better understand their role in health and disease and to identify potential molecular mechanisms. We performed meta-analysis of Metabochip association results for six RBC traits, hemoglobin concentration (Hb), hematocrit (Hct), mean corpuscular hemoglobin (MCH), mean corpuscular hemoglobin concentration (MCHC), mean corpuscular volume (MCV) and red blood cell count (RCC), in 11 093 Europeans from seven studies of the UCL-LSHTM-Edinburgh-Bristol (UCLEB) Consortium. We identified 394 non-overlapping SNPs in five loci at genome-wide significance: 6p22.1-6p21.33 (with HFE among others), 6q23.2 (with HBS1L among others), 6q23.3 (contains no genes), 9q34.3 (only ABO gene) and 22q13.1 (with TMPRSS6 among others), replicating previous findings of association with RBC traits at these loci and extending them by imputation to 1000 Genomes. We further characterized associations between ABO SNPs and three traits: hemoglobin, hematocrit and red blood cell count, replicating them in an independent cohort. Conditional analyses indicated the independent association of each of these traits with ABO SNPs and a role for blood group O in mediating the association. The 15 most significant RBC-associated ABO SNPs were also associated with five cardiometabolic traits, with discordance in the direction of effect between groups of traits, suggesting that ABO may act through more than one mechanism to influence cardiometabolic risk.
Kreiner-Møller, Eskil; Medina-Gomez, Carolina; Uitterlinden, André G
not being comprehensively scrutinized. Next-generation arrays ensuring sufficient coverage together with new reference panels, as the 1000 Genomes panel, are emerging to facilitate imputation of low frequent single-nucleotide polymorphisms (minor allele frequency (MAF) two-step......, the concordance rate between calls of imputed and true genotypes was found to be significantly higher for heterozygotes (Ptwo-step approach in our setting improves imputation quality compared with traditional direct imputation noteworthy...
Fernandes, R. C.; Lucio, P. S.; Fernandez, J. H.
Missing data in Galactic Cosmic Rays (GCR) time series are inevitable, since data are lost through mechanical and human failure, technical problems and the differing periods of operation of GCR stations. The aim of this study was to perform multiple dataset imputation in order to depict the observational dataset. The study used the monthly time series of GCR Climax (CLMX) and Roma (ROME) from 1960 to 2004 to simulate scenarios of 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80% and 90% of missing data compared to the observed ROME series, with 50 replicates. The CLMX station was then used as a proxy for allocating these scenarios. Three different methods for monthly dataset imputation were selected: AMÉLIA II, which runs the bootstrap Expectation Maximization algorithm; MICE, which runs an algorithm via Multivariate Imputation by Chained Equations; and MTSDI, an Expectation Maximization algorithm-based method for imputation of missing values in multivariate normal time series. The synthetic time series were compared with the observed ROME series and evaluated using several skill measures, such as RMSE, NRMSE, Agreement Index, R, R2, F-test and t-test. The results showed that for CLMX and ROME, the R2 and R statistics were equal to 0.98 and 0.96, respectively. Increasing the number of gaps was observed to reduce the quality of the time series. Data imputation was most efficient with the MTSDI method, with negligible errors and the best skill coefficients. The results suggest a limit of about 60% of missing data for imputation of monthly averages. It is noteworthy that the CLMX, ROME and KIEL stations present no missing data in the target period. This methodology allowed reconstructing 43 time series.
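The evaluation design in this abstract (mask a known fraction of a complete series, fill the gaps, then score the filled values against the truth) can be sketched as follows. This is only an illustration under stated assumptions: the series values are made up, the proxy series stands in for a real imputation method such as those named above, and only RMSE and Pearson's R are shown.

```python
import math
import random


def mask_values(series, fraction, seed=0):
    """Return a copy of `series` with roughly `fraction` of its entries set to None."""
    rng = random.Random(seed)
    n_missing = round(fraction * len(series))
    missing = set(rng.sample(range(len(series)), n_missing))
    return [None if i in missing else x for i, x in enumerate(series)]


def rmse(observed, estimated):
    """Root mean squared error between two equally long sequences."""
    squared = [(o - e) ** 2 for o, e in zip(observed, estimated)]
    return math.sqrt(sum(squared) / len(squared))


def pearson_r(x, y):
    """Pearson correlation coefficient R (R2 is its square)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical monthly values for a target station and a proxy station.
target = [10.0, 12.0, 11.0, 13.0, 12.5, 14.0, 13.5, 15.0, 14.5, 16.0]
proxy = [10.5, 11.5, 11.2, 12.8, 12.9, 13.6, 13.9, 14.7, 14.9, 15.6]

masked = mask_values(target, 0.3)            # 30% missing scenario
gaps = [i for i, v in enumerate(masked) if v is None]
truth = [target[i] for i in gaps]
filled = [proxy[i] for i in gaps]            # stand-in for a real imputation method
print(rmse(truth, filled), pearson_r(truth, filled))
```

Repeating the masking with different seeds would give the replicates per scenario that the abstract describes.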
Bondarenko, Irina; Raghunathan, Trivellore
Multiple imputation has become a popular approach for analyzing incomplete data. Many software packages are available to multiply impute the missing values and to analyze the resulting completed data sets. However, diagnostic tools to check the validity of the imputations are limited, and the majority of the currently available methods need considerable knowledge of the imputation model. In many practical settings, however, the imputer and the analyst may be different individuals or from different organizations, and the analyst model may or may not be congenial to the model used by the imputer. This article develops and evaluates a set of graphical and numerical diagnostic tools for two practical purposes: (i) for an analyst to determine whether the imputations are reasonable under his/her model assumptions without actually knowing the imputation model assumptions; and (ii) for an imputer to fine tune the imputation model by checking the key characteristics of the observed and imputed values. The tools are based on the numerical and graphical comparisons of the distributions of the observed and imputed values conditional on the propensity of response. The methodology is illustrated using simulated data sets created under a variety of scenarios. The examples focus on continuous and binary variables, but the principles can be used to extend methods for other types of variables. Copyright © 2016 John Wiley & Sons, Ltd.
Radi, Noor Fadhilah Ahmad; Zakaria, Roslinazairimah; Azman, Muhammad Az-zuhri
This study aims to estimate missing rainfall data, dividing the analysis into three percentages of missingness, namely 5%, 10% and 20%, in order to represent various cases of missing data. In practice, spatial interpolation methods are usually the first choice for estimating missing data. These methods include the normal ratio (NR), arithmetic average (AA), coefficient of correlation (CC) and inverse distance (ID) weighting methods. The methods consider the distance between the target and the neighbouring stations as well as the correlations between them. An alternative way of handling missing data is imputation, the process of replacing missing data with substituted values. A once-common method is single imputation, which allows parameter estimation. However, single imputation ignores the estimation of variability, which leads to the underestimation of standard errors and confidence intervals. To overcome this underestimation problem, multiple imputation is used, where each missing value is estimated with a distribution of imputations that reflects the uncertainty about the missing data. In this study, spatial interpolation methods and the multiple imputation method are compared for estimating missing rainfall data. The performance of the estimation methods is assessed using the similarity index (S-index), mean absolute error (MAE) and coefficient of correlation (R).
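Three of the spatial interpolation methods named in this abstract can be sketched in their common textbook forms. The abstract does not give the formulas the study used, so the definitions below (including the distance exponent of 2) are assumptions, and the station values are hypothetical.

```python
def arithmetic_average(neighbour_values):
    """AA: plain mean of the rainfall recorded at the neighbouring stations."""
    return sum(neighbour_values) / len(neighbour_values)


def normal_ratio(target_normal, neighbour_values, neighbour_normals):
    """NR: each neighbour is scaled by the ratio of long-term normals.

    Textbook form: P_x = (1/n) * sum_i (N_x / N_i) * P_i
    """
    n = len(neighbour_values)
    scaled = [
        target_normal / normal * value
        for value, normal in zip(neighbour_values, neighbour_normals)
    ]
    return sum(scaled) / n


def inverse_distance(neighbour_values, distances, power=2.0):
    """ID: weights proportional to 1 / distance**power (power=2 is an assumption)."""
    weights = [1.0 / d ** power for d in distances]
    weighted = sum(w * v for w, v in zip(weights, neighbour_values))
    return weighted / sum(weights)


# Hypothetical daily rainfall (mm) at three neighbours of a station with a gap.
values = [12.0, 8.0, 15.0]
normals = [2400.0, 2000.0, 2600.0]   # long-term annual normals (mm), made up
distances = [5.0, 12.0, 20.0]        # km to the target station, made up

print(arithmetic_average(values))
print(normal_ratio(2200.0, values, normals))
print(inverse_distance(values, distances))
```

The three estimators differ only in how they weight the neighbours, which is why studies of this kind can score them with a common set of error measures.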
Burgess, Stephen; White, Ian R; Resche-Rigon, Matthieu; Wood, Angela M
Multiple imputation is a strategy for the analysis of incomplete data such that the impact of the missingness on the power and bias of estimates is mitigated. When data from multiple studies are collated, we can propose both within-study and multilevel imputation models to impute missing data on covariates. It is not clear how to choose between imputation models or how to combine imputation and inverse-variance weighted meta-analysis methods. This is especially important as often different studies measure data on different variables, meaning that we may need to impute data on a variable which is systematically missing in a particular study. In this paper, we consider a simulation analysis of sporadically missing data in a single covariate with a linear analysis model and discuss how the results would be applicable to the case of systematically missing data. We find in this context that ensuring the congeniality of the imputation and analysis models is important to give correct standard errors and confidence intervals. For example, if the analysis model allows between-study heterogeneity of a parameter, then we should incorporate this heterogeneity into the imputation model to maintain the congeniality of the two models. In an inverse-variance weighted meta-analysis, we should impute missing data and apply Rubin's rules at the study level prior to meta-analysis, rather than meta-analyzing each of the multiple imputations and then combining the meta-analysis estimates using Rubin's rules. We illustrate the results using data from the Emerging Risk Factors Collaboration. © 2013 The Authors. Statistics in Medicine published by John Wiley & Sons Ltd. PMID:23703895
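The order of operations recommended in this abstract (Rubin's rules within each study first, inverse-variance weighting across studies second) can be sketched as below. This is a minimal illustration, not the authors' code: the fixed-effect form of the meta-analysis is an assumption, and the study names and numbers are invented.

```python
import math


def rubin_pool(estimates, variances):
    """Combine m imputation-specific estimates with Rubin's rules.

    Returns the pooled estimate and its total variance
    T = W + (1 + 1/m) * B, where W is the mean within-imputation variance
    and B is the between-imputation variance of the estimates.
    """
    m = len(estimates)
    q_bar = sum(estimates) / m
    within = sum(variances) / m
    between = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)
    return q_bar, within + (1 + 1 / m) * between


def inverse_variance_meta(estimates, variances):
    """Fixed-effect inverse-variance weighted meta-analysis (an assumed form)."""
    weights = [1 / v for v in variances]
    pooled = sum(w * q for w, q in zip(weights, estimates)) / sum(weights)
    return pooled, 1 / sum(weights)


# Hypothetical per-imputation (estimate, variance) results for two studies.
studies = {
    "study_A": ([0.20, 0.25, 0.22], [0.010, 0.012, 0.011]),
    "study_B": ([0.30, 0.28, 0.33], [0.020, 0.018, 0.021]),
}

# Step 1: Rubin's rules at the study level.
pooled = [rubin_pool(est, var) for est, var in studies.values()]

# Step 2: meta-analyze the study-level pooled estimates.
estimate, variance = inverse_variance_meta(
    [q for q, _ in pooled], [t for _, t in pooled]
)
print(estimate, math.sqrt(variance))
```

The alternative ordering the abstract advises against would call `inverse_variance_meta` once per imputation and pass those results to `rubin_pool`.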
Webb-Vargas, Yenny; Rudolph, Kara E; Lenis, David; Murakami, Peter; Stuart, Elizabeth A
Although covariate measurement error is likely the norm rather than the exception, methods for handling covariate measurement error in propensity score methods have not been widely investigated. We consider a multiple imputation-based approach that uses an external calibration sample with information on the true and mismeasured covariates, multiple imputation for external calibration, to correct for the measurement error, and investigate its performance using simulation studies. As expected, using the covariate measured with error leads to bias in the treatment effect estimate. In contrast, the multiple imputation for external calibration method can eliminate almost all the bias. We confirm that the outcome must be used in the imputation process to obtain good results, a finding related to the idea of congenial imputation and analysis in the broader multiple imputation literature. We illustrate the multiple imputation for external calibration approach using a motivating example estimating the effects of living in a disadvantaged neighborhood on mental health and substance use outcomes among adolescents. These results show that estimating the propensity score using covariates measured with error leads to biased estimates of treatment effects, but when a calibration data set is available, multiple imputation for external calibration can be used to help correct for such bias.
Deng, Yi; Chang, Changgee; Ido, Moges Seyoum; Long, Qi
Multiple imputation (MI) has been widely used for handling missing data in biomedical research. In the presence of high-dimensional data, regularized regression has been used as a natural strategy for building imputation models, but limited research has been conducted for handling general missing data patterns where multiple variables have missing values. Using the idea of multiple imputation by chained equations (MICE), we investigate two approaches of using regularized regression to impute missing values of high-dimensional data that can handle general missing data patterns. We compare our MICE methods with several existing imputation methods in simulation studies. Our simulation results demonstrate the superiority of the proposed MICE approach based on an indirect use of regularized regression in terms of bias. We further illustrate the proposed methods using two data examples.
Pappas, D J; Lizee, A; Paunic, V; Beutner, K R; Motyer, A; Vukcevic, D; Leslie, S; Biesiada, J; Meller, J; Taylor, K D; Zheng, X; Zhao, L P; Gourraud, P-A; Hollenbach, J A; Mack, S J; Maiers, M
Four single nucleotide polymorphism (SNP)-based human leukocyte antigen (HLA) imputation methods (e-HLA, HIBAG, HLA*IMP:02 and MAGPrediction) were trained using 1000 Genomes SNP and HLA genotypes and assessed for their ability to accurately impute molecular HLA-A, -B, -C and -DRB1 genotypes in the Human Genome Diversity Project cell panel. Imputation concordance was high (>89%) across all methods for both HLA-A and HLA-C, but HLA-B and HLA-DRB1 proved generally difficult to impute. Overall, <27.8% of subjects were correctly imputed for all HLA loci by any method. Concordance across all loci was not enhanced via the application of confidence thresholds; reliance on confidence scores across methods only led to noticeable improvement (+3.2%) for HLA-DRB1. As the HLA complex is highly relevant to the study of human health and disease, a standardized assessment of SNP-based HLA imputation methods is crucial for advancing genomic research. Considerable room remains for the improvement of HLA-B and especially HLA-DRB1 imputation methods, and no imputation method is as accurate as molecular genotyping. The application of large, ancestrally diverse HLA and SNP reference data sets and multiple imputation methods has the potential to make SNP-based HLA imputation methods a tractable option for determining HLA genotypes. The Pharmacogenomics Journal advance online publication, 25 April 2017; doi:10.1038/tpj.2017.7.
Van Ginkel, Joost R.
The performance of multiple imputation in questionnaire data has been studied in various simulation studies. However, in practice, questionnaire data are usually more complex than simulated data. For example, items may be counterindicative or may have unacceptably low factor loadings on every subscale, or completely missing subscales may…
Identification of novel SNPs of ABCD1, ABCD2, ABCD3, and ABCD4 genes in patients with X-linked adrenoleukodystrophy (ALD) based on comprehensive resequencing and association studies with ALD phenotypes.
Matsukawa, Takashi; Asheuer, Muriel; Takahashi, Yuji; Goto, Jun; Suzuki, Yasuyuki; Shimozawa, Nobuyuki; Takano, Hiroki; Onodera, Osamu; Nishizawa, Masatoyo; Aubourg, Patrick; Tsuji, Shoji
Adrenoleukodystrophy (ALD) is an X-linked disorder affecting primarily the white matter of the central nervous system occasionally accompanied by adrenal insufficiency. Despite the discovery of the causative gene, ABCD1, no clear genotype–phenotype correlations have been established. Association studies based on single nucleotide polymorphisms (SNPs) identified by comprehensive resequencing of genes related to ABCD1 may reveal genes modifying ALD phenotypes. We analyzed 40 Japanese patients with ALD. ABCD1 and ABCD2 were analyzed using a newly developed microarray-based resequencing system. ABCD3 and ABCD4 were analyzed by direct nucleotide sequence analysis. Replication studies were conducted on an independent French ALD cohort with extreme phenotypes. All the mutations of ABCD1 were identified, and there was no correlation between the genotypes and phenotypes of ALD. SNPs identified by the comprehensive resequencing of ABCD2, ABCD3, and ABCD4 were used for association studies. There were no significant associations between these SNPs and ALD phenotypes, except for the five SNPs of ABCD4, which are in complete disequilibrium in the Japanese population. These five SNPs were significantly less frequently represented in patients with adrenomyeloneuropathy (AMN) than in controls in the Japanese population (p = 0.0468), whereas there were no significant differences in patients with childhood cerebral ALD (CCALD). The replication study employing these five SNPs on an independent French ALD cohort, however, showed no significant associations with CCALD or pure AMN. This study showed that ABCD2, ABCD3, and ABCD4 are less likely the disease-modifying genes, necessitating further studies to identify genes modifying ALD phenotypes. Electronic supplementary material The online version of this article (doi:10.1007/s10048-010-0253-6) contains supplementary material, which is available to authorized users. PMID:20661612
... 16 Commercial Practices 2 2010-01-01 2010-01-01 false Imputed knowledge. 1115.11 Section 1115.11... PRODUCT HAZARD REPORTS General Interpretation § 1115.11 Imputed knowledge. (a) In evaluating whether or... care to ascertain the truth of complaints or other representations. This includes the knowledge a firm...
Chemerin is a novel adipokine that regulates adipogenesis and adipocyte metabolism via its own receptor. In this study, two novel SNPs (868A>G in exon 2 and 2692C>T in exon 5) of chemerin gene were identified by PCR-SSCP and DNA sequencing technology. The allele frequencies of the novel SNPs were determined ...
Song, Honglin; Ramus, Susan J; Kjaer, Susanne Krüger
Because both ovarian and breast cancer are hormone-related and are known to have some predisposition genes in common, we evaluated 11 of the most significant hits (six with confirmed associations with breast cancer) from the breast cancer genome-wide association study for association with invasive...... cases and 6308 controls from eight independent studies. Only rs4954956 was significantly associated with ovarian cancer risk both in the replication study and in combined analyses. This association was stronger for the serous histological subtype [per minor allele odds ratio (OR) 1.07 95% CI 1.......01-1.13, P-trend = 0.02 for all types of ovarian cancer and OR 1.14 95% CI 1.07-1.22, P-trend = 0.00017 for serous ovarian cancer]. In conclusion, we found that rs4954956 was associated with increased ovarian cancer risk, particularly for serous ovarian cancer. However, none of the six confirmed breast...
Sørensen, Mette; Nygaard, Marianne; Dato, Serena
-old Danes (age 92-93) with 4 phenotypes known to predict their survival: cognitive function, hand grip strength, activity of daily living (ADL), and self-rated health. Based on previous studies in humans and foxo animal models, we also explore self-reported diabetes, cancer, cardiovascular disease......, osteoporosis, and bone (femur/spine/hip/wrist) fracture. Gene-based testing revealed significant associations of FOXO3A variation with ADL (P = 0.044) and bone fracture (P = 0.006). The single-SNP statistics behind the gene-based analysis indicated increased ADL (decreased disability) and reduced bone fracture...
Leyrat, Clémence; Seaman, Shaun R; White, Ian R; Douglas, Ian; Smeeth, Liam; Kim, Joseph; Resche-Rigon, Matthieu; Carpenter, James R; Williamson, Elizabeth J
Inverse probability of treatment weighting is a popular propensity score-based approach to estimate marginal treatment effects in observational studies at risk of confounding bias. A major issue when estimating the propensity score is the presence of partially observed covariates. Multiple imputation is a natural approach to handle missing data on covariates: covariates are imputed and a propensity score analysis is performed in each imputed dataset to estimate the treatment effect. The treatment effect estimates from each imputed dataset are then combined to obtain an overall estimate. We call this method MIte. However, an alternative approach has been proposed, in which the propensity scores are combined across the imputed datasets (MIps). Therefore, there are remaining uncertainties about how to implement multiple imputation for propensity score analysis: (a) should we apply Rubin's rules to the inverse probability of treatment weighting treatment effect estimates or to the propensity score estimates themselves? (b) does the outcome have to be included in the imputation model? (c) how should we estimate the variance of the inverse probability of treatment weighting estimator after multiple imputation? We studied the consistency and balancing properties of the MIte and MIps estimators and performed a simulation study to empirically assess their performance for the analysis of a binary outcome. We also compared the performance of these methods to complete case analysis and the missingness pattern approach, which uses a different propensity score model for each pattern of missingness, and a third multiple imputation approach in which the propensity score parameters are combined rather than the propensity scores themselves (MIpar). Under a missing at random mechanism, complete case and missingness pattern analyses were biased in most cases for estimating the marginal treatment effect, whereas multiple imputation approaches were approximately unbiased as long as the
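The difference between the MIte and MIps strategies described in this abstract can be sketched as follows. Several details are assumptions made for illustration only: the treatment effect is taken to be a risk difference with normalized weights, treatment and outcome are taken as fully observed so that only the propensity scores vary across imputed datasets, and all numbers are invented.

```python
def iptw_risk_difference(treated, outcome, ps):
    """Normalized IPTW estimate of a marginal risk difference (assumed estimand)."""
    w1 = [t / e for t, e in zip(treated, ps)]               # weights for treated
    w0 = [(1 - t) / (1 - e) for t, e in zip(treated, ps)]   # weights for controls
    mean1 = sum(w * y for w, y in zip(w1, outcome)) / sum(w1)
    mean0 = sum(w * y for w, y in zip(w0, outcome)) / sum(w0)
    return mean1 - mean0


def mi_te(treated, outcome, ps_by_imputation):
    """MIte: estimate the effect in each imputed dataset, then average the effects."""
    effects = [
        iptw_risk_difference(treated, outcome, ps) for ps in ps_by_imputation
    ]
    return sum(effects) / len(effects)


def mi_ps(treated, outcome, ps_by_imputation):
    """MIps: average each subject's propensity score first, then estimate once."""
    m = len(ps_by_imputation)
    avg_ps = [sum(scores) / m for scores in zip(*ps_by_imputation)]
    return iptw_risk_difference(treated, outcome, avg_ps)


# Hypothetical data: six subjects, propensity scores from three imputed datasets.
treated = [1, 1, 1, 0, 0, 0]
outcome = [1, 0, 1, 0, 1, 0]
ps_by_imputation = [
    [0.60, 0.55, 0.70, 0.40, 0.35, 0.30],
    [0.65, 0.50, 0.72, 0.42, 0.30, 0.33],
    [0.58, 0.57, 0.68, 0.45, 0.38, 0.28],
]

print(mi_te(treated, outcome, ps_by_imputation))
print(mi_ps(treated, outcome, ps_by_imputation))
```

Variance estimation, which the abstract lists as a separate open question, is deliberately left out of this sketch.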
Wu, Yuanshan; Yin, Guosheng
The main challenge in the context of cure rate analysis is that one never knows whether censored subjects are cured or uncured, or whether they are susceptible or insusceptible to the event of interest. Considering the susceptible indicator as missing data, we propose a multiple imputation approach to cure rate quantile regression for censored data with a survival fraction. We develop an iterative algorithm to estimate the conditionally uncured probability for each subject. By utilizing this estimated probability and Bernoulli sample imputation, we can classify each subject as cured or uncured, and then employ the locally weighted method to estimate the quantile regression coefficients with only the uncured subjects. Repeating the imputation procedure multiple times and taking an average over the resultant estimators, we obtain consistent estimators for the quantile regression coefficients. Our approach relaxes the usual global linearity assumption, so that we can apply quantile regression to any particular quantile of interest. We establish asymptotic properties for the proposed estimators, including both consistency and asymptotic normality. We conduct simulation studies to assess the finite-sample performance of the proposed multiple imputation method and apply it to a lung cancer study as an illustration. © 2016, The International Biometric Society.
Ma, Peipei; Brøndum, Rasmus Froberg; Qin, Zahng
This study investigated the imputation accuracy of different methods, considering both the minor allele frequency and relatedness between individuals in the reference and test data sets. Two data sets from the combined population of Swedish and Finnish Red Cattle were used to test the influence...... coefficient was lower when the minor allele frequency was lower. The results indicate that Beagle and IMPUTE2 provide the most robust and accurate imputation accuracies, but considering computing time and memory usage, FImpute is another alternative method....
Kunkel, Deborah; Kaizar, Eloise E
Multiple imputation is a popular method for addressing missing data, but its implementation is difficult when data have a multilevel structure and one or more variables are systematically missing. This systematic missing data pattern may commonly occur in meta-analysis of individual participant data, where some variables are never observed in some studies, but are present in other hierarchical data settings. In these cases, valid imputation must account for both relationships between variables and correlation within studies. Proposed methods for multilevel imputation include specifying a full joint model and multiple imputation with chained equations (MICE). While MICE is attractive for its ease of implementation, there is little existing work describing conditions under which this is a valid alternative to specifying the full joint model. We present results showing that for multilevel normal models, MICE is rarely exactly equivalent to joint model imputation. Through a simulation study and an example using data from a traumatic brain injury study, we found that in spite of theoretical differences, MICE imputations often produce results similar to those obtained using the joint model. We also assess the influence of prior distributions in MICE imputation methods and find that when missingness is high, prior choices in MICE models tend to affect estimation of across-study variability more than compatibility of conditional likelihoods. Copyright © 2017 John Wiley & Sons, Ltd.
Madsen, Bo Eskerod; Villesen, Palle; Wiuf, Carsten
By surveying a filtered, high-quality set of SNPs in the human genome, we have found that SNPs positioned 1, 2, 4, 6, or 8 bp apart are more frequent than SNPs positioned 3, 5, 7, or 9 bp apart. The observed pattern is not restricted to genomic regions that are known to cause sequencing...... periodic DNA. Our results suggest that not all SNPs in the human genome are created by independent single nucleotide mutations, and that care should be taken in analysis of SNPs from periodic DNA. The latter may have important consequences for SNP and association studies....... or alignment errors, for example, transposable elements (SINE, LINE, and LTR), tandem repeats, and large duplicated regions. However, we found that the pattern is almost entirely confined to what we define as "periodic DNA." Periodic DNA is a genomic region with a high degree of periodicity in nucleotide usage...
Sbarra, David A.; Emery, Robert E.
Using statistically imputed data to increase available power, this article reevaluated the long-term effects of divorce mediation on adults’ psychological adjustment and investigated the relations among coparenting custody conflict, nonacceptance of marital termination, and depression at 2 occasions over a decade apart following marital dissolution. Group comparisons revealed that fathers and parents who mediated their custody disputes reported significantly more nonacceptance at the 12-year follow-up assessment. Significant interactions were observed by gender in regression models predicting nonacceptance at the follow-up; mothers’ nonacceptance was positively associated with concurrent depression, whereas fathers’ nonacceptance was positively associated with early nonacceptance and negatively associated with concurrent conflict. PMID:15709851
Lee, Michael A; Keane, Orla M; Glass, Belinda C; Manley, Tim R; Cullen, Neil G; Dodds, Ken G; McCulloch, Alan F; Morris, Chris A; Schreiber, Mark; Warren, Jonathan; Zadissa, Amonida; Wilson, Theresa; McEwan, John C
Single nucleotide polymorphisms (SNPs) are an abundant form of genetic variation in the genome of every species and are useful for gene mapping and association studies. Of particular interest are non-synonymous SNPs, which may alter protein function and phenotype. We therefore examined bovine expressed sequences for non-synonymous SNPs and validated and tested selected SNPs for their association with measured traits. Over 500,000 public bovine expressed sequence tagged (EST) sequences were used to search for coding SNPs (cSNPs). A total of 15,353 SNPs were detected in the transcribed sequences studied, of which 6,325 were predicted to be coding SNPs with the remaining 9,028 SNPs presumed to be in untranslated regions. Of the cSNPs detected, 2,868 were predicted to result in a change in the amino acid encoded. In order to determine the actual number of non-synonymous polymorphic SNPs we designed assays for 920 of the putative SNPs. These SNPs were then genotyped through a panel of cattle DNA pools using chip-based MALDI-TOF mass spectrometry. Of the SNPs tested, 29% were found to be polymorphic with a minor allele frequency >10%. A subset of the SNPs was genotyped through animal resources in order to look for association with age of puberty, facial eczema resistance or meat yield. Three SNPs were nominally associated with resistance to the disease facial eczema (P ...). We have identified 15,353 putative SNPs in or close to bovine genes and 2,868 of these SNPs were predicted to be non-synonymous. Approximately 29% of the non-synonymous SNPs were polymorphic and common with a minor allele frequency >10%. Of the SNPs detected in this study, 99% have not been previously reported. These novel SNPs will be useful for association studies or gene mapping.
Dario, Paulo; Oliveira, Ana Rita; Ribeiro, Teresa; Porto, Maria João; Dias, Deodália; Corte Real, Francisco
In recent years, autosomal single nucleotide polymorphisms (SNPs) have been comprehensively investigated in forensic research because they can complement short tandem repeat (STR) analysis in certain circumstances, or even be used on their own when analysis of STRs fails. However, as with STRs, in order to properly use SNP markers in forensic casework we need to understand the population and forensic parameters in question. As a result of Portugal's colonial history during the time of empire, and the subsequent process of decolonization, some African individuals migrated to Portugal, giving rise to large African and African-descendent communities. One of these groups is the community originating from Guinea-Bissau, which in 2014 numbered more than 17,700 individuals with official residency status, more than the population of Guinea-Bissau's third-largest city. In order to study the population and forensic parameters mentioned above for the two populations important to our casework, a total of 142 unrelated individuals from the South of Portugal and 90 immigrants from Guinea-Bissau (likewise unrelated and all residing in Portugal) were typed with the SNaPshot™ assay for all 52 loci included in the SNPforID 52plex. Copyright © 2016 Elsevier Ireland Ltd. All rights reserved.
Beesley, Lauren J; Bartlett, Jonathan W; Wolf, Gregory T; Taylor, Jeremy M G
We explore several approaches for imputing partially observed covariates when the outcome of interest is a censored event time and when there is an underlying subset of the population that will never experience the event of interest. We call these subjects 'cured', and we consider the case where the data are modeled using a Cox proportional hazards (CPH) mixture cure model. We study covariate imputation approaches using fully conditional specification. We derive the exact conditional distribution and suggest a sampling scheme for imputing partially observed covariates in the CPH cure model setting. We also propose several approximations to the exact distribution that are simpler and more convenient to use for imputation. A simulation study demonstrates that the proposed imputation approaches outperform existing imputation approaches for survival data without a cure fraction in terms of bias in estimating CPH cure model parameters. We apply our multiple imputation techniques to a study of patients with head and neck cancer. Copyright © 2016 John Wiley & Sons, Ltd.
Twisk, Jos; de Boer, Michiel; de Vente, Wieke; Heymans, Martijn
As a result of the development of sophisticated techniques, such as multiple imputation, the interest in handling missing data in longitudinal studies has increased enormously in past years. Within the field of longitudinal data analysis, there is a current debate on whether it is necessary to use multiple imputations before performing a mixed-model analysis to analyze the longitudinal data. In the current study this necessity is evaluated. The results of mixed-model analyses with and without multiple imputation were compared with each other. Four data sets with missing values were created: one data set with data missing completely at random, two data sets with data missing at random, and one data set with data missing not at random. In all data sets, the relationship between a continuous outcome variable and two different covariates was analyzed: a time-independent dichotomous covariate and a time-dependent continuous covariate. Although for all types of missing data, the results of the mixed-model analysis with or without multiple imputations were slightly different, they were not in favor of one of the two approaches. In addition, repeating the multiple imputations 100 times showed that the results of the mixed-model analysis with multiple imputation were quite unstable. It is not necessary to handle missing data using multiple imputations before performing a mixed-model analysis on longitudinal data. Copyright © 2013 Elsevier Inc. All rights reserved.
Speidel, Matthias; Drechsler, Jörg; Sakshaug, Joseph W
When datasets are affected by nonresponse, imputation of the missing values is a viable solution. However, most imputation routines implemented in commonly used statistical software packages do not accommodate multilevel models that are popular in education research and other settings involving clustering of units. A common strategy to take the hierarchical structure of the data into account is to include cluster-specific fixed effects in the imputation model. Still, this ad hoc approach has never been compared analytically to the congenial multilevel imputation in a random slopes setting. In this paper, we evaluate the impact of the cluster-specific fixed-effects imputation model on multilevel inference. We show analytically that the cluster-specific fixed-effects imputation strategy will generally bias inferences obtained from random coefficient models. The bias of random-effects variances and global fixed-effects confidence intervals depends on the cluster size, the relation of within- and between-cluster variance, and the missing data mechanism. We illustrate the negative implications of cluster-specific fixed-effects imputation using simulation studies and an application based on data from the National Educational Panel Study (NEPS) in Germany.
Ma, Peipei; Lund, Mogens Sandø; Ding, X
This study investigated the effect of including Nordic Holsteins in the reference population on the imputation accuracy and prediction accuracy for Chinese Holsteins. The data used in this study include 85 Chinese Holstein bulls genotyped with both 54K chip and 777K (HD) chip, 2862 Chinese cows...... was improved slightly when using the marker data imputed based on the combined HD reference data, compared with using the marker data imputed based on the Chinese HD reference data only. On the other hand, when using the combined reference population including 4398 Nordic Holstein bulls, the accuracy...... to increase reference population rather than increasing marker density...
Zhou, Muhan; He, Yulei; Yu, Mandi; Hsu, Chiu-Hsieh
Incomplete categorical variables with more than two categories are common in public health data. However, most of the existing missing-data methods do not use the information from nonresponse (missingness) probabilities. We propose a nearest-neighbour multiple imputation approach to impute a missing at random categorical outcome and to estimate the proportion of each category. The donor set for imputation is formed by measuring distances between each missing value with other non-missing values. The distance function is calculated based on a predictive score, which is derived from two working models: one fits a multinomial logistic regression for predicting the missing categorical outcome (the outcome model) and the other fits a logistic regression for predicting missingness probabilities (the missingness model). A weighting scheme is used to accommodate contributions from two working models when generating the predictive score. A missing value is imputed by randomly selecting one of the non-missing values with the smallest distances. We conduct a simulation to evaluate the performance of the proposed method and compare it with several alternative methods. A real-data application is also presented. The simulation study suggests that the proposed method performs well when missingness probabilities are not extreme under some misspecifications of the working models. However, the calibration estimator, which is also based on two working models, can be highly unstable when missingness probabilities for some observations are extremely high. In this scenario, the proposed method produces more stable and better estimates. In addition, proper weights need to be chosen to balance the contributions from the two working models and achieve optimal results for the proposed method. We conclude that the proposed multiple imputation method is a reasonable approach to dealing with missing categorical outcome data with more than two levels for assessing the distribution of the outcome
Genotype imputation is a vital tool in genome-wide association studies (GWAS) and meta-analyses of multiple GWAS results. Imputation enables researchers to increase genomic coverage and to pool data generated using different genotyping platforms. HapMap samples are often employed as the reference panel. More recently, the 1000 Genomes Project resource is becoming the primary source for reference panels. Multiple GWAS and meta-analyses are targeting Latinos, the most populous and fastest growing minority group in the US. However, genotype imputation resources for Latinos are rather limited compared to individuals of European ancestry at present, largely because of the lack of good reference data. One choice of reference panel for Latinos is one derived from the population of Mexican individuals in Los Angeles contained in the HapMap Phase 3 project and the 1000 Genomes Project. However, a detailed evaluation of the quality of the imputed genotypes derived from the public reference panels has not yet been reported. Using simulation studies, the Illumina OmniExpress GWAS data from the Los Angeles Latino Eye Study and the MACH software package, we evaluated the accuracy of genotype imputation in Latinos. Our results show that the 1000 Genomes Project AMR+CEU+YRI reference panel provides the highest imputation accuracy for Latinos, and that also including Asian samples in the panel can reduce imputation accuracy. We also provide the imputation accuracy for each autosomal chromosome using the 1000 Genomes Project panel for Latinos. Our results serve as a guide to future imputation-based analysis in Latinos.
Bak, Nikolaj; Hansen, Lars Kai
indiscriminately. We note that the effects of imputation can be strongly dependent on what is missing. To help make decisions about which records should be imputed, we propose to use a machine learning approach to estimate the imputation error for each case with missing data. The method is thought...... to be a practical approach to help users using imputation after the informed choice to impute the missing data has been made. To do this all patterns of missing values are simulated in all complete cases, enabling calculation of the "true error" in each of these new cases. The error is then estimated for each case...... with missing values by weighing the "true errors" by similarity. The method can also be used to test the performance of different imputation methods. A universal numerical threshold of acceptable error cannot be set since this will differ according to the data, research question, and analysis method...
James R. Carpenter; Harvey Goldstein; Michael G. Kenward
Multiple imputation is becoming increasingly established as the leading practical approach to modelling partially observed data, under the assumption that the data are missing at random. However, many medical and social datasets are multilevel, and this structure should be reflected not only in the model of interest, but also in the imputation model. In particular, the imputation model should reflect the differences between level 1 variables and level 2 variables (which are constant across l...
Zheng, Jie; Li, Mupeng; Sun, Tao; Hu, Xiaolei; Li, Yuanjian; Chen, Xiaoping
To identify BSG single nucleotide polymorphisms (SNPs) in the Chinese Han population. Peripheral blood samples were collected from 48 unrelated healthy Chinese Han subjects. Sequences at the BSG locus, including the promoter region, all exons and exon-intron boundaries, were amplified and sequenced, followed by a Hardy-Weinberg equilibrium test and linkage disequilibrium (LD) analysis. A total of 19 SNPs were identified, 2 of which were novel. Genotype distributions of all SNPs were consistent with Hardy-Weinberg equilibrium. Four haplotype blocks were constructed throughout the gene locus, and 9 haplotype tag SNPs (htSNPs) were inferred. Distribution of SNPs was in accordance with the neutrality theory in the Chinese Han population. For the first time, systematic identification of BSG SNPs in the Chinese Han population has been done, and 9 htSNPs are identified. Our study provides a basis for further genetic association studies of related diseases as well as pharmacogenetic studies of drug response in the Chinese Han population.
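The Hardy-Weinberg check mentioned in this abstract can be illustrated with a minimal sketch. The abstract does not say which test the authors used; the version below assumes the common one-degree-of-freedom chi-square goodness-of-fit test, and the genotype counts in the usage note are hypothetical (chosen only to total 48 subjects).

```python
def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square statistic (1 df) for departure from Hardy-Weinberg proportions.

    n_aa, n_ab, n_bb are observed counts of the three genotypes at a
    biallelic SNP. Assumes the chi-square goodness-of-fit form of the test.
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)   # estimated frequency of allele A
    q = 1.0 - p                       # estimated frequency of allele B
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    # Skip cells with zero expectation (monomorphic SNP) to avoid dividing by zero.
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)


# Hypothetical counts for 48 subjects: exactly at HWE proportions.
stat = hwe_chi_square(12, 24, 12)
```

A statistic below 3.84 (the 5% critical value of the chi-square distribution with one degree of freedom) would be read as consistent with Hardy-Weinberg equilibrium. With only 48 subjects and rare alleles, an exact test may be preferable to this approximation.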
The FMR1 gene, a member of the fragile X-related gene family, is responsible for fragile X syndrome (FXS). Missense single-nucleotide polymorphisms (SNPs) are responsible for many complex diseases. The effect of FMR1 gene missense SNPs is unknown. The aim of this study, using in silico techniques, was to analyze all known missense mutations that can affect the functionality of the FMR1 gene, leading to mental retardation (MR) and FXS. Data on the human FMR1 gene were collected from the Ensembl database (release 81), National Centre for Biological Information dbSNP Short Genetic Variations database, 1000 Genomes Browser, and NHLBI Exome Sequencing Project Exome Variant Server. In silico analysis was then performed. One hundred-twenty different missense SNPs of the FMR1 gene were determined. Of these, 11.66 % of the FMR1 gene missense SNPs were in highly conserved domains, and 83.33 % were in domains with high variety. The results of the in silico prediction analysis showed that 31.66 % of the FMR1 gene SNPs were disease related and that 50 % of SNPs had a pathogenic effect. The results of the structural and functional analysis revealed that although the R138Q mutation did not seem to have a damaging effect on the protein, the G266E and I304N SNPs appeared to disturb the interaction between the domains and affect the function of the protein. This is the first study to analyze all missense SNPs of the FMR1 gene. The results indicate the applicability of a bioinformatics approach to FXS and other FMR1-related diseases. I think that the analysis of FMR1 gene missense SNPs using bioinformatics methods would help diagnosis of FXS and other FMR1-related diseases.
Bernhardt, Paul W; Wang, Huixia Judy; Zhang, Daowen
Models for survival data generally assume that covariates are fully observed. However, in medical studies it is not uncommon for biomarkers to be censored at known detection limits. A computationally-efficient multiple imputation procedure for modeling survival data with covariates subject to detection limits is proposed. This procedure is developed in the context of an accelerated failure time model with a flexible seminonparametric error distribution. The consistency and asymptotic normality of the multiple imputation estimator are established and a consistent variance estimator is provided. An iterative version of the proposed multiple imputation algorithm that approximates the EM algorithm for maximum likelihood is also suggested. Simulation studies demonstrate that the proposed multiple imputation methods work well while alternative methods lead to estimates that are either biased or more variable. The proposed methods are applied to analyze the dataset from a recently-conducted GenIMS study.
Seaman, Shaun R; Bartlett, Jonathan W; White, Ian R
Multiple imputation is often used for missing data. When a model contains as covariates more than one function of a variable, it is not obvious how best to impute missing values in these covariates. Consider a regression with outcome Y and covariates X and X². In 'passive imputation' a value X* is imputed for X and then X² is imputed as (X*)². A recent proposal is to treat X² as 'just another variable' (JAV) and impute X and X² under multivariate normality. We use simulation to investigate the performance of three methods that can easily be implemented in standard software: 1) linear regression of X on Y to impute X then passive imputation of X²; 2) the same regression but with predictive mean matching (PMM); and 3) JAV. We also investigate the performance of analogous methods when the analysis involves an interaction, and study the theoretical properties of JAV. The application of the methods when complete or incomplete confounders are also present is illustrated using data from the EPIC Study. JAV gives consistent estimation when the analysis is linear regression with a quadratic or interaction term and X is missing completely at random. When X is missing at random, JAV may be biased, but this bias is generally less than for passive imputation and PMM. Coverage for JAV was usually good when bias was small. However, in some scenarios with a more pronounced quadratic effect, bias was large and coverage poor. When the analysis was logistic regression, JAV's performance was sometimes very poor. PMM generally improved on passive imputation, in terms of bias and coverage, but did not eliminate the bias. Given the current state of available software, JAV is the best of a set of imperfect imputation methods for linear regression with a quadratic or interaction effect, but should not be used for logistic regression.
Seaman Shaun R
Background: Multiple imputation is often used for missing data. When a model contains as covariates more than one function of a variable, it is not obvious how best to impute missing values in these covariates. Consider a regression with outcome Y and covariates X and X². In 'passive imputation' a value X* is imputed for X and then X² is imputed as (X*)². A recent proposal is to treat X² as 'just another variable' (JAV) and impute X and X² under multivariate normality. Methods: We use simulation to investigate the performance of three methods that can easily be implemented in standard software: 1) linear regression of X on Y to impute X then passive imputation of X²; 2) the same regression but with predictive mean matching (PMM); and 3) JAV. We also investigate the performance of analogous methods when the analysis involves an interaction, and study the theoretical properties of JAV. The application of the methods when complete or incomplete confounders are also present is illustrated using data from the EPIC Study. Results: JAV gives consistent estimation when the analysis is linear regression with a quadratic or interaction term and X is missing completely at random. When X is missing at random, JAV may be biased, but this bias is generally less than for passive imputation and PMM. Coverage for JAV was usually good when bias was small. However, in some scenarios with a more pronounced quadratic effect, bias was large and coverage poor. When the analysis was logistic regression, JAV's performance was sometimes very poor. PMM generally improved on passive imputation, in terms of bias and coverage, but did not eliminate the bias. Conclusions: Given the current state of available software, JAV is the best of a set of imperfect imputation methods for linear regression with a quadratic or interaction effect, but should not be used for logistic regression.
Hamed Kharrati Koopaee
Single nucleotide polymorphisms (SNPs) are DNA sequence variations that occur when a single nucleotide, adenine (A), thymine (T), cytosine (C) or guanine (G), in the genome sequence is altered. Traditional and high throughput methods are two main strategies for SNPs genotyping. The SNPs genotyping technologies provide powerful resources for animal breeding programs. Genomic selection using SNPs is a new tool for choosing the best breeding animals. In addition, the high density maps using SNPs can provide useful genetic tools to study quantitative traits genetic variations. There are many sources of SNPs and exhaustive numbers of methods of SNP detection to be considered. For many traits in farm animals, the rate of genetic improvement can be nearly doubled when SNPs information is used compared to the current methods of genetic evaluation. The goal of this review is to characterize the SNPs genotyping methods and their applications in farm animals breeding.
Bernaards, Coen A.; Sijtsma, Klaas
Using simulation, studied the influence of each of 12 imputation methods and 2 methods using the EM algorithm on the results of maximum likelihood factor analysis as compared with results from the complete data factor analysis (no missing scores). Discusses why EM methods recovered complete data factor loadings better than imputation methods. (SLD)
James R. Carpenter
Multiple imputation is becoming increasingly established as the leading practical approach to modelling partially observed data, under the assumption that the data are missing at random. However, many medical and social datasets are multilevel, and this structure should be reflected not only in the model of interest, but also in the imputation model. In particular, the imputation model should reflect the differences between level 1 variables and level 2 variables (which are constant across level 1 units). This led us to develop the REALCOM-IMPUTE software, which we describe in this article. This software performs multilevel multiple imputation, and handles ordinal and unordered categorical data appropriately. It is freely available on-line, and may be used either as a standalone package, or in conjunction with the multilevel software MLwiN or Stata.
José Ruy Porto de Carvalho
Modeling by chained multiple imputation is an area of growing importance. However, its models and methods are frequently developed for specific applications. In this study the model for multiple imputation was used to estimate daily rainfall data. Daily precipitation records from several meteorological stations were used, obtained from the AGRITEMPO system for two homogeneous climatic zones. The precipitation values obtained for two dates (Jan. 20th 2005 and May 2nd 2005) using the multiple imputation model were compared with the geo-statistical techniques ordinary Kriging and Co-kriging, with altitude as an auxiliary variable. The multiple imputation model was 16% better for the first zone and over 23% better for the second one, compared to the rainfall estimation obtained by geo-statistical techniques. The model proved to be a versatile technique, presenting results coherent with the conditions of different zones and times.
Nur Afiqah Zakaria
The air quality measurement data obtained from continuous ambient air quality monitoring (CAAQM) stations usually contain missing data. The missing observations usually occur due to machine failure, routine maintenance and human error. In this study, the hourly monitoring data of CO, O3, PM10, SO2, NOx, NO2, ambient temperature and humidity were used to evaluate four imputation methods (Mean Top Bottom, Linear Regression, Multiple Imputation and Nearest Neighbour). Missing values were simulated in the air pollutant observations at four percentages, i.e. 5%, 10%, 15% and 20%. Performance measures, namely the Mean Absolute Error, Root Mean Squared Error, Coefficient of Determination and Index of Agreement, were used to describe the goodness of fit of the imputation methods. From the results of the performance measures, the Mean Top Bottom method was selected as the most appropriate imputation method for filling in the missing values in air pollutant data.
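The four performance measures named in this abstract can be computed as in the sketch below. The abstract gives no formulas, so two assumptions are made and should be checked against the paper: the Coefficient of Determination is taken as the squared Pearson correlation between observed and imputed values, and the Index of Agreement is taken in Willmott's original form. The function name and the sample values are hypothetical.

```python
import math


def imputation_scores(observed, imputed):
    """Return MAE, RMSE, R^2 and index of agreement for imputed vs. true values."""
    n = len(observed)
    o_mean = sum(observed) / n
    p_mean = sum(imputed) / n
    errors = [p - o for o, p in zip(observed, imputed)]

    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)

    # Assumption: R^2 as the squared Pearson correlation.
    cov = sum((o - o_mean) * (p - p_mean) for o, p in zip(observed, imputed))
    ss_o = sum((o - o_mean) ** 2 for o in observed)
    ss_p = sum((p - p_mean) ** 2 for p in imputed)
    r2 = cov * cov / (ss_o * ss_p)

    # Assumption: Willmott's index of agreement, centred on the observed mean.
    denom = sum((abs(p - o_mean) + abs(o - o_mean)) ** 2
                for o, p in zip(observed, imputed))
    ioa = 1.0 - sum(e * e for e in errors) / denom

    return {"mae": mae, "rmse": rmse, "r2": r2, "ioa": ioa}


# Hypothetical hourly values: every imputed value is one unit too high.
scores = imputation_scores([1.0, 2.0, 3.0, 4.0], [2.0, 3.0, 4.0, 5.0])
```

In this toy case the squared correlation is 1 even though every imputed value is biased by one unit, which is one reason such studies report error measures (MAE, RMSE) alongside correlation-type measures.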
Bouwman, Aniek C; Veerkamp, Roel F
The aim of this study was to determine the consequences of splitting sequencing effort over multiple breeds for imputation accuracy from a high-density SNP chip towards whole-genome sequence. Such information would assist for instance numerically smaller cattle breeds, but also pig and chicken breeders, who have to choose wisely how to spend their sequencing efforts over all the breeds or lines they evaluate. Sequence data from cattle breeds were used, because there are currently relatively many individuals from several breeds sequenced within the 1,000 Bull Genomes project. The advantage of whole-genome sequence data is that it carries the causal mutations, but the question is whether it is possible to impute the causal variants accurately. This study therefore focussed on imputation accuracy of variants with low minor allele frequency and breed specific variants. Imputation accuracy was assessed for chromosomes 1 and 29 as the correlation between observed and imputed genotypes. For chromosome 1, the average imputation accuracy was 0.70 with a reference population of 20 Holstein, and increased to 0.83 when the reference population was increased by including 3 other dairy breeds with 20 animals each. When the same number of animals from the Holstein breed were added the accuracy improved to 0.88, while adding the 3 other breeds to the reference population of 80 Holstein improved the average imputation accuracy marginally to 0.89. For chromosome 29, the average imputation accuracy was lower. Some variants benefitted from the inclusion of other breeds in the reference population, initially determined by the MAF of the variant in each breed, but even Holstein specific variants did gain imputation accuracy from the multi-breed reference population. This study shows that splitting sequencing effort over multiple breeds and combining the reference populations is a good strategy for imputation from high-density SNP panels towards whole-genome sequence when reference
In health and medical sciences, multiple imputation (MI) is now becoming popular to obtain valid inferences in the presence of missing data. However, MI of clustered data such as multicenter studies and individual participant data meta-analysis requires advanced imputation routines that preserve the hierarchical structure of data. In clustered data, a specific challenge is the presence of systematically missing data, when a variable is completely missing in some clusters, and sporadically missing data, when it is partly missing in some clusters. Unfortunately, little is known about how to perform MI when both types of missing data occur simultaneously. We develop a new class of hierarchical imputation approach based on chained equations methodology that simultaneously imputes systematically and sporadically missing data while allowing for arbitrary patterns of missingness among them. Here, we use a random effect imputation model and adopt a simplification over fully Bayesian techniques such as Gibbs sampler to directly obtain draws of parameters within each step of the chained equations. We justify through theoretical arguments and extensive simulation studies that the proposed imputation methodology has good statistical properties in terms of bias and coverage rates of parameter estimates. An illustration is given in a case study with eight individual participant datasets. © 2017 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
Xu, Zheng; Duan, Qing; Yan, Song; Chen, Wei; Li, Mingyao; Lange, Ethan; Li, Yun
Imputation of individual level genotypes at untyped markers using an external reference panel of genotyped or sequenced individuals has become standard practice in genetic association studies. Direct imputation of summary statistics can also be valuable, for example in meta-analyses where individual level genotype data are not available. Two methods (DIST and ImpG-Summary/LD), that assume a multivariate Gaussian distribution for the association summary statistics, have been proposed for imputing association summary statistics. However, both methods assume that the correlations between association summary statistics are the same as the correlations between the corresponding genotypes. This assumption can be violated in the presence of confounding covariates. We analytically show that in the absence of covariates, correlation among association summary statistics is indeed the same as that among the corresponding genotypes, thus serving as a theoretical justification for the recently proposed methods. We continue to prove that in the presence of covariates, correlation among association summary statistics becomes the partial correlation of the corresponding genotypes controlling for covariates. We therefore develop direct imputation of summary statistics allowing covariates (DISSCO). We consider two real-life scenarios where the correlation and partial correlation likely make practical difference: (i) association studies in admixed populations; (ii) association studies in presence of other confounding covariate(s). Application of DISSCO to real datasets under both scenarios shows at least comparable, if not better, performance compared with existing correlation-based methods, particularly for lower frequency variants. For example, DISSCO can reduce the absolute deviation from the truth by 3.9-15.2% for variants with minor allele frequency <5%. © The Author 2015. Published by Oxford University Press. All rights reserved. 
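The distinction this abstract draws between correlation and partial correlation can be made concrete with a small sketch. This is not the DISSCO implementation; it only evaluates the standard first-order partial correlation formula, r_xy.z = (r_xy - r_xz r_yz) / sqrt((1 - r_xz²)(1 - r_yz²)), on hypothetical centred toy values in which x and y stand for two genotype vectors and z for a single confounding covariate.

```python
import math


def pearson(a, b):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(a)
    a_mean, b_mean = sum(a) / n, sum(b) / n
    cov = sum((u - a_mean) * (v - b_mean) for u, v in zip(a, b))
    ss_a = sum((u - a_mean) ** 2 for u in a)
    ss_b = sum((v - b_mean) ** 2 for v in b)
    return cov / math.sqrt(ss_a * ss_b)


def partial_corr(x, y, z):
    """Correlation of x and y after controlling for one covariate z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Hypothetical toy data: x and y are related only through the shared covariate z.
z = [1.0, 1.0, -1.0, -1.0]
x = [2.0, 0.0, 0.0, -2.0]   # z plus a component orthogonal to z
y = [2.0, 0.0, -2.0, 0.0]   # z plus a different orthogonal component

raw = pearson(x, y)
adjusted = partial_corr(x, y, z)
```

Here the raw correlation is 0.5 while the partial correlation is 0, a toy version of the situation the abstract describes, where a covariate such as ancestry induces correlation between association statistics that the genotype correlation alone would misstate.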
de Goeij, Moniek C M; van Diepen, Merel; Jager, Kitty J; Tripepi, Giovanni; Zoccali, Carmine; Dekker, Friedo W
In many fields, including the field of nephrology, missing data are unfortunately an unavoidable problem in clinical/epidemiological research. The most common methods for dealing with missing data are complete case analysis (excluding patients with missing data), mean substitution (replacing missing values of a variable with the average of known values for that variable) and last observation carried forward. However, these methods have severe drawbacks potentially resulting in biased estimates and/or standard errors. In recent years, a new method has arisen for dealing with missing data called multiple imputation. This method predicts missing values based on other data present in the same patient. This procedure is repeated several times, resulting in multiple imputed data sets. Thereafter, estimates and standard errors are calculated in each imputation set and pooled into one overall estimate and standard error. The main advantage of this method is that missing data uncertainty is taken into account. Another advantage is that the method of multiple imputation gives unbiased results when data are missing at random, which is the most common type of missing data in clinical practice, whereas conventional methods do not. However, the method of multiple imputation has scarcely been used in medical literature. We, therefore, encourage authors to do so in the future when possible.
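The pooling step described in this abstract is usually carried out with Rubin's rules. The abstract does not name them, so treat the following as a minimal sketch under that assumption; the function name and the numbers in the usage note are hypothetical.

```python
import math


def pool_rubin(estimates, std_errors):
    """Pool m per-imputation estimates and standard errors with Rubin's rules.

    Returns (pooled_estimate, pooled_standard_error). Requires m >= 2.
    """
    m = len(estimates)
    q_bar = sum(estimates) / m                                   # pooled estimate
    within = sum(se ** 2 for se in std_errors) / m               # average sampling variance
    between = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)  # variance across imputations
    total = within + (1 + 1 / m) * between                       # total variance
    return q_bar, math.sqrt(total)


# Hypothetical regression coefficients from three imputed data sets.
pooled_est, pooled_se = pool_rubin([1.0, 2.0, 3.0], [1.0, 1.0, 1.0])
```

The between-imputation term is what carries the "missing data uncertainty" the abstract refers to: the pooled standard error here (about 1.53) is larger than the standard error of 1.0 obtained within any single imputed data set.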
Junger, W. L.; Ponce de Leon, A.
Missing data are major concerns in epidemiological studies of the health effects of environmental air pollutants. This article presents an imputation-based method that is suitable for multivariate time series data, which uses the EM algorithm under the assumption of normal distribution. Different approaches are considered for filtering the temporal component. A simulation study was performed to assess the validity and performance of the proposed method in comparison with some frequently used methods. Simulations showed that when the amount of missing data was as low as 5%, the complete data analysis yielded satisfactory results regardless of the generating mechanism of the missing data, whereas the validity began to degenerate when the proportion of missing values exceeded 10%. The proposed imputation method exhibited good accuracy and precision in different settings with respect to the patterns of missing observations. Most of the imputations obtained valid results, even under missing not at random. The methods proposed in this study are implemented as a package called mtsdi for the statistical software system R.
Morisot, Adeline; Bessaoud, Faïza; Landais, Paul; Rébillard, Xavier; Trétarre, Brigitte; Daurès, Jean-Pierre
Estimations of survival rates are diverse and the choice of the appropriate method depends on the context. Given the increasing interest in multiple imputation methods, we explored the interest of a multiple imputation approach in the estimation of cause-specific survival, when a subset of causes of death was observed. By using European Randomized Study of Screening for Prostate Cancer (ERSPC), 20 multiply imputed datasets were created and analyzed with a Multivariate Imputation by Chained Equation (MICE) algorithm. Then, cause-specific survival was estimated on each dataset with two methods: Kaplan-Meier and competing risks. The two pooled cause-specific survival and confidence intervals were obtained using Rubin's rules after complementary log-log transformation. Net survival was estimated using Pohar-Perme's estimator and was compared to pooled cause-specific survival. Finally, a sensitivity analysis was performed to test the robustness of our constructed multiple imputation model. Cause-specific survival performed better than net survival, since this latter exceeded 100 % for almost the first 2 years of follow-up and after 9 years whereas the cause-specific survival decreased slowly and then stabilized at around 94 % at 9 years. Sensitivity analysis results were satisfactory. On the basis of our prostate cancer data, the results obtained by cause-specific survival after multiple imputation appeared to be better and more realistic than those obtained using net survival.
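Pooling survival probabilities after a complementary log-log transformation, as this abstract describes, can be sketched as follows. The details are assumptions rather than the authors' code: standard errors are moved to the transformed scale with the delta method, and the interval uses a normal quantile of 1.96 where a t reference distribution might be used in practice. The function name and input values are hypothetical.

```python
import math


def pool_survival_cloglog(survivals, std_errors, z=1.96):
    """Pool survival estimates from m imputed datasets on the cloglog scale.

    Each survival must lie strictly between 0 and 1, and m must be >= 2.
    Returns (pooled_survival, lower_bound, upper_bound).
    """
    m = len(survivals)
    # g = log(-log(S)); delta method gives se(g) = se(S) / (S * |log S|).
    g = [math.log(-math.log(s)) for s in survivals]
    g_se = [se / (s * abs(math.log(s))) for s, se in zip(survivals, std_errors)]

    # Rubin's rules applied on the transformed scale.
    g_bar = sum(g) / m
    within = sum(se ** 2 for se in g_se) / m
    between = sum((x - g_bar) ** 2 for x in g) / (m - 1)
    total_se = math.sqrt(within + (1 + 1 / m) * between)

    def back(x):
        # Inverse transformation: S = exp(-exp(g)).
        return math.exp(-math.exp(x))

    # S is decreasing in g, so the upper limit of g gives the lower limit of S.
    return back(g_bar), back(g_bar + z * total_se), back(g_bar - z * total_se)


# Hypothetical 9-year survival estimates from three imputed datasets.
est, lower, upper = pool_survival_cloglog([0.93, 0.94, 0.95], [0.01, 0.01, 0.01])
```

Working on the transformed scale keeps both confidence limits inside the interval (0, 1), which a symmetric interval built directly on the survival scale does not guarantee for values near 1.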
Yamaguchi, Yusuke; Misumi, Toshihiro; Maruo, Kazushi
Longitudinal binary data are commonly encountered in clinical trials. Multiple imputation is an approach for obtaining valid estimates of treatment effects under the assumption of a missing at random mechanism. Although there are a variety of multiple imputation methods for longitudinal binary data, only a limited number of studies have reported on the relative performance of the methods. Moreover, when focusing on the treatment effect throughout a period, an endpoint that has often been used in clinical evaluations of specific disease areas, no definitive investigations comparing the methods have been available. We conducted an extensive simulation study to examine the comparative performance of six multiple imputation methods available in the SAS MI procedure for longitudinal binary data, where two endpoints of responder rates at a specified time point and throughout a period were assessed. The simulation study suggested that results from naive approaches of a single imputation with non-responders and a complete case analysis could be very sensitive to missing data. The multiple imputation methods using a monotone method and a full conditional specification with a logistic regression imputation model were recommended for obtaining unbiased and robust estimates of the treatment effect. The methods were illustrated with data from a mental health study.
Foy, M.; VanBaalen, M.; Wear, M.; Mendez, C.; Mason, S.; Meyers, V.; Alexander, D.; Law, J.
The default method of dealing with missing data in statistical analyses is to use only the complete observations (complete case analysis), which can lead to unexpected bias when data do not meet the assumption of missing completely at random (MCAR). For the assumption of MCAR to be met, missingness cannot be related to either the observed or unobserved variables. A less stringent assumption, missing at random (MAR), requires that missingness not be associated with the value of the missing variable itself, but allows it to be associated with the other observed variables. When data are truly MAR as opposed to MCAR, the default complete case analysis method can lead to biased results. There are statistical options available to adjust for data that are MAR, including multiple imputation (MI), which is consistent and efficient at estimating effects. Multiple imputation uses informing variables to determine statistical distributions for each piece of missing data. Then multiple datasets are created by randomly drawing from the distributions for each piece of missing data. Since MI is efficient, only a limited number of imputed datasets, usually fewer than 20, are required to get stable estimates. Each imputed dataset is analyzed using standard statistical techniques, and then results are combined to get overall estimates of effect. A simulation study will be presented to show the results of using the default complete case analysis and MI in a linear regression of MCAR and MAR simulated data. Further, MI was successfully applied to the association study of CO2 levels and headaches when initial analysis showed there may be an underlying association between missing CO2 levels and reported headaches. Through MI, we were able to show that there is a strong association between average CO2 levels and the risk of headaches. Each unit increase in CO2 (mmHg) resulted in a doubling in the odds of reported headaches.
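The contrast between MCAR and MAR under complete case analysis can be sketched with a small simulation. This is an illustration under stated assumptions, not the simulation study referred to in the abstract: the data-generating model (y = 1 + 2x + noise), the missingness probabilities, the sample size, and all function names are hypothetical. Missingness in the covariate x is made to depend on the fully observed outcome y, which is one MAR mechanism under which the complete-case regression slope is attenuated.

```python
import random


def ols_slope(pairs):
    """Least-squares slope of y on x for a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    return sxy / sxx


def simulate(n=20000, seed=1):
    """Return (full-data, MCAR complete-case, MAR complete-case) slopes."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        y = 1.0 + 2.0 * x + rng.gauss(0.0, 1.0)  # true slope is 2 (assumed model)
        data.append((x, y))
    # MCAR: x is missing with probability 0.5, unrelated to any variable
    mcar_cases = [(x, y) for x, y in data if rng.random() >= 0.5]
    # MAR: x is missing with probability 0.95 when the observed y exceeds 1,
    # and never missing otherwise; missingness depends only on observed y
    mar_cases = [(x, y) for x, y in data
                 if not (y > 1.0 and rng.random() < 0.95)]
    return ols_slope(data), ols_slope(mcar_cases), ols_slope(mar_cases)


full_slope, mcar_slope, mar_slope = simulate()
print(f"full data: {full_slope:.2f}")
print(f"complete cases, MCAR: {mcar_slope:.2f}")
print(f"complete cases, MAR:  {mar_slope:.2f}")
```

In this setup the MCAR complete-case slope stays close to the full-data slope, while the MAR complete-case slope is pulled toward zero because cases are effectively selected on the outcome.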
Xavier, A; Muir, William M; Rainey, Katy M
Success in genome-wide association studies and marker-assisted selection depends on good phenotypic and genotypic data. The more complete these data are, the more powerful the results of analysis will be. Nevertheless, there are next-generation technologies that seek to provide genotypic information in spite of great proportions of missing data. The procedures these technologies use to impute genetic data, therefore, greatly affect downstream analyses. This study aims to (1) compare the genetic variance in a single-nucleotide polymorphism panel of soybean with missing data imputed using various methods, (2) evaluate the imputation accuracy and post-imputation quality associated with these methods, and (3) evaluate the impact of imputation method on heritability and the accuracy of genome-wide prediction of soybean traits. The imputation methods we evaluated were as follows: multivariate mixed model, hidden Markov model, logical algorithm, k-nearest neighbor, singular value decomposition, and random forest. We used raw genotypes from the SoyNAM project and the following phenotypes: plant height, days to maturity, grain yield, and seed protein composition. We propose an imputation method based on multivariate mixed models using pedigree information. Our comparison of methods indicates that heritability of traits can be affected by the imputation method. Genotypes with missing values imputed with methods that make use of genealogic information can favor genetic analysis of highly polygenic traits, but not genome-wide prediction accuracy. The genotypic matrix captured the highest amount of genetic variance when missing loci were imputed by the method proposed in this paper. We concluded that hidden Markov models and random forest imputation are more suitable for studies that aim to analyze highly heritable traits, while pedigree-based methods can be used to best analyze traits with low heritability. Despite the notable contribution to heritability, advantages in genomic
Darabi, Hatef; Beesley, Jonathan; Droit, Arnaud
for driving breast cancer risk (lead SNP rs2787486 (OR = 0.92; CI 0.90-0.94; P = 8.96 × 10(-15))) and are correlated with two previously reported risk-associated variants at this locus, SNPs rs6504950 (OR = 0.94, P = 2.04 × 10(-09), r(2) = 0.73 with lead SNP) and rs1156287 (OR = 0.93, P = 3.41 × 10(-11), r(2...
Alma B Pedersen (1), Ellen M Mikkelsen (1), Deirdre Cronin-Fenton (1), Nickolaj R Kristensen (1), Tra My Pham (2), Lars Pedersen (1), Irene Petersen (1, 2). (1) Department of Clinical Epidemiology, Aarhus University Hospital, Aarhus N, Denmark; (2) Department of Primary Care and Population Health, University College London, London, UK
Abstract: Missing data are ubiquitous in clinical epidemiological research. Individuals with missing data may differ from those with no missing data in terms of the outcome of interest and prognosis in general. Missing data are often categorized into the following three types: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). In clinical epidemiological research, missing data are seldom MCAR. Missing data can constitute considerable challenges in the analyses and interpretation of results and can potentially weaken the validity of results and conclusions. A number of methods have been developed for dealing with missing data. These include complete-case analyses, missing indicator method, single value imputation, and sensitivity analyses incorporating worst-case and best-case scenarios. If applied under the MCAR assumption, some of these methods can provide unbiased but often less precise estimates. Multiple imputation is an alternative method to deal with missing data, which accounts for the uncertainty associated with missing data. Multiple imputation is implemented in most statistical software under the MAR assumption and provides unbiased and valid estimates of associations based on information from the available data. The method affects not only the coefficient estimates for variables with missing data but also the estimates for other variables with no missing data. Keywords: missing data, observational study, multiple imputation, MAR, MCAR, MNAR
Aims. The purpose of this study was to compare methods for handling missing data in analysis of the National Tuberculosis Surveillance System of the Centers for Disease Control and Prevention. Because of the high rate of missing human immunodeficiency virus (HIV) infection status in this dataset, we used multiple imputation methods to minimize the bias that may result from less sophisticated methods. Methods. We compared analysis based on multiple imputation methods with analysis based on deleting subjects with missing covariate data from regression analysis (case exclusion), and determined whether the use of increasing numbers of imputed datasets would lead to changes in the estimated association between isoniazid resistance and death. Results. Following multiple imputation, the odds ratio for initial isoniazid resistance and death was 2.07 (95% CI 1.30, 3.29); with case exclusion, this odds ratio decreased to 1.53 (95% CI 0.83, 2.83). The use of more than 5 imputed datasets did not substantively change the results. Conclusions. Our experience with the National Tuberculosis Surveillance System dataset supports the use of multiple imputation methods in epidemiologic analysis, but also demonstrates that close attention should be paid to the potential impact of missing covariates at each step of the analysis.
McGinniss, J; Harel, O
Missing values present challenges in the analysis of data across many areas of research. Handling incomplete data incorrectly can lead to bias, over-confident intervals, and inaccurate inferences. One principled method of handling incomplete data is multiple imputation. This article considers incomplete data in which values are missing for three or more qualitatively different reasons and applies a modified multiple imputation framework in the analysis of that data. Included are a proof of the methodology used for three-stage multiple imputation with its limiting distribution, an extension to more than three types of missing values, an extension to the ignorability assumption with proof, and simulations demonstrating that the estimator is unbiased and efficient under the ignorability assumption.
Enders, Craig K; Mistler, Stephen A; Keller, Brian T
Although missing data methods have advanced in recent years, methodologists have devoted less attention to multilevel data structures where observations at level-1 are nested within higher-order organizational units at level-2 (e.g., individuals within neighborhoods; repeated measures nested within individuals; students nested within classrooms). Joint modeling and chained equations imputation are the principal imputation frameworks for single-level data, and both have multilevel counterparts. These approaches differ algorithmically and in their functionality; both are appropriate for simple random intercept analyses with normally distributed data, but they differ beyond that. The purpose of this paper is to describe multilevel imputation strategies and evaluate their performance in a variety of common analysis models. Using multiple imputation theory and computer simulations, we derive 4 major conclusions: (a) joint modeling and chained equations imputation are appropriate for random intercept analyses; (b) the joint model is superior for analyses that posit different within- and between-cluster associations (e.g., a multilevel regression model that includes a level-1 predictor and its cluster means, a multilevel structural equation model with different path values at level-1 and level-2); (c) chained equations imputation provides a dramatic improvement over joint modeling in random slope analyses; and (d) a latent variable formulation for categorical variables is quite effective. We use a real data analysis to demonstrate multilevel imputation, and we suggest a number of avenues for future research. (PsycINFO Database Record (c) 2016 APA, all rights reserved).
Background. Epistatic miniarray profiling (E-MAPs) is a high-throughput approach capable of quantifying aggravating or alleviating genetic interactions between gene pairs. The datasets resulting from E-MAP experiments typically take the form of a symmetric pairwise matrix of interaction scores. These datasets have a significant number of missing values - up to 35% - that can reduce the effectiveness of some data analysis techniques and prevent the use of others. An effective method for imputing interactions would therefore increase the types of possible analysis, as well as increase the potential to identify novel functional interactions between gene pairs. Several methods have been developed to handle missing values in microarray data, but it is unclear how applicable these methods are to E-MAP data because of their pairwise nature and the significantly larger number of missing values. Here we evaluate four alternative imputation strategies, three local (nearest neighbor-based) and one global (PCA-based), that have been modified to work with symmetric pairwise data. Results. We identify different categories for the missing data based on their underlying cause, and show that values from the largest category can be imputed effectively. We compare local and global imputation approaches across a variety of distinct E-MAP datasets, showing that both are competitive and preferable to filling in with zeros. In addition we show that these methods are effective in an E-MAP from a different species, suggesting that pairwise imputation techniques will be increasingly useful as analogous epistasis mapping techniques are developed in different species. We show that strongly alleviating interactions are significantly more difficult to predict than strongly aggravating interactions. Finally we show that imputed interactions, generated using nearest neighbor methods, are enriched for annotations in the same manner as measured interactions. Therefore our method potentially
Multiple imputation is widely accepted as the method of choice to address item-nonresponse in surveys. However, research on imputation strategies for the hierarchical structures that are typically found in the data in educational contexts is still limited. While a multilevel imputation model should be preferred from a theoretical point of view if…
Lee, Taehun; Cai, Li
Model-based multiple imputation has become an indispensable method in the educational and behavioral sciences. Mean and covariance structure models are often fitted to multiply imputed data sets. However, the presence of multiple random imputations complicates model fit testing, which is an important aspect of mean and covariance structure…
Poyatos, Rafael; Sus, Oliver; Vilà-Cabrera, Albert; Vayreda, Jordi; Badiella, Llorenç; Mencuccini, Maurizio; Martínez-Vilalta, Jordi
Plant functional traits are increasingly being used in ecosystem ecology thanks to the growing availability of large ecological databases. However, these databases usually contain a large fraction of missing data because measuring plant functional traits systematically is labour-intensive and because most databases are compilations of datasets with different sampling designs. As a result, within a given database, there is an inevitable variability in the number of traits available for each data entry and/or the species coverage in a given geographical area. The presence of missing data may severely bias trait-based analyses, such as the quantification of trait covariation or trait-environment relationships and may hamper efforts towards trait-based modelling of ecosystem biogeochemical cycles. Several data imputation (i.e. gap-filling) methods have been recently tested on compiled functional trait databases, but the performance of imputation methods applied to a functional trait database with a regular spatial sampling has not been thoroughly studied. Here, we assess the effects of data imputation on five tree functional traits (leaf biomass to sapwood area ratio, foliar nitrogen, maximum height, specific leaf area and wood density) in the Ecological and Forest Inventory of Catalonia, an extensive spatial database (covering 31900 km2). We tested the performance of species mean imputation, single imputation by the k-nearest neighbors algorithm (kNN) and a multiple imputation method, Multivariate Imputation with Chained Equations (MICE) at different levels of missing data (10%, 30%, 50%, and 80%). We also assessed the changes in imputation performance when additional predictors (species identity, climate, forest structure, spatial structure) were added in kNN and MICE imputations. We evaluated the imputed datasets using a battery of indexes describing departure from the complete dataset in trait distribution, in the mean prediction error, in the correlation matrix
Katya L Masconi
Imputation techniques used to handle missing data are based on the principle of replacement. It is widely advocated that multiple imputation is superior to other imputation methods; however, studies have suggested that simple methods for filling missing data can be just as accurate as complex methods. The objective of this study was to implement a number of simple and more complex imputation methods, and assess the effect of these techniques on the performance of undiagnosed diabetes risk prediction models during external validation. Data from the Cape Town Bellville-South cohort served as the basis for this study. Imputation methods and models were identified via recent systematic reviews. Models' discrimination was assessed and compared using C-statistic and non-parametric methods, before and after recalibration through simple intercept adjustment. The study sample consisted of 1256 individuals, of whom 173 were excluded due to previously diagnosed diabetes. Of the final 1083 individuals, 329 (30.4%) had missing data. Family history had the highest proportion of missing data (25%). Imputation of the outcome, undiagnosed diabetes, was highest in stochastic regression imputation (163 individuals). Overall, deletion resulted in the lowest model performances while simple imputation yielded the highest C-statistic for the Cambridge Diabetes Risk model, Kuwaiti Risk model, Omani Diabetes Risk model and Rotterdam Predictive model. Multiple imputation only yielded the highest C-statistic for the Rotterdam Predictive model, which was matched by simpler imputation methods. Deletion was confirmed as a poor technique for handling missing data. However, despite the emphasized disadvantages of simpler imputation methods, this study showed that implementing these methods results in similar predictive utility for undiagnosed diabetes when compared to multiple imputation.
Background. Incomplete categorical variables with more than two categories are common in public health data. However, most of the existing missing-data methods do not use the information from nonresponse (missingness) probabilities. Methods. We propose a nearest-neighbour multiple imputation approach to impute a missing at random categorical outcome and to estimate the proportion of each category. The donor set for imputation is formed by measuring distances between each missing value with other non-missing values. The distance function is calculated based on a predictive score, which is derived from two working models: one fits a multinomial logistic regression for predicting the missing categorical outcome (the outcome model) and the other fits a logistic regression for predicting missingness probabilities (the missingness model). A weighting scheme is used to accommodate contributions from two working models when generating the predictive score. A missing value is imputed by randomly selecting one of the non-missing values with the smallest distances. We conduct a simulation to evaluate the performance of the proposed method and compare it with several alternative methods. A real-data application is also presented. Results. The simulation study suggests that the proposed method performs well when missingness probabilities are not extreme under some misspecifications of the working models. However, the calibration estimator, which is also based on two working models, can be highly unstable when missingness probabilities for some observations are extremely high. In this scenario, the proposed method produces more stable and better estimates. In addition, proper weights need to be chosen to balance the contributions from the two working models and achieve optimal results for the proposed method. Conclusions. We conclude that the proposed multiple imputation method is a reasonable approach to dealing with missing categorical outcome data with
Landmark-Høyvik, Hege; Dumeaux, Vanessa; Nebdal, Daniel; Lund, Eiliv; Tost, Jörg; Kamatani, Yoichiro; Renault, Victor; Børresen-Dale, Anne-Lise; Kristensen, Vessela; Edvardsen, Hege
We investigated the effect of genetic variation on gene expression in blood from a cohort of BC survivors. Further, we investigated the associations that were specific for BC survivors by performing identical analyses for a group of healthy women and comparing the results. eQTL analysis was performed for 288 BC survivors (full data set). Further, using a subset of the data, eQTL analyses were performed on 288 BC survivors and on 81 healthy women separately and results were compared. A large number of associations were observed for the BC survivors, and the expression of human leukocyte antigen genes was found associated with SNPs in 100 genes. The comparison analyses with healthy women revealed associations occurring specifically in BC survivors, and the genes showed enrichment for immune system processes. The results suggest that the immune system has a different constitution in BC survivors compared to healthy women. © 2013 Elsevier Inc. All rights reserved.
Liou, Michelle; Cheng, Philip E.
Different data imputation techniques that are useful for equipercentile equating are discussed, and empirical data are used to evaluate the accuracy of these techniques as compared with chained equipercentile equating. The kernel estimator, the EM algorithm, the EB model, and the iterative moment estimator are considered. (SLD)
Ahmad, Meraj; Sinha, Anubhav; Ghosh, Sreya; Kumar, Vikrant; Davila, Sonia; Yajnik, Chittaranjan S; Chandak, Giriraj R
Imputation is a computational method based on the principle of haplotype sharing allowing enrichment of genome-wide association study datasets. It depends on the haplotype structure of the population and density of the genotype data. The 1000 Genomes Project led to the generation of imputation reference panels which have been used globally. However, recent studies have shown that population-specific panels provide better enrichment of genome-wide variants. We compared the imputation accuracy using the 1000 Genomes phase 3 reference panel and a panel generated from genome-wide data on 407 individuals from Western India (WIP). The concordance of imputed variants was cross-checked with next-generation re-sequencing data on a subset of genomic regions. Further, using the genome-wide data from 1880 individuals, we demonstrate that WIP works better than the 1000 Genomes phase 3 panel and, when merged with it, significantly improves the imputation accuracy throughout the minor allele frequency range. We also show that imputation using only the South Asian component of the 1000 Genomes phase 3 panel works as well as the merged panel, making it a computationally less intensive option. Thus, our study stresses that imputation accuracy using the 1000 Genomes phase 3 panel can be further improved by including population-specific reference panels from South Asia.
Taylor, Sandra L; Ruhaak, L Renee; Kelly, Karen; Weiss, Robert H; Kim, Kyoungmi
With expanded access to, and decreased costs of, mass spectrometry, investigators are collecting and analyzing multiple biological matrices from the same subject such as serum, plasma, tissue and urine to enhance biomarker discoveries, understanding of disease processes and identification of therapeutic targets. Commonly, each biological matrix is analyzed separately, but multivariate methods such as MANOVAs that combine information from multiple biological matrices are potentially more powerful. However, mass spectrometric data typically contain large amounts of missing values, and imputation is often used to create complete data sets for analysis. The effects of imputation on multiple biological matrix analyses have not been studied. We investigated the effects of seven imputation methods (half minimum substitution, mean substitution, k-nearest neighbors, local least squares regression, Bayesian principal components analysis, singular value decomposition and random forest), on the within-subject correlation of compounds between biological matrices and its consequences on MANOVA results. Through analysis of three real omics data sets and simulation studies, we found the amount of missing data and imputation method to substantially change the between-matrix correlation structure. The magnitude of the correlations was generally reduced in imputed data sets, and this effect increased with the amount of missing data. Significant results from MANOVA testing also were substantially affected. In particular, the number of false positives increased with the level of missing data for all imputation methods. No one imputation method was universally the best, but the simple substitution methods (Half Minimum and Mean) consistently performed poorly. © The Author 2016. Published by Oxford University Press. For Permissions, please email: email@example.com.
Yozgatligil, Ceylan; Aslan, Sipan; Iyigun, Cem; Batmaz, Inci
This study aims to compare several imputation methods to complete the missing values of spatio-temporal meteorological time series. To this end, six imputation methods are assessed with respect to various criteria including accuracy, robustness, precision, and efficiency for artificially created missing data in monthly total precipitation and mean temperature series obtained from the Turkish State Meteorological Service. Of these methods, simple arithmetic average, normal ratio (NR), and NR weighted with correlations comprise the simple ones, whereas a multilayer perceptron type neural network and a multiple imputation strategy adopted by Markov chain Monte Carlo based on expectation-maximization (EM-MCMC) are the computationally intensive ones. In addition, we propose a modification to the EM-MCMC method. Besides using a conventional accuracy measure based on squared errors, we also suggest the correlation dimension (CD) technique of nonlinear dynamic time series analysis, which takes spatio-temporal dependencies into account, for evaluating imputation performance. Based on the detailed graphical and quantitative analysis, it can be said that although computational methods, particularly the EM-MCMC method, are computationally inefficient, they seem favorable for imputation of meteorological time series with respect to different missingness periods considering both measures and both series studied. To conclude, using the EM-MCMC algorithm for imputing missing values before conducting any statistical analyses of meteorological data will definitely decrease the amount of uncertainty and give more robust results. Moreover, the CD measure can be suggested for the performance evaluation of missing data imputation, particularly with computational methods, since it gives more precise results in meteorological time series.
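For readers unfamiliar with the normal ratio (NR) method named in this abstract, the following is a minimal sketch of its commonly described textbook form: the missing value at a target station is estimated as the average of neighboring stations' observations, each rescaled by the ratio of the target station's long-term mean ("normal") to that neighbor's. The abstract does not give the exact variant used in the study, so the function name, argument names, and example numbers are assumptions, and the correlation-weighted variant is not shown.

```python
def normal_ratio_estimate(target_normal, neighbor_values, neighbor_normals):
    """Estimate a missing value at a target station (hypothetical helper).

    target_normal    : long-term mean at the target station
    neighbor_values  : observations at neighboring stations for the missing period
    neighbor_normals : long-term means at those same neighboring stations
    """
    if not neighbor_values or len(neighbor_values) != len(neighbor_normals):
        raise ValueError("need matching, non-empty neighbor values and normals")
    # Rescale each neighbor's observation by target_normal / neighbor_normal
    scaled = [target_normal / normal * value
              for value, normal in zip(neighbor_values, neighbor_normals)]
    # Unweighted average of the rescaled observations
    return sum(scaled) / len(scaled)


# Illustrative, made-up monthly precipitation totals (mm)
estimate = normal_ratio_estimate(
    target_normal=80.0,
    neighbor_values=[60.0, 45.0, 90.0],
    neighbor_normals=[75.0, 60.0, 100.0],
)
print(f"estimated missing value: {estimate:.1f} mm")
```

The rescaling is what separates NR from the simple arithmetic average: a neighbor that is systematically wetter or drier than the target station does not pull the estimate toward its own climate.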
Siddique, Juned; Reiter, Jerome P; Brincks, Ahnalee; Gibbons, Robert D; Crespi, Catherine M; Brown, C Hendricks
There are many advantages to individual participant data meta-analysis for combining data from multiple studies. These advantages include greater power to detect effects, increased sample heterogeneity, and the ability to perform more sophisticated analyses than meta-analyses that rely on published results. However, a fundamental challenge is that it is unlikely that variables of interest are measured the same way in all of the studies to be combined. We propose that this situation can be viewed as a missing data problem in which some outcomes are entirely missing within some trials and use multiple imputation to fill in missing measurements. We apply our method to five longitudinal adolescent depression trials where four studies used one depression measure and the fifth study used a different depression measure. None of the five studies contained both depression measures. We describe a multiple imputation approach for filling in missing depression measures that makes use of external calibration studies in which both depression measures were used. We discuss some practical issues in developing the imputation model including taking into account treatment group and study. We present diagnostics for checking the fit of the imputation model and investigate whether external information is appropriately incorporated into the imputed values. Copyright © 2015 John Wiley & Sons, Ltd.
Chang, Hsueh-Wei; Yang, Cheng-Hong; Chang, Phei-Lang; Cheng, Yu-Huei; Chuang, Li-Yeh
Restriction fragment length polymorphism (RFLP) is a common laboratory method for the genotyping of single nucleotide polymorphisms (SNPs). Here, we describe web-based software, named SNP-RFLPing, which provides the restriction enzyme for RFLP assays on a batch of SNPs and genes from the human, rat, and mouse genomes. Three user-friendly input types are accepted for the SNP-RFLPing assay: 1) NCBI dbSNP "rs" or "ss" IDs; 2) NCBI Entrez gene ID and HUGO gene name; 3) any format of SNP-in-sequence. These inputs are auto-programmed to SNP-containing sequences and their complementary sequences for the selection of restriction enzymes. All SNPs with available RFLP restriction enzymes for each input gene are provided, even if many SNPs exist. The SNP-RFLPing analysis provides the SNP contig position, heterozygosity, function, protein residue, and amino acid position for cSNPs, as well as commercial and non-commercial restriction enzymes. This web-based software solves the input format problems in similar software and greatly simplifies the procedure for providing the RFLP enzyme. Mixed free forms of input data are friendly to users who perform the SNP-RFLPing assay. SNP-RFLPing offers a time-saving application for association studies in personalized medicine and is freely available at http://bio.kuas.edu.tw/snp-rflp/.
In studies that use electronic health record data, imputation of important data elements such as glycated hemoglobin (A1c) has become common. However, few studies have systematically examined the validity of various imputation strategies for missing A1c values. We derived a complete dataset using an incident diabetes population that has no missing values in A1c, fasting and random plasma glucose (FPG and RPG), age, and gender. We then created missing A1c values under two assumptions: missing completely at random (MCAR) and missing at random (MAR). We then imputed A1c values, compared the imputed values to the true A1c values, and used these data to assess the impact of A1c on initiation of antihyperglycemic therapy. Under MCAR, imputation of A1c based on FPG (1) estimated a continuous A1c within ± 1.88% of the true A1c 68.3% of the time; (2) estimated a categorical A1c within ± one category from the true A1c about 50% of the time. Including RPG in imputation slightly improved the precision but did not improve the accuracy. Under MAR, including gender and age in addition to FPG improved the accuracy of imputed continuous A1c but not categorical A1c. Moreover, imputation of up to 33% of missing A1c values did not change the accuracy and precision and did not alter the impact of A1c on initiation of antihyperglycemic therapy. When using A1c values as a predictor variable, a simple imputation algorithm based only on age, sex, and fasting plasma glucose gave acceptable results.
Oren E Livne
Founder populations and large pedigrees offer many well-known advantages for genetic mapping studies, including cost-efficient study designs. Here, we describe PRIMAL (PedigRee IMputation ALgorithm), a fast and accurate pedigree-based phasing and imputation algorithm for founder populations. PRIMAL incorporates both existing and original ideas, such as a novel indexing strategy of Identity-By-Descent (IBD) segments based on clique graphs. We were able to impute the genomes of 1,317 South Dakota Hutterites, who had genome-wide genotypes for ~300,000 common single nucleotide variants (SNVs), from 98 whole genome sequences. Using a combination of pedigree-based and LD-based imputation, we were able to assign 87% of genotypes with >99% accuracy over the full range of allele frequencies. Using the IBD cliques we were also able to infer the parental origin of 83% of alleles, and genotypes of deceased recent ancestors for whom no genotype information was available. This imputed data set will enable us to better study the relative contribution of rare and common variants on human phenotypes, as well as parental origin effect of disease risk alleles in >1,000 individuals at minimal cost.
Dassonneville, R; Brøndum, Rasmus Froberg; Druet, T
for prediction of DGV and in France using a genomic marker-assisted selection approach for prediction of GEBV. Imputation in both studies was done using a combination of the DAGPHASE 1.1 and Beagle 2.1.3 software. Traits considered were protein yield, fertility, somatic cell count, and udder depth. Imputation...... of missing markers and prediction of breeding values were performed using 2 different reference populations in each country: either a national reference population or a combined EuroGenomics reference population. Validation for accuracy of imputation and genomic prediction was done based on national test...
Background: Missing data present a challenge to many research projects. The problem is often pronounced in studies utilizing self-report scales, and literature addressing different strategies for dealing with missing data in such circumstances is scarce. The objective of this study was to compare six different imputation techniques for dealing with missing data in the Zung Self-reported Depression scale (SDS). Methods: 1580 participants from a surgical outcomes study completed the SDS. The SDS is a 20 question scale that respondents complete by circling a value of 1 to 4 for each question. The sum of the responses is calculated and respondents are classified as exhibiting depressive symptoms when their total score is over 40. Missing values were simulated by randomly selecting questions whose values were then deleted (a missing completely at random simulation). Additionally, a missing at random and missing not at random simulation were completed. Six imputation methods were then considered: (1) multiple imputation, (2) single regression, (3) individual mean, (4) overall mean, (5) participant's preceding response, and (6) random selection of a value from 1 to 4. For each method, the imputed mean SDS score and standard deviation were compared to the population statistics. The Spearman correlation coefficient, percent misclassified and the Kappa statistic were also calculated. Results: When 10% of values are missing, all the imputation methods except random selection produce Kappa statistics greater than 0.80 indicating 'near perfect' agreement. MI produces the most valid imputed values with a high Kappa statistic (0.89), although both single regression and individual mean imputation also produced favorable results. As the percent of missing information increased to 30%, or when unbalanced missing data were introduced, MI maintained a high Kappa statistic. The individual mean and single regression method produced Kappas in the 'substantial agreement' range
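One of the simpler methods listed in this abstract, individual mean imputation, and the Kappa agreement measure can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's code: the function names are hypothetical, missing items are represented as `None`, and the kappa shown is the standard two-rater Cohen's kappa for binary labels.

```python
def impute_individual_mean(responses):
    """Fill each missing item (None) with the mean of that respondent's answered items."""
    answered = [x for x in responses if x is not None]
    if not answered:
        raise ValueError("respondent answered no items")
    mean = sum(answered) / len(answered)
    return [mean if x is None else x for x in responses]

def classify(responses, cutoff=40):
    """Flag depressive symptoms when the total score is over the cutoff."""
    return sum(responses) > cutoff

def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length lists of binary labels."""
    n = len(a)
    p_obs = sum(x == y for x, y in zip(a, b)) / n
    p_a, p_b = sum(a) / n, sum(b) / n
    # Agreement expected by chance from the two marginal rates.
    p_exp = p_a * p_b + (1 - p_a) * (1 - p_b)
    if p_exp == 1:
        return 1.0
    return (p_obs - p_exp) / (1 - p_exp)

# Made-up respondent with the last of 20 items missing.
partial = [3] * 10 + [2] * 9 + [None]
filled = impute_individual_mean(partial)
flagged = classify(filled)
```

Kappa would then be computed between the classifications obtained from the complete responses and those obtained after imputation, across all respondents.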
Yin, Xiaoyan; Levy, Daniel; Willinger, Christine; Adourian, Aram; Larson, Martin G
Multivariable analysis of proteomics data using standard statistical models is hindered by the presence of incomplete data. We faced this issue in a nested case-control study of 135 incident cases of myocardial infarction and 135 pair-matched controls from the Framingham Heart Study Offspring cohort. Plasma protein markers (K = 861) were measured on the case-control pairs (N = 135), and the majority of proteins had missing expression values for a subset of samples. In the setting of many more variables than observations (K ≫ N), we explored and documented the feasibility of multiple imputation approaches along with subsequent analysis of the imputed data sets. Initially, we selected proteins with complete expression data (K = 261) and randomly masked some values as the basis of simulation to tune the imputation and analysis process. We randomly shuffled proteins into several bins, performed multiple imputation within each bin, and followed up with stepwise selection using conditional logistic regression within each bin. This process was repeated hundreds of times. We determined the optimal method of multiple imputation, number of proteins per bin, and number of random shuffles using several performance statistics. We then applied this method to 544 proteins with incomplete expression data (≤ 40% missing values), from which we identified a panel of seven proteins that were jointly associated with myocardial infarction. © 2015 The Authors. Statistics in Medicine published by John Wiley & Sons Ltd.
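The shuffle-and-bin step described in this abstract can be sketched as below. The bin size of 30, the protein labels, and the function name are assumptions for illustration only; the abstract reports that the number of proteins per bin was tuned, without giving the value, and the imputation and regression steps inside each bin are not shown.

```python
import random

def shuffle_into_bins(names, bin_size, rng):
    """Randomly shuffle variable names and split them into consecutive bins."""
    shuffled = list(names)
    rng.shuffle(shuffled)
    return [shuffled[i:i + bin_size] for i in range(0, len(shuffled), bin_size)]

# 261 proteins with complete data, as in the simulation stage of the abstract.
proteins = [f"P{i}" for i in range(261)]
bins = shuffle_into_bins(proteins, 30, random.Random(1))
```

Repeating the call with different random states gives the repeated shuffles, so that each protein is analyzed alongside many different sets of neighbors.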
Atem, Folefac D; Qian, Jing; Maye, Jacqueline E; Johnson, Keith A; Betensky, Rebecca A
Randomly censored covariates arise frequently in epidemiologic studies. The most commonly used methods, including complete case and single imputation or substitution, suffer from inefficiency and bias. They make strong parametric assumptions or they consider limit of detection censoring only. We employ multiple imputation, in conjunction with semi-parametric modeling of the censored covariate, to overcome these shortcomings and to facilitate robust estimation. We develop a multiple imputation approach for randomly censored covariates within the framework of a logistic regression model. We use the non-parametric estimate of the covariate distribution or the semiparametric Cox model estimate in the presence of additional covariates in the model. We evaluate this procedure in simulations, and compare its operating characteristics to those from the complete case analysis and a survival regression approach. We apply the procedures to an Alzheimer's study of the association between amyloid positivity and maternal age of onset of dementia. Multiple imputation achieves lower standard errors and higher power than the complete case approach under heavy and moderate censoring and is comparable under light censoring. The survival regression approach achieves the highest power among all procedures, but does not produce interpretable estimates of association. Multiple imputation offers a favorable alternative to complete case analysis and ad hoc substitution methods in the presence of randomly censored covariates within the framework of logistic regression.
Kim, Soeun; Belin, Thomas R; Sugar, Catherine A
This paper investigates multiple imputation methods for regression models with interacting continuous and binary predictors when the continuous variable may be missing. Usual implementations of parametric multiple imputation assume a multivariate normal structure for the variables, which is satisfied neither by a binary variable nor by its interaction with a continuous variable. To accommodate interactions, missing covariates are multiply imputed from conditional distributions in a manner consistent with the joint model. Alternative imputation methods under multivariate normal assumptions are also considered as candidate approximations and evaluated in a simulation study. The results suggest that the joint modeling procedure generally performs well across a wide range of scenarios, as do the approximation methods that incorporate interactions in the model appropriately by stratification. It is critical to include interactions in the imputation model, as failure to do so may result in low coverage and bias. We apply the joint modeling approach and approximation methods in the study of childhood trauma with gender × trauma interaction. © The Author(s) 2016.
Edwards, Jessie K; Cole, Stephen R; Troester, Melissa A; Richardson, David B
Outcome misclassification is widespread in epidemiology, but methods to account for it are rarely used. We describe the use of multiple imputation to reduce bias when validation data are available for a subgroup of study participants. This approach is illustrated using data from 308 participants in the multicenter Herpetic Eye Disease Study between 1992 and 1998 (48% female; 85% white; median age, 49 years). The odds ratio comparing the acyclovir group with the placebo group on the gold-standard outcome (physician-diagnosed herpes simplex virus recurrence) was 0.62 (95% confidence interval (CI): 0.35, 1.09). We masked ourselves to physician diagnosis except for a 30% validation subgroup used to compare methods. Multiple imputation (odds ratio (OR) = 0.60; 95% CI: 0.24, 1.51) was compared with naive analysis using self-reported outcomes (OR = 0.90; 95% CI: 0.47, 1.73), analysis restricted to the validation subgroup (OR = 0.57; 95% CI: 0.20, 1.59), and direct maximum likelihood (OR = 0.62; 95% CI: 0.26, 1.53). In simulations, multiple imputation and direct maximum likelihood had greater statistical power than did analysis restricted to the validation subgroup, yet all 3 provided unbiased estimates of the odds ratio. The multiple-imputation approach was extended to estimate risk ratios using log-binomial regression. Multiple imputation has advantages regarding flexibility and ease of implementation for epidemiologists familiar with missing data methods.
Clustering algorithms can identify groups in large data sets, such as star catalogs and hyperspectral images. In general, clustering methods cannot analyze items that have missing data values. Common solutions either fill in the missing values (imputation) or ignore the missing data (marginalization). Imputed values are treated as just as reliable as the truly observed data, but they are only as good as the assumptions used to create them. In contrast, we present a method for encoding partially observed features as a set of supplemental soft constraints and introduce the KSC algorithm, which incorporates constraints into the clustering process. In experiments on artificial data and data from the Sloan Digital Sky Survey, we show that soft constraints are an effective way to enable clustering with missing values.
Background: Attrition, which leads to missing data, is a common problem in cluster randomized trials (CRTs), where groups of patients rather than individuals are randomized. Standard multiple imputation (MI) strategies may not be appropriate to impute missing data from CRTs since they assume independent data. In this paper, under the assumption of missing completely at random and covariate dependent missing, we compared six MI strategies which account for the intra-cluster correlation for missing binary outcomes in CRTs with the standard imputation strategies and complete case analysis approach using a simulation study. Method: We considered three within-cluster and three across-cluster MI strategies for missing binary outcomes in CRTs. The three within-cluster MI strategies are logistic regression method, propensity score method, and Markov chain Monte Carlo (MCMC) method, which apply standard MI strategies within each cluster. The three across-cluster MI strategies are propensity score method, random-effects (RE) logistic regression approach, and logistic regression with cluster as a fixed effect. Based on the community hypertension assessment trial (CHAT), which has complete data, we designed a simulation study to investigate the performance of the above MI strategies. Results: The estimated treatment effect and its 95% confidence interval (CI) from the generalized estimating equations (GEE) model based on the CHAT complete dataset are 1.14 (0.76, 1.70). When 30% of binary outcomes are missing completely at random, a simulation study shows that the estimated treatment effects and the corresponding 95% CIs from the GEE model are 1.15 (0.76, 1.75) if complete case analysis is used, 1.12 (0.72, 1.73) if within-cluster MCMC method is used, 1.21 (0.80, 1.81) if across-cluster RE logistic regression is used, and 1.16 (0.82, 1.64) if standard logistic regression which does not account for clustering is used. Conclusion: When the percentage of missing data is low or intra
This paper demonstrates the application of multiple imputation by chained equations and time series forecasting to wind speed data. The study was motivated by the high prevalence of missing historic wind speed data. Multiple imputation by chained equations, under the fully conditional specification, provided reliable imputations of the missing wind speed data. Further, the forecasting model shows a smoothing parameter alpha (0.014) close to zero, confirming that recent past observations are more suitable for use to forecast wind speeds. The maximum decadal wind speed for Entebbe International Airport was estimated to be 17.6 metres per second at a 0.05 level of significance with a bound on the error of estimation of 10.8 metres per second. The large bound on the error of estimation confirms the dynamic tendencies of wind speed at the airport under study.
Zhang, Zhaoyang; Fang, Hua
Disentangling patients' behavioral variations is a critical step for better understanding an intervention's effects on individual outcomes. Missing data commonly exist in longitudinal behavioral intervention studies. Multiple imputation (MI) has been well studied for missing data analyses in the statistical field, but has not yet been scrutinized for clustering or unsupervised learning, which are important techniques for explaining the heterogeneity of treatment effects. Building upon previous work on MI fuzzy clustering, this paper theoretically, empirically and numerically demonstrates how an MI-based approach can reduce the uncertainty of clustering accuracy in comparison to non-imputation and single-imputation based clustering approaches. This paper advances our understanding of the utility and strength of the MI-based fuzzy clustering approach to processing incomplete longitudinal behavioral intervention data.
Concepción Crespo Turrado
Nowadays, data collection is a key process in the study of electrical power networks when searching for harmonics and a lack of balance among phases. In this context, the lack of data for any of the main electrical variables (phase-to-neutral voltage, phase-to-phase voltage, current in each phase, and power factor) adversely affects any time series study performed. When this occurs, a data imputation process must be carried out in order to substitute estimated values for the missing data. This paper presents a novel missing data imputation method based on multivariate adaptive regression splines (MARS) and compares it with the well-known technique called multivariate imputation by chained equations (MICE). The results obtained demonstrate how the proposed method outperforms the MICE algorithm.
Completeness and validity of industrial data are fundamental factors in the domain of data-driven modeling. Aiming at the missing data problem of gas flow in the steel industry, an improved Generalized-Trend-Diffusion (iGTD) algorithm is proposed in this study, which in particular addresses data that are consecutively missing and drawn from small samples. The imputation accuracy can be greatly increased by the proposed Gaussian membership-based GTD, which expands the useful knowledge of data samples. In addition, the imputation order is further discussed to enhance the sequential forecasting accuracy of gas flow. To verify the effectiveness of the proposed method, a series of experiments covering three categories of data features in the gas system is presented, and the results indicate that this method is comprehensively better for the imputation of the periodical-like data and the time-series-like data.
Kim, Young Jin; Lee, Juyoung; Kim, Bong-Jo; Park, Taesung; Abecasis, Gonçalo; Almeida, Marcio; Altshuler, David; Asimit, Jennifer L.; Atzmon, Gil; Barber, Mathew; Barzilai, Nir; Beer, Nicola L.; Bell, Graeme I.; Below, Jennifer; Blackwell, Tom; Blangero, John; Boehnke, Michael; Bowden, Donald W.; Burtt, Noël; Chambers, John; Chen, Han; Chen, Peng; Chines, Peter S.; Choi, Sungkyoung; Churchhouse, Claire; Cingolani, Pablo; Cornes, Belinda K.; Cox, Nancy; Day-Williams, Aaron G.; Duggirala, Ravindranath; Dupuis, Josée; Dyer, Thomas; Feng, Shuang; Fernandez-Tajes, Juan; Ferreira, Teresa; Fingerlin, Tasha E.; Flannick, Jason; Florez, Jose; Fontanillas, Pierre; Frayling, Timothy M.; Fuchsberger, Christian; Gamazon, Eric R.; Gaulton, Kyle; Ghosh, Saurabh; Glaser, Benjamin; Gloyn, Anna; Grossman, Robert L.; Grundstad, Jason; Hanis, Craig; Heath, Allison; Highland, Heather; Horikoshi, Momoko; Huh, Ik-Soo; Huyghe, Jeroen R.; Ikram, Kamran; Jablonski, Kathleen A.; Jun, Goo; Kato, Norihiro; Kim, Jayoun; King, C. Ryan; Kooner, Jaspal; Kwon, Min-Seok; Im, Hae Kyung; Laakso, Markku; Lam, Kevin Koi-Yau; Lee, Jaehoon; Lee, Selyeong; Lee, Sungyoung; Lehman, Donna M.; Li, Heng; Lindgren, Cecilia M.; Liu, Xuanyao; Livne, Oren E.; Locke, Adam E.; Mahajan, Anubha; Maller, Julian B.; Manning, Alisa K.; Maxwell, Taylor J.; Mazoure, Alexander; McCarthy, Mark I.; Meigs, James B.; Min, Byungju; Mohlke, Karen L.; Morris, Andrew P.; Musani, Solomon; Nagai, Yoshihiko; Ng, Maggie C. Y.; Nicolae, Dan; Oh, Sohee; Palmer, Nicholette; Pollin, Toni I.; Prokopenko, Inga; Reich, David; Rivas, Manuel A.; Scott, Laura J.; Seielstad, Mark; Cho, Yoon Shin; Sim, Xueling; Sladek, Robert; Smith, Philip; Tachmazidou, Ioanna; Tai, E. Shyong; teo, Yik Ying; Teslovich, Tanya M.; Torres, Jason; Trubetskoy, Vasily; Willems, Sara M.; Williams, Amy L.; Wilson, James G.; Wiltshire, Steven; Won, Sungho; Wood, Andrew R.; Xu, Wang; Yoon, Joon; Zawistowski, Matthew; Zeggini, Eleftheria; Zhang, Weihua; Zöllner, Sebastian
Background: Rare variants have gathered increasing attention as a possible alternative source of missing heritability. Since next generation sequencing technology is not yet cost-effective for large-scale genomic studies, a widely used alternative approach is imputation. However, the imputation
Harold S.J. Zald; Janet L. Ohmann; Heather M. Roberts; Matthew J. Gregory; Emilie B. Henderson; Robert J. McGaughey; Justin Braaten
This study investigated how lidar-derived vegetation indices, disturbance history from Landsat time series (LTS) imagery, plot location accuracy, and plot size influenced accuracy of statistical spatial models (nearest-neighbor imputation maps) of forest vegetation composition and structure. Nearest-neighbor (NN) imputation maps were developed for 539,000 ha in the...
Background We explored the imputation performance of the program IMPUTE in an admixed sample from Mexico City. The following issues were evaluated: (a) the impact of different reference panels (HapMap vs. 1000 Genomes) on imputation; (b) potential differences in imputation performance between single-step vs. two-step (phasing and imputation) approaches; (c) the effect of different INFO score thresholds on imputation performance and (d) imputation performance in common vs. rare markers. Methods The sample from Mexico City comprised 1,310 individuals genotyped with the Affymetrix 5.0 array. We randomly masked 5% of the markers directly genotyped on chromosome 12 (n = 1,046) and compared the imputed genotypes with the microarray genotype calls. Imputation was carried out with the program IMPUTE. The concordance rates between the imputed and observed genotypes were used as a measure of imputation accuracy and the proportion of non-missing genotypes as a measure of imputation efficacy. Results The single-step imputation approach produced slightly higher concordance rates than the two-step strategy (99.1% vs. 98.4% when using the HapMap phase II combined panel), but at the expense of a lower proportion of non-missing genotypes (85.5% vs. 90.1%). The 1,000 Genomes reference sample produced similar concordance rates to the HapMap phase II panel (98.4% for both datasets, using the two-step strategy). However, the 1000 Genomes reference sample increased substantially the proportion of non-missing genotypes (94.7% vs. 90.1%). Rare variants (Mexico City, which has primarily Native American (62%) and European (33%) contributions. Genotype concordances were higher than 98.4% using all the imputation strategies, in spite of the fact that no Native American samples are present in the HapMap and 1000 Genomes reference panels. The best balance of imputation accuracy and efficiency was obtained with the 1,000 Genomes panel. Rare variants were not captured effectively by any of
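The two evaluation measures used in this abstract, the concordance rate among imputed genotypes and the proportion of non-missing genotypes, can be computed as in the sketch below. The 0/1/2 genotype coding, the use of `None` for a genotype the imputation did not call, and the function name are assumptions for illustration; the toy values are made up.

```python
def imputation_metrics(observed, imputed):
    """Return (concordance, call_rate) for a set of masked genotypes.

    observed: true genotype calls, e.g. 0/1/2 copies of the minor allele
    imputed:  imputed calls, with None where no genotype was called
    """
    if len(observed) != len(imputed):
        raise ValueError("observed and imputed must have the same length")
    called = [(o, i) for o, i in zip(observed, imputed) if i is not None]
    # Efficacy: share of masked genotypes that received a call at all.
    call_rate = len(called) / len(observed) if observed else float("nan")
    # Accuracy: share of called genotypes that match the observed call.
    concordance = (
        sum(o == i for o, i in called) / len(called) if called else float("nan")
    )
    return concordance, call_rate

# Toy example: five masked genotypes, one left uncalled, one called wrongly.
concordance, call_rate = imputation_metrics([0, 1, 2, 1, 0], [0, 1, None, 2, 0])
```

Reporting both numbers together matters because, as the results in the abstract show, a stricter calling threshold can raise concordance while lowering the proportion of genotypes that are called.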
Vukcevic, Damjan; Traherne, James A; Næss, Sigrid; Ellinghaus, Eva; Kamatani, Yoichiro; Dilthey, Alexander; Lathrop, Mark; Karlsen, Tom H; Franke, Andre; Moffatt, Miriam; Cookson, William; Trowsdale, John; McVean, Gil; Sawcer, Stephen; Leslie, Stephen
Large population studies of immune system genes are essential for characterizing their role in diseases, including autoimmune conditions. Of key interest are a group of genes encoding the killer cell immunoglobulin-like receptors (KIRs), which have known and hypothesized roles in autoimmune diseases, resistance to viruses, reproductive conditions, and cancer. These genes are highly polymorphic, which makes typing expensive and time consuming. Consequently, despite their importance, KIRs have been little studied in large cohorts. Statistical imputation methods developed for other complex loci (e.g., human leukocyte antigen [HLA]) on the basis of SNP data provide an inexpensive high-throughput alternative to direct laboratory typing of these loci and have enabled important findings and insights for many diseases. We present KIR∗IMP, a method for imputation of KIR copy number. We show that KIR∗IMP is highly accurate and thus allows the study of KIRs in large cohorts and enables detailed investigation of the role of KIRs in human disease. Copyright © 2015 The Authors. Published by Elsevier Inc. All rights reserved.
Vukcevic, Damjan; Traherne, James A.; Næss, Sigrid; Ellinghaus, Eva; Kamatani, Yoichiro; Dilthey, Alexander; Lathrop, Mark; Karlsen, Tom H.; Franke, Andre; Moffatt, Miriam; Cookson, William; Trowsdale, John; McVean, Gil; Sawcer, Stephen; Leslie, Stephen
Large population studies of immune system genes are essential for characterizing their role in diseases, including autoimmune conditions. Of key interest are a group of genes encoding the killer cell immunoglobulin-like receptors (KIRs), which have known and hypothesized roles in autoimmune diseases, resistance to viruses, reproductive conditions, and cancer. These genes are highly polymorphic, which makes typing expensive and time consuming. Consequently, despite their importance, KIRs have been little studied in large cohorts. Statistical imputation methods developed for other complex loci (e.g., human leukocyte antigen [HLA]) on the basis of SNP data provide an inexpensive high-throughput alternative to direct laboratory typing of these loci and have enabled important findings and insights for many diseases. We present KIR∗IMP, a method for imputation of KIR copy number. We show that KIR∗IMP is highly accurate and thus allows the study of KIRs in large cohorts and enables detailed investigation of the role of KIRs in human disease. PMID:26430804
Willer, Cristen J; Bonnycastle, Lori L; Conneely, Karen N; Duren, William L; Jackson, Anne U; Scott, Laura J; Narisu, Narisu; Chines, Peter S; Skol, Andrew; Stringham, Heather M; Petrie, John; Erdos, Michael R; Swift, Amy J; Enloe, Sareena T; Sprau, Andrew G; Smith, Eboni; Tong, Maurine; Doheny, Kimberly F; Pugh, Elizabeth W; Watanabe, Richard M; Buchanan, Thomas A; Valle, Timo T; Bergman, Richard N; Tuomilehto, Jaakko; Mohlke, Karen L; Collins, Francis S; Boehnke, Michael
More than 120 published reports have described associations between single nucleotide polymorphisms (SNPs) and type 2 diabetes. However, multiple studies of the same variant have often been discordant. From a literature search, we identified previously reported type 2 diabetes-associated SNPs. We initially genotyped 134 SNPs on 786 index case subjects from type 2 diabetes families and 617 control subjects with normal glucose tolerance from Finland and excluded from analysis 20 SNPs in strong linkage disequilibrium (r(2) > 0.8) with another typed SNP. Of the 114 SNPs examined, we followed up the 20 most significant SNPs (P < 0.10) on an additional 384 case subjects and 366 control subjects from a population-based study in Finland. In the combined data, we replicated association (P < 0.05) for 12 SNPs: PPARG Pro12Ala and His447, KCNJ11 Glu23Lys and rs5210, TNF -857, SLC2A2 Ile110Thr, HNF1A/TCF1 rs2701175 and GE117881_360, PCK1 -232, NEUROD1 Thr45Ala, IL6 -598, and ENPP1 Lys121Gln. The replication of 12 SNPs of 114 tested was significantly greater than expected by chance under the null hypothesis of no association (P = 0.012). We observed that SNPs from genes that had three or more previous reports of association were significantly more likely to be replicated in our sample (P = 0.03), although we also replicated 4 of 58 SNPs from genes that had only one previous report of association.
Siew, Edward D; Peterson, Josh F; Eden, Svetlana K; Moons, Karel G; Ikizler, T Alp; Matheny, Michael E
Baseline creatinine (BCr) is frequently missing in AKI studies. Common surrogate estimates can misclassify AKI and adversely affect the study of related outcomes. This study examined whether multiple imputation improved accuracy of estimating missing BCr beyond current recommendations to apply assumed estimated GFR (eGFR) of 75 ml/min per 1.73 m(2) (eGFR 75). From 41,114 unique adult admissions (13,003 with and 28,111 without BCr data) at Vanderbilt University Hospital between 2006 and 2008, a propensity score model was developed to predict likelihood of missing BCr. Propensity scoring identified 6502 patients with highest likelihood of missing BCr among 13,003 patients with known BCr to simulate a "missing" data scenario while preserving actual reference BCr. Within this cohort (n=6502), the ability of various multiple-imputation approaches to estimate BCr and classify AKI were compared with that of eGFR 75. All multiple-imputation methods except the basic one more closely approximated actual BCr than did eGFR 75. Total AKI misclassification was lower with multiple imputation (full multiple imputation + serum creatinine) (9.0%) than with eGFR 75 (12.3%; Pmultiple imputation + serum creatinine) (15.3%) versus eGFR 75 (40.5%; PMultiple imputation improved specificity and positive predictive value for detecting AKI at the expense of modestly decreasing sensitivity relative to eGFR 75. Multiple imputation can improve accuracy in estimating missing BCr and reduce misclassification of AKI beyond currently proposed methods.
Lee, Minjung; Dignam, James J.; Han, Junhee
We propose a nonparametric approach for cumulative incidence estimation when causes of failure are unknown or missing for some subjects. Under the missing at random assumption, we estimate the cumulative incidence function using multiple imputation methods. We develop asymptotic theory for the cumulative incidence estimators obtained from multiple imputation methods. We also discuss how to construct confidence intervals for the cumulative incidence function and perform a test for comparing the cumulative incidence functions in two samples with missing cause of failure. Through simulation studies, we show that the proposed methods perform well. The methods are illustrated with data from a randomized clinical trial in early stage breast cancer. PMID:25043107
Välikangas, Tommi; Suomi, Tomi; Elo, Laura L
Label-free mass spectrometry (MS) has developed into an important tool applied in various fields of biological and life sciences. Several software packages exist to process the raw MS data into quantified protein abundances, including open source and commercial solutions. Each package includes a set of unique algorithms for different tasks of the MS data processing workflow. While many of these algorithms have been compared separately, a thorough and systematic evaluation of their overall performance is missing. Moreover, systematic information is lacking about the amount of missing values produced by the different proteomics software and the capabilities of different data imputation methods to account for them. In this study, we evaluated the performance of five popular quantitative label-free proteomics software workflows using four different spike-in data sets. Our extensive testing included the number of proteins quantified and the number of missing values produced by each workflow, the accuracy of detecting differential expression and logarithmic fold change, and the effect of different imputation and filtering methods on the differential expression results. We found that the Progenesis software performed consistently well in the differential expression analysis and produced few missing values. The missing values produced by the other software decreased their performance, but this difference could be mitigated using proper data filtering or imputation methods. Among the imputation methods, we found that the local least squares (lls) regression imputation consistently increased the performance of the software in the differential expression analysis, and a combination of both data filtering and local least squares imputation increased performance the most in the tested data sets. © The Author 2017. Published by Oxford University Press.
Liu, Yu; Enders, Craig K
In ordinary least squares regression, researchers are often interested in knowing whether a set of parameters is different from zero. With complete data, this could be achieved using the gain in prediction test, hierarchical multiple regression, or an omnibus F test. However, in substantive research scenarios, missing data often exist. In the context of multiple imputation, one of the current state-of-the-art missing data strategies, there are several analogous multi-parameter tests of the joint significance of a set of parameters, and these multi-parameter test statistics can be referenced to various distributions to make statistical inferences. However, little is known about the performance of these tests, and virtually no research study has compared the Type 1 error rates and statistical power of these tests in scenarios that are typical of behavioral science data (e.g., small to moderate samples). This paper uses Monte Carlo simulation techniques to examine the performance of these multi-parameter test statistics for multiple imputation under a variety of realistic conditions. We provide a number of practical recommendations for substantive researchers based on the simulation results, and illustrate the calculation of these test statistics with an empirical example.
Kabisch, Maria; Hamann, Ute; Lorenzo Bermejo, Justo
Genotypes not directly measured in genetic studies are often imputed to improve statistical power and to increase mapping resolution. The accuracy of standard imputation techniques strongly depends on the similarity of linkage disequilibrium (LD) patterns in the study and reference populations. Here we develop a novel approach for genotype imputation in low-recombination regions that relies on the coalescent and makes it possible to explicitly account for population demographic factors. To test the new method, study and reference haplotypes were simulated and gene trees were inferred under the basic coalescent and also considering population growth and structure. The reference haplotypes that first coalesced with study haplotypes were used as templates for genotype imputation. Computer simulations were complemented with the analysis of real data. Genotype concordance rates were used to compare the accuracies of coalescent-based and standard (IMPUTE2) imputation. Simulations revealed that, in LD-blocks, imputation accuracy relying on the basic coalescent was higher and less variable than with IMPUTE2. Explicit consideration of population growth and structure, even if present, did not practically improve accuracy. The advantage of coalescent-based over standard imputation increased with the minor allele frequency and decreased with population stratification. Results based on real data indicated that, even in low-recombination regions, further research is needed to incorporate recombination in coalescence inference, in particular for studies with genetically diverse and admixed individuals. To exploit the full potential of coalescent-based methods for the imputation of missing genotypes in genetic studies, further methodological research is needed to reduce computer time, to take into account recombination, and to implement these methods in user-friendly computer programs. Here we provide reproducible code which takes advantage of publicly available software to facilitate
The objective of this study was to investigate the accuracy of imputation from low density (LDC) to moderate density SNP chips (MDC) in a Thai Holstein-Other multibreed dairy cattle population. Dairy cattle with complete pedigree information (n = 1,244) from 145 dairy farms were genotyped with GeneSeek GGP20K (n = 570), GGP26K (n = 540) and GGP80K (n = 134) chips. After checking for single nucleotide polymorphism (SNP) quality, 17,779 SNP markers in common between the GGP20K, GGP26K, and GGP80K were used to represent MDC. Animals were divided into two groups, a reference group (n = 912) and a test group (n = 332). The SNP markers chosen for the test group were those located in positions corresponding to GeneSeek GGP9K (n = 7,652). The LDC to MDC genotype imputation was carried out using three different software packages, namely Beagle 3.3 (population-based algorithm), FImpute 2.2 (combined family- and population-based algorithms) and Findhap 4 (combined family- and population-based algorithms). Imputation accuracies within and across chromosomes were calculated as ratios of correctly imputed SNP markers to overall imputed SNP markers. Imputation accuracy for the three software packages ranged from 76.79% to 93.94%. FImpute had higher imputation accuracy (93.94%) than Findhap (84.64%) and Beagle (76.79%). Imputation accuracies were similar and consistent across chromosomes for FImpute, but not for Findhap and Beagle. Most chromosomes that showed either high (73%) or low (80%) imputation accuracies were the same chromosomes that had above and below average linkage disequilibrium (LD; defined here as the correlation between pairs of adjacent SNP within chromosomes less than or equal to 1 Mb apart). Results indicated that FImpute was more suitable than Findhap and Beagle for genotype imputation in this Thai multibreed population. Perhaps additional increments in imputation accuracy could be achieved by increasing the completeness of pedigree information.
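The accuracy measure described in the abstract above (correctly imputed SNP markers divided by all imputed SNP markers) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name, the 0/1/2 genotype coding and the example vectors are assumptions made here for demonstration.

```python
from typing import Sequence


def imputation_accuracy(true_genotypes: Sequence[int],
                        imputed_genotypes: Sequence[int]) -> float:
    """Return the fraction of imputed genotypes that equal the true genotypes."""
    if len(true_genotypes) != len(imputed_genotypes):
        raise ValueError("genotype vectors must have the same length")
    if len(true_genotypes) == 0:
        raise ValueError("no imputed markers to score")
    # Count markers where the imputed call matches the masked true call.
    correct = sum(t == i for t, i in zip(true_genotypes, imputed_genotypes))
    return correct / len(true_genotypes)


# Hypothetical genotypes coded as 0, 1 or 2 copies of one allele.
true_g = [0, 1, 2, 1, 0, 2, 1, 1]
imputed_g = [0, 1, 2, 0, 0, 2, 1, 2]
accuracy = imputation_accuracy(true_g, imputed_g)
```

Applying the same function to the markers of one chromosome at a time would give the per-chromosome accuracies that the abstract compares across software packages.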
In this study, the bovine CACNA2D1 gene was taken as a candidate gene for mastitis resistance. The objective of this study was to identify single nucleotide polymorphisms (SNPs) in the bovine CACNA2D1 gene and evaluate the association of these SNPs with mastitis in cattle. Through DNA sequencing and PCR-RFLP ...
Vaden, Kenneth I; Gebregziabher, Mulugeta; Kuchinsky, Stefanie E; Eckert, Mark A
Whole brain fMRI analyses rarely include the entire brain because of missing data that result from data acquisition limits and susceptibility artifact, in particular. This missing data problem is typically addressed by omitting voxels from analysis, which may exclude brain regions that are of theoretical interest and increase the potential for Type II error at cortical boundaries or Type I error when spatial thresholds are used to establish significance. Imputation could significantly expand statistical map coverage, increase power, and enhance interpretations of fMRI results. We examined multiple imputation for group level analyses of missing fMRI data using methods that leverage the spatial information in fMRI datasets for both real and simulated data. Available case analysis, neighbor replacement, and regression based imputation approaches were compared in a general linear model framework to determine the extent to which these methods quantitatively (effect size) and qualitatively (spatial coverage) increased the sensitivity of group analyses. In both real and simulated data analysis, multiple imputation provided 1) variance that was most similar to estimates for voxels with no missing data, 2) fewer false positive errors in comparison to mean replacement, and 3) fewer false negative errors in comparison to available case analysis. Compared to the standard analysis approach of omitting voxels with missing data, imputation methods increased brain coverage in this study by 35% (from 33,323 to 45,071 voxels). In addition, multiple imputation increased the size of significant clusters by 58% and number of significant clusters across statistical thresholds, compared to the standard voxel omission approach. While neighbor replacement produced similar results, we recommend multiple imputation because it uses an informed sampling distribution to deal with missing data across subjects that can include neighbor values and other predictors. Multiple imputation is
Sanchez, Juan J; Børsting, Claus; Hallenberg, Charlotte
We have developed a robust single nucleotide polymorphism (SNP) typing assay with co-amplification of 25 DNA-fragments and the detection of 35 human Y chromosome SNPs. The sizes of the PCR products ranged from 79 to 186 base pairs. PCR primers were designed to have a theoretical Tm of 60 +/- 5 d...
Oskari Kilpeläinen, Tuomas; Lakka, Timo A; Laaksonen, David E
To study the associations of seven single-nucleotide polymorphisms (SNPs) in the peroxisome proliferator-activated receptor gamma (PPARG) gene with the conversion from impaired glucose tolerance (IGT) to type 2 diabetes (T2D), and the interactions of the SNPs with physical activity (PA)....
Web-based Data Imputation enables the completion of incomplete data sets by retrieving absent field values from the Web. In particular, complete fields can be used as keywords in imputation queries for absent fields. However, due to the ambiguity of these keywords and the data complexity on the Web, different queries may retrieve different answers to the same absent field value. To decide the most probable right answer to each absent field value, existing methods issue many of the available imputation queries for each absent value and then vote on the most probable right answer. As a result, a large number of imputation queries must be issued to fill all absent values in an incomplete data set, which brings a large overhead. In this paper, we work on reducing the cost of Web-based Data Imputation in two aspects: First, we propose a query execution scheme which can secure the most probable right answer to an absent field value by issuing as few imputation queries as possible. Second, we recognize and prune, a priori, queries that will probably fail to return any answers. Our extensive experimental evaluation shows that our proposed techniques substantially reduce the cost of Web-based Imputation without hurting its high imputation accuracy. © 2014 Springer International Publishing Switzerland.
Mislevy, Robert J.
Multiple imputations for latent variables are constructed so that analyses treating them as true variables have the correct expectations for population characteristics. Analyzing multiple imputations in accordance with their construction yields correct estimates of population characteristics, whereas analyzing them as multiple indicators generally…
van Buuren, Stef; Groothuis-Oudshoorn, Catharina Gerarda Maria
The R package mice imputes incomplete multivariate data by chained equations. The software mice 1.0 appeared in the year 2000 as an S-PLUS library, and in 2001 as an R package. mice 1.0 introduced predictor selection, passive imputation and automatic pooling. This article documents mice, which
Buuren, S. van; Groothuis-Oudshoorn, K.
Multivariate Imputation by Chained Equations (MICE) is the name of software for imputing incomplete multivariate data by Fully Conditional Specification (FCS). MICE V1.0 appeared in the year 2000 as an S-PLUS library, and in 2001 as an R package. MICE V1.0 introduced predictor selection, passive
... 12 Banks and Banking 4 2010-01-01 2010-01-01 false Imputation of causes. 367.9 Section 367.9 Banks... SUSPENSION AND EXCLUSION OF CONTRACTOR AND TERMINATION OF CONTRACTS § 367.9 Imputation of causes. (a) Where there is cause to suspend and/or exclude any affiliated business entity of the contractor, that conduct...
Kmetic, Andrew; Joseph, Lawrence; Berger, Claudie; Tenenhouse, Alan
Nonresponse bias is a concern in any epidemiologic survey in which a subset of selected individuals declines to participate. We reviewed multiple imputation, a widely applicable and easy to implement Bayesian methodology to adjust for nonresponse bias. To illustrate the method, we used data from the Canadian Multicentre Osteoporosis Study, a large cohort study of 9423 randomly selected Canadians, designed in part to estimate the prevalence of osteoporosis. Although subjects were randomly selected, only 42% of individuals who were contacted agreed to participate fully in the study. The study design included a brief questionnaire for those invitees who declined further participation in order to collect information on the major risk factors for osteoporosis. These risk factors (which included age, sex, previous fractures, family history of osteoporosis, and current smoking status) were then used to estimate the missing osteoporosis status for nonparticipants using multiple imputation. Both ignorable and nonignorable imputation models are considered. Our results suggest that selection bias in the study is of concern, but only slightly, in very elderly (age 80+ years), both women and men. Epidemiologists should consider using multiple imputation more often than is current practice.
Artigas, Maria Soler; Wain, Louise V.; Miller, Suzanne; Kheirallah, Abdul Kader; Huffman, Jennifer E.; Ntalla, Ioanna; Shrine, Nick; Obeidat, Ma'en; Trochet, Holly; McArdle, Wendy L.; Alves, Alexessander Couto; Hui, Jennie; Zhao, Jing Hua; Joshi, Peter K.; Teumer, Alexander; Albrecht, Eva; Imboden, Medea; Rawal, Rajesh; Lopez, Lorna M.; Marten, Jonathan; Enroth, Stefan; Surakka, Ida; Polasek, Ozren; Lyytikainen, Leo-Pekka; Granell, Raquel; Hysi, Pirro G.; Flexeder, Claudia; Mahajan, Anubha; Beilby, John; Bosse, Yohan; Brandsma, Corry-Anke; Campbell, Harry; Gieger, Christian; Glaeser, Sven; Gonzalez, Juan R.; Grallert, Harald; Hammond, Chris J.; Harris, Sarah E.; Hartikainen, Anna-Liisa; Heliovaara, Markku; Henderson, John; Hocking, Lynne; Horikoshi, Momoko; Hutri-Kahonen, Nina; Ingelsson, Erik; Johansson, Asa; Kemp, John P.; Kolcic, Ivana; Kumar, Ashish; Lind, Lars; Melen, Erik; Musk, Arthur W.; Navarro, Pau; Nickle, David C.; Padmanabhan, Sandosh; Raitakari, Olli T.; Ried, Janina S.; Ripatti, Samuli; Schulz, Holger; Scott, Robert A.; Sin, Don D.; Starr, John M.; Vinuela, Ana; Voelzke, Henry; Wild, Sarah H.; Wright, Alan F.; Zemunik, Tatijana; Jarvis, Deborah L.; Spector, Tim D.; Evans, David M.; Lehtimaki, Terho; Vitart, Veronique; Kahonen, Mika; Gyllensten, Ulf; Rudan, Igor; Deary, Ian J.; Karrasch, Stefan; Probst-Hensch, Nicole M.; Heinrich, Joachim; Stubbe, Beate; Wilson, James F.; Wareham, Nicholas J.; James, Alan L.; Morris, Andrew P.; Jarvelin, Marjo-Riitta; Hayward, Caroline; Sayers, Ian; Strachan, David P.; Hall, Ian P.; Tobin, Martin D.; Deloukas, Panos; Hansell, Anna L.; Hubbard, Richard; Jackson, Victoria E.; Marchini, Jonathan; Pavord, Ian; Thomson, Neil C.; Zeggini, Eleftheria
Lung function measures are used in the diagnosis of chronic obstructive pulmonary disease. In 38,199 European ancestry individuals, we studied genome-wide association of forced expiratory volume in 1 s (FEV1), forced vital capacity (FVC) and FEV1/FVC with 1000 Genomes Project (phase 1)-imputed
Finch, W. Holmes
Multivariate analysis of variance (MANOVA) is widely used in educational research to compare means on multiple dependent variables across groups. Researchers faced with the problem of missing data often use multiple imputation of values in place of the missing observations. This study compares the performance of 2 methods for combining p values in…
Yoo, Jin Eun
This Monte Carlo study investigates the beneficial effect of including auxiliary variables during estimation of confirmatory factor analysis models with multiple imputation. Specifically, it examines the influence of sample size, missing rates, missingness mechanism combinations, missingness types (linear or convex), and the absence or presence…
Twisk, J.; de Boer, M.; de Vente, W.; Heymans, M.
Background and Objectives: As a result of the development of sophisticated techniques, such as multiple imputation, the interest in handling missing data in longitudinal studies has increased enormously in past years. Within the field of longitudinal data analysis, there is a current debate on
Bouwman, A.C.; Veerkamp, R.F.
The aim of this study was to determine the consequences of splitting sequencing effort over multiple breeds for imputation accuracy from a high-density SNP chip towards whole-genome sequence. Such information would assist, for instance, numerically smaller cattle breeds, but also pig and chicken
MacNeil Vroomen, Janet; Eekhout, Iris; Dijkgraaf, Marcel G; van Hout, Hein; de Rooij, Sophia E; Heymans, Martijn W; Bosmans, Judith E
Cost and effect data often have missing data because economic evaluations are frequently added onto clinical studies where cost data are rarely the primary outcome. The objective of this article was to investigate which multiple imputation strategy is most appropriate to use for missing
Siddique, Juned; de Chavez, Peter J; Howe, George; Cruden, Gracelyn; Brown, C Hendricks
Individual participant data (IPD) meta-analysis is a meta-analysis in which the individual-level data for each study are obtained and used for synthesis. A common challenge in IPD meta-analysis is when variables of interest are measured differently in different studies. The term harmonization has been coined to describe the procedure of placing variables on the same scale in order to permit pooling of data from a large number of studies. Using data from an IPD meta-analysis of 19 adolescent depression trials, we describe a multiple imputation approach for harmonizing 10 depression measures across the 19 trials by treating those depression measures that were not used in a study as missing data. We then apply diagnostics to address the fit of our imputation model. Even after reducing the scale of our application, we were still unable to produce accurate imputations of the missing values. We describe those features of the data that made it difficult to harmonize the depression measures and provide some guidelines for using multiple imputation for harmonization in IPD meta-analysis.
Background: Nowadays, more and more clinical scales consisting of responses given by patients to a set of items (Patient Reported Outcomes, PRO) are validated with models based on Item Response Theory, and more specifically with a Rasch model. In the validation sample, missing data are frequent. The aim of this paper is to compare sixteen methods for handling the missing data (mainly based on simple imputation) in the context of psychometric validation of PRO by a Rasch model. The main indexes used for validation by a Rasch model are compared. Methods: A simulation study was performed that considered several cases, notably whether the missing values were informative or not and the rate of missing data. Results: Several imputation methods produce bias in psychometric indexes (generally, the imputation methods artificially improve the psychometric qualities of the scale). In particular, this is the case for the method based on the Personal Mean Score (PMS), which is the imputation method most commonly used in practice. Conclusions: Several imputation methods should be avoided, in particular PMS imputation. From a general point of view, it is important to use an imputation method that considers both the ability of the patient (measured, for example, by his/her score) and the difficulty of the item (measured, for example, by its rate of favourable responses). Another recommendation is to always consider adding a random process to the imputation method, because such a process reduces the bias. Last, the analysis carried out without imputation of the missing data (available case analysis) is an interesting alternative to simple imputation in this context.
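To make concrete what Personal Mean Score imputation does, the following Python sketch fills each missing item response with the mean of that patient's observed responses. It is an illustrative reading of the method named in the abstract above, not the authors' implementation: the function name, the use of `None` for missing responses and the choice to leave the imputed value unrounded are all assumptions.

```python
from typing import List, Optional


def personal_mean_score_impute(responses: List[Optional[float]]) -> List[float]:
    """Replace missing responses (None) with the mean of the observed ones."""
    observed = [r for r in responses if r is not None]
    if not observed:
        raise ValueError("cannot impute for a patient with no observed responses")
    # The imputed value depends only on this patient's own observed items.
    person_mean = sum(observed) / len(observed)
    return [person_mean if r is None else r for r in responses]


# Hypothetical patient with five dichotomous items, two of them missing.
patient = [1, 0, None, 1, None]
completed = personal_mean_score_impute(patient)
```

The sketch uses only the patient's own responses and no information about item difficulty, whereas the abstract recommends imputation methods that consider both.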
While microarrays make it feasible to rapidly investigate many complex biological problems, their multistep fabrication has the proclivity for error at every stage. The standard tactic has been to either ignore or regard erroneous gene readings as missing values, though this assumption can exert a major influence upon postgenomic knowledge discovery methods like gene selection and gene regulatory network (GRN) reconstruction. This has been the catalyst for a raft of new flexible imputation algorithms including local least square impute and the recent heuristic collateral missing value imputation, which exploit the biological transactional behaviour of functionally correlated genes to afford accurate missing value estimation. This paper examines the influence of missing value imputation techniques upon postgenomic knowledge inference methods with results for various algorithms consistently corroborating that instead of ignoring missing values, recycling microarray data by flexible and robust imputation can provide substantial performance benefits for subsequent downstream procedures.
Mehmood, Ansar; Murtaza, Ghulam
Nanotechnology opens an enormous scope of novel applications in the fields of biotechnology and agricultural industries, because nanoparticles (NPs) have unique physicochemical properties, i.e. high surface area, high reactivity, tunable pore size and particle morphology. The present study was carried out to determine the role of silver NPs (SNPs) in improving the yield of Pisum sativum L. SNPs (10-100 nm) were synthesised by a green method using extract of Berberis lycium Royle. Pea seeds were soaked in, and seedlings were foliar-sprayed with, 0, 30, 60 and 90 ppm SNPs. The experiment was arranged as a split-split plot randomised complete block design with three replicates. The application of SNPs significantly enhanced the number of seeds pod⁻¹, number of pods plant⁻¹, hundred seed weight, biological yield and green pod yield over control. The highest yield was found when 60 ppm SNPs were applied. However, on exposure to 90 ppm SNPs, the yield of the pea decreased significantly as compared with 30 and 60 ppm. This research shows that SNPs have a definite ability to improve growth and yield of crops. Nevertheless, comprehensive experimentation is needed to establish the most appropriate concentration, size and mode of application of SNPs for higher growth and maximum yield of pea.
The Agricultural Health Study (AHS), a large prospective cohort, was designed to elucidate associations between pesticide use and other agricultural exposures and health outcomes. The cohort includes 57,310 pesticide applicators who were enrolled between 1993 and 1997 in Iowa and...
Nalls, Michael A.; Plagnol, Vincent; Hernandez, Dena G.; Sharma, Manu; Sheerin, Una-Marie; Saad, Mohamad; Simon-Sanchez, Javier; Schulte, Claudia; Lesage, Suzanne; Sveinbjornsdottir, Sigurlaug; Arepalli, Sampath; Barker, Roger; Ben-Shlomo, Yoav; Berendse, Henk W.; Berg, Daniela; Bhatia, Kailash; de Bie, Rob M. A.; Biffi, Alessandro; Bloem, Bas; Bochdanovits, Zoltan; Bonin, Michael; Bras, Jose M.; Brockmann, Kathrin; Brooks, Janet; Burn, David J.; Charlesworth, Gavin; Chen, Honglei; Chinnery, Patrick F.; Chong, Sean; Clarke, Carl E.; Cookson, Mark R.; Cooper, J. Mark; Corvol, Jean Christophe; Counsell, Carl; Damier, Philippe; Dartigues, Jean-Francois; Deloukas, Panos; Deuschl, Guenther; Dexter, David T.; van Dijk, Karin D.; Dillman, Allissa; Durif, Frank; Duerr, Alexandra; Edkins, Sarah; Evans, Jonathan R.; Foltynie, Thomas; Gao, Jianjun; Gardner, Michelle; Gibbs, J. Raphael; Goate, Alison; Gray, Emma; Guerreiro, Rita; Gustafsson, Omar; Harris, Clare; van Hilten, Jacobus J.; Hofman, Albert; Hollenbeck, Albert; Holton, Janice; Hu, Michele; Huang, Xuemei; Huber, Heiko; Hudson, Gavin; Hunt, Sarah E.; Huttenlocher, Johanna; Illig, Thomas; Jonsson, Palmi V.; Lambert, Jean-Charles; Langford, Cordelia; Lees, Andrew; Lichtner, Peter; Limousin, Patricia; Lopez, Grisel; Lorenz, Delia; McNeill, Alisdair; Moorby, Catriona; Moore, Matthew; Morris, Huw R.; Morrison, Karen E.; Mudanohwo, Ese; O'Sullivan, Sean S.; Pearson, Justin; Perlmutter, Joel S.; Petursson, Hjoervar; Pollak, Pierre; Post, Bart; Potter, Simon; Ravina, Bernard; Revesz, Tamas; Riess, Olaf; Rivadeneira, Fernando; Rizzu, Patrizia; Ryten, Mina; Sawcer, Stephen; Schapira, Anthony; Scheffer, Hans; Shaw, Karen; Shoulson, Ira; Sidransky, Ellen; Smith, Colin; Spencer, Chris C. A.; Stefansson, Hreinn; Stockton, Joanna D.; Strange, Amy; Talbot, Kevin; Tanner, Carlie M.; Tashakkori-Ghanbaria, Avazeh; Tison, Francois; Trabzuni, Daniah; Traynor, Bryan J.; Uitterlinden, Andre G.; Velseboer, Daan; Vidailhet, Marie; Walker, Robert; van de Warrenburg, Bart; Wickremaratchi, Mirdhu; Williams, Nigel; Williams-Gray, Caroline H.; Winder-Rhodes, Sophie; Stefansson, Kari; Martinez, Maria; Hardy, John; Heutink, Peter; Brice, Alexis; Gasser, Thomas; Singleton, Andrew B.; Wood, Nicholas W.
Background: Genome-wide association studies (GWAS) for Parkinson's disease have linked two loci (MAPT and SNCA) to risk of Parkinson's disease. We aimed to identify novel risk loci for Parkinson's disease. Methods: We did a meta-analysis of datasets from five Parkinson's disease GWAS from the USA and
Piccoli, Mario L; Brito, Luiz F; Braccini, José; Cardoso, Fernando F; Sargolzaei, Mehdi; Schenkel, Flávio S
Genomic selection (GS) has played an important role in cattle breeding programs. However, genotyping prices are still a challenge for implementation of GS in beef cattle and there is still a lack of information about the use of low-density Single Nucleotide Polymorphisms (SNP) chip panels for genomic predictions in breeds such as Brazilian Braford and Hereford. Therefore, this study investigated the effect of using imputed genotypes in the accuracy of genomic predictions for twenty economically important traits in Brazilian Braford and Hereford beef cattle. Various scenarios composed by different percentages of animals with imputed genotypes and different sizes of the training population were compared. De-regressed EBVs (estimated breeding values) were used as pseudo-phenotypes in a Genomic Best Linear Unbiased Prediction (GBLUP) model using two different mimicked panels derived from the 50 K (8 K and 15 K SNP panels), which were subsequently imputed to the 50 K panel. In addition, genomic prediction accuracies generated from a 777 K SNP (imputed from the 50 K SNP) were presented as another alternate scenario. The accuracy of genomic breeding values averaged over the twenty traits ranged from 0.38 to 0.40 across the different scenarios. The average losses in expected genomic estimated breeding values (GEBV) accuracy (accuracy obtained from the inverse of the mixed model equations) relative to the true 50 K genotypes ranged from -0.0007 to -0.0012 and from -0.0002 to -0.0005 when using the 50 K imputed from the 8 K or 15 K, respectively. When using the imputed 777 K panel the average losses in expected GEBV accuracy was -0.0021. The average gain in expected EBVs accuracy by including genomic information when compared to simple BLUP was between 0.02 and 0.03 across scenarios and traits. The percentage of animals with imputed genotypes in the training population did not significantly influence the validation accuracy. However, the size of the training
Background: Survey data from low income countries on birth weight usually pose a persistent problem. The studies conducted on birth weight have acknowledged missing data on birth weight, but they are not included in the analysis. Furthermore, other missing data presented on determinants of birth weight are not addressed. Thus, this study tries to identify determinants that are associated with low birth weight (LBW) using multiple imputation to handle missing data on birth weight and its determinants. Methods: The child dataset from the Nepal Demographic and Health Survey (NDHS), 2011 was utilized in this study. A total of 5,240 children were born between 2006 and 2011, out of which 87% had at least one measured variable missing and 21% had no recorded birth weight. All the analyses were carried out in R version 3.1.3. The transform-then-impute method was applied to check for interaction between explanatory variables and imputed missing data. The survey package was applied to each imputed dataset to account for survey design and sampling method. Survey logistic regression was applied to identify the determinants associated with LBW. Results: The prevalence of LBW was 15.4% after imputation. Women with the highest autonomy on their own health compared to those with health decisions involving husband or others (adjusted odds ratio (OR) 1.87, 95% confidence interval (95% CI) = 1.31, 2.67), and husband and women together (adjusted OR 1.57, 95% CI = 1.05, 2.35) were less likely to give birth to LBW infants. Mothers using highly polluting cooking fuels (adjusted OR 1.49, 95% CI = 1.03, 2.22) were more likely to give birth to LBW infants than mothers using non-polluting cooking fuels. Conclusion: The findings of this study suggested that obtaining the prevalence of LBW from only the sample of measured birth weight and ignoring missing data results in underestimation.
Hwa, Hsiao-Lin; Wu, Lawrence Shih Hsin; Lin, Chun-Yen; Huang, Tsun-Ying; Yin, Hsiang-I; Tseng, Li-Hui; Lee, James Chun-I
Single nucleotide polymorphism (SNP) typing offers promise to forensic genetics. Various strategies and panels for analyzing SNP markers for individual identification have been published. However, the best panels with fewer identity SNPs for all major population groups are still under discussion. This study aimed to find more autosomal SNPs with high heterozygosity for individual identification among Asian populations. Ninety-six autosomal SNPs of 502 DNA samples from unrelated individuals of five population groups (208 Taiwanese Han, 83 Filipinos, 62 Thais, 69 Indonesians, and 80 individuals with European, Near Eastern, or South Asian ancestry) were analyzed using arrays in an initial screening, and 75 SNPs (group A, 46 newly selected SNPs; group B, 29 SNPs based on a previous SNP panel) were selected for further statistical analyses. Some SNPs with high heterozygosity from Asian populations were identified. The combined random match probability of the best 40 and 45 SNPs was between 3.16 × 10^-17 and 7.75 × 10^-17 and between 2.33 × 10^-19 and 7.00 × 10^-19, respectively, in all five populations. These loci offer comparable power to short tandem repeats (STRs) for routine forensic profiling. In this study, we demonstrated the population genetic characteristics and forensic parameters of 75 SNPs with high heterozygosity from five population groups. This SNP panel can provide valuable genotypic information and can be helpful in forensic casework for individual identification among these populations.
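A combined random match probability of the kind reported above is commonly obtained by multiplying per-locus match probabilities. The Python sketch below illustrates that calculation under two assumptions that are not stated in the abstract: genotype frequencies follow Hardy-Weinberg proportions, and loci are independent. The function names and allele frequencies are hypothetical.

```python
from math import prod
from typing import Iterable


def locus_match_probability(p: float) -> float:
    """Match probability at one biallelic SNP with allele frequency p.

    Assumes Hardy-Weinberg genotype frequencies p^2, 2pq and q^2; the match
    probability is the sum of the squared genotype frequencies.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("allele frequency must lie in [0, 1]")
    q = 1.0 - p
    genotype_freqs = (p * p, 2.0 * p * q, q * q)
    return sum(f * f for f in genotype_freqs)


def combined_match_probability(allele_freqs: Iterable[float]) -> float:
    """Product of per-locus match probabilities (loci assumed independent)."""
    return prod(locus_match_probability(p) for p in allele_freqs)


# Hypothetical allele frequencies for a three-SNP panel.
hypothetical_freqs = [0.5, 0.45, 0.4]
cmp_value = combined_match_probability(hypothetical_freqs)
```

Under these assumptions the per-locus value is smallest (0.375) at an allele frequency of 0.5, which is why panels of this kind favour SNPs with high heterozygosity.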
Peterson, Josh F.; Eden, Svetlana K.; Moons, Karel G.; Ikizler, T. Alp; Matheny, Michael E.
Background and objectives: Baseline creatinine (BCr) is frequently missing in AKI studies. Common surrogate estimates can misclassify AKI and adversely affect the study of related outcomes. This study examined whether multiple imputation improved accuracy of estimating missing BCr beyond current recommendations to apply assumed estimated GFR (eGFR) of 75 ml/min per 1.73 m² (eGFR 75). Design, setting, participants, & measurements: From 41,114 unique adult admissions (13,003 with and 28,111 without BCr data) at Vanderbilt University Hospital between 2006 and 2008, a propensity score model was developed to predict likelihood of missing BCr. Propensity scoring identified 6502 patients with highest likelihood of missing BCr among 13,003 patients with known BCr to simulate a “missing” data scenario while preserving actual reference BCr. Within this cohort (n=6502), the ability of various multiple-imputation approaches to estimate BCr and classify AKI were compared with that of eGFR 75. Results: All multiple-imputation methods except the basic one more closely approximated actual BCr than did eGFR 75. Total AKI misclassification was lower with multiple imputation (full multiple imputation + serum creatinine) (9.0%) than with eGFR 75 (12.3%; Pserum creatinine) (15.3%) versus eGFR 75 (40.5%; P<0.001). Multiple imputation improved specificity and positive predictive value for detecting AKI at the expense of modestly decreasing sensitivity relative to eGFR 75. Conclusions: Multiple imputation can improve accuracy in estimating missing BCr and reduce misclassification of AKI beyond currently proposed methods. PMID:23037980
Stef van Buuren
The R package mice imputes incomplete multivariate data by chained equations. The software mice 1.0 appeared in the year 2000 as an S-PLUS library, and in 2001 as an R package. mice 1.0 introduced predictor selection, passive imputation and automatic pooling. This article documents mice 2.9, which extends the functionality of mice 1.0 in several ways. In mice 2.9, the analysis of imputed data is made completely general, whereas the range of models under which pooling works is substantially extended. mice 2.9 adds new functionality for imputing multilevel data, automatic predictor selection, data handling, post-processing imputed values, specialized pooling routines, model selection tools, and diagnostic graphs. Imputation of categorical data is improved in order to bypass problems caused by perfect prediction. Special attention is paid to transformations, sum scores, indices and interactions using passive imputation, and to the proper setup of the predictor matrix. mice 2.9 can be downloaded from the Comprehensive R Archive Network. This article provides a hands-on, stepwise approach to solve applied incomplete data problems.
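Pooling of analyses across imputed datasets is commonly done with Rubin's rules, which combine the per-dataset estimates and their variances. The sketch below shows those rules for a single scalar parameter in plain Python; it does not call or reproduce the mice API, and the function name and input values are hypothetical.

```python
from statistics import mean, variance
from typing import Sequence, Tuple


def pool_rubin(estimates: Sequence[float],
               variances: Sequence[float]) -> Tuple[float, float]:
    """Pool m point estimates and their squared standard errors.

    Returns the pooled estimate and its total variance
    T = W + (1 + 1/m) * B, with W the mean within-imputation variance
    and B the between-imputation variance of the estimates.
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two estimates with matching variances")
    pooled = mean(estimates)
    within = mean(variances)
    between = variance(estimates)  # sample variance, denominator m - 1
    total = within + (1.0 + 1.0 / m) * between
    return pooled, total


# Hypothetical coefficient estimates from three imputed datasets.
est = [1.0, 1.2, 0.8]
var = [0.04, 0.05, 0.03]
pooled_estimate, total_variance = pool_rubin(est, var)
```

The between-imputation term is what distinguishes multiple from single imputation: it carries the extra uncertainty due to the missing values into the pooled standard error.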
Jiao, S; Tiezzi, F; Huang, Y; Gray, K A; Maltecca, C
Obtaining accurate individual feed intake records is the key first step in achieving genetic progress toward more efficient nutrient utilization in pigs. Feed intake records collected by electronic feeding systems contain errors (erroneous and abnormal values exceeding certain cutoff criteria), which are due to feeder malfunction or animal-feeder interaction. In this study, we examined the use of a novel data-editing strategy involving multiple imputation to minimize the impact of errors and missing values on the quality of feed intake data collected by an electronic feeding system. Accuracy of feed intake data adjustment obtained from the conventional linear mixed model (LMM) approach was compared with 2 alternative implementations of multiple imputation by chained equation, denoted as MI (multiple imputation) and MICE (multiple imputation by chained equation). The 3 methods were compared under 3 scenarios, where 5, 10, and 20% feed intake error rates were simulated. Each of the scenarios was replicated 5 times. Accuracy of the alternative error adjustment was measured as the correlation between the true daily feed intake (DFI; daily feed intake in the testing period) or true ADFI (the mean DFI across testing period) and the adjusted DFI or adjusted ADFI. In the editing process, error cutoff criteria are used to define if a feed intake visit contains errors. To investigate the possibility that the error cutoff criteria may affect any of the 3 methods, the simulation was repeated with 2 alternative error cutoff values. Multiple imputation methods outperformed the LMM approach in all scenarios with mean accuracies of 96.7, 93.5, and 90.2% obtained with MI and 96.8, 94.4, and 90.1% obtained with MICE compared with 91.0, 82.6, and 68.7% using LMM for DFI. Similar results were obtained for ADFI. Furthermore, multiple imputation methods consistently performed better than LMM regardless of the cutoff criteria applied to define errors. In conclusion, multiple imputation
Hsu, Chiu-Hsieh; Taylor, Jeremy M G; Hu, Chengcheng
We consider the situation of estimating the marginal survival distribution from censored data subject to dependent censoring using auxiliary variables. We had previously developed a nonparametric multiple imputation approach. The method used two working proportional hazards (PH) models, one for the event times and the other for the censoring times, to define a nearest neighbor imputing risk set. This risk set was then used to impute failure times for censored observations. Here, we adapt the method to the situation where the event and censoring times follow accelerated failure time models and propose to use the Buckley-James estimator as the two working models. Besides studying the performances of the proposed method, we also compare the proposed method with two popular methods for handling dependent censoring through the use of auxiliary variables, inverse probability of censoring weighted and parametric multiple imputation methods, to shed light on the use of them. In a simulation study with time-independent auxiliary variables, we show that all approaches can reduce bias due to dependent censoring. The proposed method is robust to misspecification of either one of the two working models and their link function. This indicates that a working proportional hazards model is preferred because it is more cumbersome to fit an accelerated failure time model. In contrast, the inverse probability of censoring weighted method is not robust to misspecification of the link function of the censoring time model. The parametric imputation methods rely on the specification of the event time model. The approaches are applied to a prostate cancer dataset. Copyright © 2015 John Wiley & Sons, Ltd.
Shah, Nameeta; Teplitsky, Michael V.; Pennacchio, Len A.; Hugenholtz, Philip; Hamann, Bernd; Dubchak, Inna L.
Recent advances in sequencing technologies promise better diagnostics for many diseases as well as better understanding of evolution of microbial populations. Single Nucleotide Polymorphisms (SNPs) are established genetic markers that aid in the identification of loci affecting quantitative traits and/or disease in a wide variety of eukaryotic species. With today's technological capabilities, it is possible to re-sequence a large set of appropriate candidate genes in individuals with a given disease and then screen for causative mutations. In addition, SNPs have been used extensively in efforts to study the evolution of microbial populations, and the recent application of random shotgun sequencing to environmental samples makes possible more extensive SNP analysis of co-occurring and co-evolving microbial populations. The program is available at http://genome.lbl.gov/vista/snpvista.
Background: Attention Deficit Hyperactivity Disorder (ADHD) is a prevalent neurodevelopmental disorder affecting children, adolescents, and adults. Its etiology is not well-understood, but it is increasingly believed to result from diverse pathophysiologies that affect the structure and function of specific brain circuits. Although one of the best-studied neurobiological abnormalities in ADHD is reduced fronto-striatal-cerebellar gray matter volume, its specific genetic correlates are largely unknown. Methods: In this study, T1-weighted MR images of brain structure were collected from 198 adolescents (63 ADHD-diagnosed). A multivariate parallel independent component analysis technique (Para-ICA) identified imaging-genetic relationships between regional gray matter volume and single nucleotide polymorphism data. Results: Para-ICA analyses extracted 14 components from genetic data and 9 from MR data. An iterative cross-validation using randomly-chosen sub-samples indicated acceptable stability of these ICA solutions. A series of partial correlation analyses controlling for age, sex, and ethnicity revealed that two genotype-phenotype component pairs differed significantly between ADHD and non-ADHD groups, after a Bonferroni correction for multiple comparisons. The brain phenotype component not only included structures frequently found to have abnormally low volume in previous ADHD studies, but was also significantly associated with ADHD differences in symptom severity and performance on cognitive tests frequently found to be impaired in patients diagnosed with the disorder. Pathway analysis of the genotype component identified several different biological pathways linked to these structural abnormalities in ADHD. Conclusions: Some of these pathways implicate well-known dopaminergic neurotransmission and neurodevelopment hypothesized to be abnormal in ADHD. Other more recently implicated pathways included glutamatergic and GABA-ergic physiological systems
Langley, Richard G B; Reich, Kristian; Papavassilis, Charis; Fox, Todd; Gong, Yankun; Güttner, Achim
BACKGROUND: An issue in long-term clinical trials of biologics in psoriasis is how to handle missing efficacy data. This methodological challenge may not be understood by clinicians, yet can have a significant effect on the interpretation of clinical trials. OBJECTIVE: To evaluate the effects of different data imputation methods on apparent secukinumab response rates. METHODS: Post hoc analyses were conducted on efficacy data from 2 phase III, multicenter, randomized, double-blind trials (FIXTURE and ERASURE) of secukinumab in moderate to severe plaque psoriasis. Per study protocols, missing data were imputed using strict non-response imputation (NRI), a highly conservative method that assumes non-response for all missing data. Alternative imputation methods (observed data, last observation carried forward [LOCF], modified NRI, and multiple imputation [MI]) were applied in this analysis and the resultant response rates compared. RESULTS: Response rates obtained with each imputation method diverged increasingly over 52 weeks of follow-up. Strict NRI response estimates were consistently lower than those using the other methods. At week 52, Psoriasis Area and Severity Index (PASI) 90 rates for secukinumab 300 mg based on strict NRI were 9.2% (FIXTURE) and 8.7% (ERASURE) lower than estimates obtained using the least conservative method (observed data). Estimates obtained through LOCF and modified NRI were closest to those produced by MI, currently regarded as the most methodologically sophisticated approach available. CONCLUSION: Awareness of differences in assumptions and limitations among imputation methods is necessary for well-informed interpretation of trial data. J Drugs Dermatol. 2017;16(8):734-742.
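To make the difference between the simpler rules concrete, the following is a minimal sketch of how the observed-data, strict NRI, and LOCF rules turn the same visit histories into different response rates. The function names and the five-patient toy data are hypothetical and are not taken from FIXTURE or ERASURE; modified NRI and MI are not shown, and treating a patient with no observed visit as a non-responder under LOCF is an assumption of this sketch.

```python
from typing import Optional, Sequence

Visit = Optional[bool]  # True = responder, False = non-responder, None = missing


def last_observed(history: Sequence[Visit]) -> bool:
    """LOCF: return the most recent non-missing status.

    Assumption for this sketch: a patient with no observed visit at all
    is counted as a non-responder.
    """
    for status in reversed(history):
        if status is not None:
            return status
    return False


def response_rates(histories: Sequence[Sequence[Visit]]) -> dict:
    """Response rate at the final visit under three missing-data rules.

    Assumes at least one patient has an observed final visit.
    """
    final = [h[-1] for h in histories]
    observed = [s for s in final if s is not None]
    return {
        # Observed data: patients missing the final visit are dropped.
        "observed": sum(observed) / len(observed),
        # Strict NRI: every missing final visit counts as non-response.
        "nri": sum(s is True for s in final) / len(final),
        # LOCF: a missing final visit takes the last observed status.
        "locf": sum(last_observed(h) for h in histories) / len(histories),
    }


# Hypothetical toy data: five patients, three visits each.
toy = [
    [True, True, True],
    [False, True, None],
    [False, False, False],
    [True, None, None],
    [False, False, None],
]
rates = response_rates(toy)
print(rates)
```

On any data set, strict NRI can never give a higher rate than LOCF, because it replaces every missing value with the least favourable outcome.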
Renato César Cardoso
The present article aims to analyze Arthur Schopenhauer's criticism of the postulation that freedom of the will is the condition of possibility of legal imputability. According to the philosopher, an intellectually determinable will, not an unconditioned will, is the true enabler of state imputability. In conclusion, we argue that it is with the agent's potential for change, and not with culpability, that society and the state should be concerned. This means that, according to Schopenhauer, an alternative and deterministic conception such as his, contrary to what is often said, does not compromise but enhances imputability, which is why there is nothing to fear.
The importance of lipids for cell function and health has been widely recognized, e.g., a disorder in the lipid composition of cells has been related to atherosclerosis-caused cardiovascular disease (CVD). Lipidomics analyses are characterized by a large, though not huge, number of mutually correlated measured variables, and their associations with outcomes are potentially complex. Differential network analysis provides a formal statistical method capable of inferential analysis to examine differences in network structures of the lipids under two biological conditions. It also guides us to identify potential relationships requiring further biological investigation. We provide a recipe for conducting a permutation test on association scores resulting from partial least squares regression with multiply imputed lipidomic data from the LUdwigshafen RIsk and Cardiovascular Health (LURIC) study, paying particular attention to the left-censored missing values typical for a wide range of data sets in life sciences. Left-censored missing values are low-level concentrations that are known to exist somewhere between zero and a lower limit of quantification. To make full use of the LURIC data with the missing values, we utilize state-of-the-art multiple imputation techniques and propose solutions to the challenges that incomplete data sets bring to differential network analysis. The customized network analysis helps us to understand the complexities of the underlying biological processes by identifying lipids and lipid classes that interact with each other, and by recognizing the most important differentially expressed lipids between two subgroups of coronary artery disease (CAD) patients: the patients who had a fatal CVD event and the ones who remained stable during two-year follow-up.
Age at natural menopause (ANM) is a complex trait with high heritability and is associated with several major hormonal-related diseases. Recently, several genome-wide association studies (GWAS), conducted exclusively among women of European ancestry, have discovered dozens of genetic loci influencing ANM. No study has been conducted to evaluate whether these findings can be generalized to Chinese women. We evaluated the index single nucleotide polymorphisms (SNPs) in 19 GWAS-identified genetic susceptibility loci for ANM among 3,533 Chinese women who had natural menopause. We also investigated 3 additional SNPs which were in LD with the index SNP in European-ancestry but not in Asian-ancestry populations. Two genetic risk scores (GRS) were calculated to summarize SNPs across multiple loci: one for all SNPs tested (GRSall), and one for SNPs which showed association in our study (GRSsel). All 22 SNPs showed the same association direction as previously reported. Eight SNPs were nominally statistically significant with P≤0.05: rs4246511 (RHBDL2), rs12461110 (NLRP11), rs2307449 (POLG), rs12611091 (BRSK1), rs1172822 (BRSK1), rs365132 (UIMC1), rs2720044 (ASH2L), and rs7246479 (TMEM150B). In particular, SNPs rs4246511, rs365132, rs1172822, and rs7246479 remained significant even after Bonferroni correction. Significant associations were observed for GRS. Women in the highest quartile began menopause 0.7 years (P = 3.24×10^-9) and 0.9 years (P = 4.61×10^-11) later than those in the lowest quartile for GRSsel and GRSall, respectively. Among the 22 investigated SNPs, eight showed associations with ANM (P<0.05) in our Chinese population. Results from this study extend some recent GWAS findings to the Asian-ancestry population and may guide future efforts to identify genetic determinants of menopause.
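A genetic risk score of the kind mentioned above is typically a sum of effect-allele counts across SNPs, optionally weighted by a per-SNP effect size. The abstract does not say whether GRSall and GRSsel were weighted, so the sketch below is an assumption that supports both forms; the function name, the allele counts, and the weights are hypothetical, and only the rs identifiers come from the abstract.

```python
from typing import Dict, Optional


def genetic_risk_score(
    allele_counts: Dict[str, int],
    weights: Optional[Dict[str, float]] = None,
) -> float:
    """Sum effect-allele counts (0, 1 or 2 per SNP), optionally weighted.

    With weights=None every SNP contributes equally (an unweighted score).
    """
    for snp, count in allele_counts.items():
        if count not in (0, 1, 2):
            raise ValueError(f"{snp}: allele count must be 0, 1 or 2")
    if weights is None:
        weights = {snp: 1.0 for snp in allele_counts}
    return sum(weights[snp] * count for snp, count in allele_counts.items())


# Hypothetical genotypes for one woman at three of the reported SNPs.
woman = {"rs4246511": 2, "rs365132": 1, "rs1172822": 0}
unweighted = genetic_risk_score(woman)

# Hypothetical per-allele weights (illustrative values only).
example_weights = {"rs4246511": 0.5, "rs365132": 0.25, "rs1172822": 0.75}
weighted = genetic_risk_score(woman, example_weights)
print(unweighted, weighted)
```

Scores computed this way for every participant can then be split into quartiles, which is the comparison the abstract reports.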
Montealegre, Jane R; Zhou, Renke; Amirian, E Susan; Scheurer, Michael E
Although birthplace data are routinely collected in the participating Surveillance, Epidemiology, and End Results (SEER) registries, such data are missing in a nonrandom manner for a large percentage of cases. This hinders analysis of nativity-related cancer disparities. In the current study, the authors evaluated multiple imputation of nativity status among Hispanic patients diagnosed with cervical, prostate, and colorectal cancer and demonstrated the effect of multiple imputation on apparent nativity disparities in survival. Multiple imputation by logistic regression was used to generate nativity values (US-born vs foreign-born) using a priori-defined variables. The accuracy of the method was evaluated among a subset of cases. Kaplan-Meier curves were used to illustrate the effect of imputation by comparing survival among US-born and foreign-born Hispanics, with and without imputation of nativity. Birthplace was missing for 31%, 49%, and 39%, respectively, of cases of cervical, prostate, and colorectal cancer. The sensitivity of the imputation strategy for detecting foreign-born status was ≥90% and the specificity was ≥86%. The agreement between the true and imputed values was ≥0.80 and the misclassification error was ≤10%. Kaplan-Meier survival curves indicated different associations between nativity and survival when nativity was imputed versus when cases with missing birthplace were omitted from the analysis. Multiple imputation using variables available in the SEER data file can be used to accurately detect foreign-born status. This simple strategy may help researchers to disaggregate analyses by nativity and uncover important nativity disparities in regard to cancer diagnosis, treatment, and survival. © 2013 American Cancer Society.
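The accuracy measures quoted above (sensitivity, specificity, misclassification error) can be computed by comparing imputed labels with known labels in a validation subset. The sketch below shows that comparison under stated assumptions: the function name and the ten-case toy data are hypothetical, foreign-born is treated as the positive class, and the agreement statistic reported in the abstract is not reproduced because the abstract does not name it.

```python
from typing import Dict, Sequence


def imputation_accuracy(
    true: Sequence[str],
    imputed: Sequence[str],
    positive: str = "foreign-born",
) -> Dict[str, float]:
    """Compare imputed labels with true labels for a binary variable."""
    pairs = list(zip(true, imputed))
    tp = sum(t == positive and i == positive for t, i in pairs)
    fn = sum(t == positive and i != positive for t, i in pairs)
    tn = sum(t != positive and i != positive for t, i in pairs)
    fp = sum(t != positive and i == positive for t, i in pairs)
    return {
        "sensitivity": tp / (tp + fn),        # foreign-born correctly detected
        "specificity": tn / (tn + fp),        # US-born correctly detected
        "misclassification": (fp + fn) / len(pairs),
    }


# Hypothetical validation subset of ten cases with known birthplace.
F, U = "foreign-born", "US-born"
true_labels = [F, F, F, F, U, U, U, U, U, U]
imputed_labels = [F, F, F, U, U, U, U, U, U, F]
accuracy = imputation_accuracy(true_labels, imputed_labels)
print(accuracy)
```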
de Jong, Roel; van Buuren, Stef; Spiess, Martin
The sensitivity of multiple imputation methods to deviations from their distributional assumptions is investigated using simulations, where the parameters of scientific interest are the coefficients of a linear regression model, and values in predictor variables are missing at random. The
Genome-wide association studies (GWAS) have demonstrated the ability to identify the strongest causal common variants in complex human diseases. However, to date, the massive data generated from GWAS have not been maximally explored to identify true associations that fail to meet the stringent level of association required to achieve genome-wide significance. Genetics of gene expression (GGE) studies have shown promise towards identifying DNA variations associated with disease and providing a path to functionally characterize findings from GWAS. Here, we present the first empiric study to systematically characterize the set of single nucleotide polymorphisms associated with expression (eSNPs) in liver, subcutaneous fat, and omental fat tissues, demonstrating these eSNPs are significantly more enriched for SNPs that associate with type 2 diabetes (T2D) in three large-scale GWAS than a matched set of randomly selected SNPs. This enrichment for T2D association increases as we restrict to eSNPs that correspond to genes comprising gene networks constructed from adipose gene expression data isolated from a mouse population segregating a T2D phenotype. Finally, by restricting to eSNPs corresponding to genes comprising an adipose subnetwork strongly predicted as causal for T2D, we dramatically increased the enrichment for SNPs associated with T2D and were able to identify a functionally related set of diabetes susceptibility genes. We identified and validated malic enzyme 1 (Me1) as a key regulator of this T2D subnetwork in mouse and provided support for the association of this gene to T2D in humans. This integration of eSNPs and networks provides a novel approach to identify disease susceptibility networks rather than the single SNPs or genes traditionally identified through GWAS, thereby extracting additional value from the wealth of data currently being generated by GWAS.
Background: Mitochondrial single nucleotide polymorphisms (mtSNPs) constitute important data when trying to shed some light on human diseases and cancers. Unfortunately, providing relevant mtSNP genotyping information in mtDNA databases in a neatly organized and transparent visual manner still remains a challenge. Amongst the many methods reported for SNP genotyping, determining the restriction fragment length polymorphisms (RFLPs) is still one of the most convenient and cost-saving methods. In this study, we prepared a visualization of the mtDNA genome that integrates the RFLP genotyping information with mitochondria-related cancers and diseases in a user-friendly, intuitive and interactive manner. The inherent problem associated with mtDNA sequences in BLAST of the NCBI database was also solved. Description: V-MitoSNP provides complete mtSNP information for four different kinds of inputs: (1) color-coded visual input by selecting genes of interest on the genome graph, (2) keyword search by locus, disease and mtSNP rs# ID, (3) visualized input of nucleotide range by clicking the selected region of the mtDNA sequence, and (4) sequences mtBLAST. The V-MitoSNP output provides 500 bp (base pairs) flanking sequences for each SNP coupled with the RFLP enzyme and the corresponding natural or mismatched primer sets. The output format enables users to see the SNP genotype pattern of the RFLP by virtual electrophoresis of each mtSNP. The rate of successful design of enzymes and primers for RFLPs in all mtSNPs was 99.1%. The RFLP information was validated by actual agarose electrophoresis and showed successful results for all mtSNPs tested. The mtBLAST function in V-MitoSNP provides the gene information within the input sequence rather than providing the complete mitochondrial chromosome as in the NCBI BLAST database. All mtSNPs with rs number entries in NCBI are integrated in the corresponding SNP in V-MitoSNP. Conclusion: V-MitoSNP is a web
Single nucleotide polymorphisms (SNPs) constitute an important mode of genetic variation observed in the human genome. A small fraction of SNPs, about four thousand out of the ten million, has been associated with genetic disorders and complex diseases. The present study focuses on SNPs that fall on protein domains, 3D structures that facilitate connectivity of proteins in cell signaling and metabolic pathways. We scanned the human proteome using the PROSITE web tool and identified proteins with SNP-containing domains. We showed that SNPs that fall on protein domains are highly statistically enriched among SNPs linked to hereditary disorders and complex diseases. Proteins whose domains are dramatically altered by the presence of an SNP are even more likely to be present among proteins linked to hereditary disorders. Proteins with domain-altering SNPs comprise highly connected nodes in cellular pathways such as the focal adhesion, the axon guidance pathway and the autoimmune disease pathways. Statistical enrichment of domain/motif signatures in interacting protein pairs indicates extensive loss of connectivity of cell signaling pathways due to domain-altering SNPs, potentially leading to hereditary disorders.
Single-nucleotide polymorphisms (SNPs) have been emerging out of the efforts to research human diseases and ethnic disparities. A semantic network is needed for in-depth understanding of the impacts of SNPs, because phenotypes are modulated by complex networks, including biochemical and physiological pathways. We identified ethnicity-specific SNPs by eliminating overlapped SNPs from HapMap samples, and the ethnicity-specific SNPs were mapped to the UCSC RefGene lists. Ethnicity-specific genes were identified as follows: 22 genes in the USA (CEU) individuals, 25 genes in the Japanese (JPT) individuals, and 332 genes in the African (YRI) individuals. To analyze the biologically functional implications for ethnicity-specific SNPs, we focused on constructing a semantic network model. Entities for the network represented by "Gene," "Pathway," "Disease," "Chemical," "Drug," "ClinicalTrials," "SNP," and relationships between entities were obtained through curation. Our semantic modeling for ethnicity-specific SNPs showed interesting results in the three categories, including three diseases ("AIDS-associated nephropathy," "Hypertension," and "Pelvic infection"), one drug ("Methylphenidate"), and five pathways ("Hemostasis," "Systemic lupus erythematosus," "Prostate cancer," "Hepatitis C virus," and "Rheumatoid arthritis"). We found ethnicity-specific genes using the semantic modeling, and the majority of our findings were consistent with previous studies, in that an understanding of genetic variability explained ethnicity-specific disparities.
Andreassen, C N; Sørensen, Flemming Brandt; Overgaard, J
only archival specimens are available. This study was conducted to validate protocols optimised for assessment of SNPs based on paraffin embedded, formalin fixed tissue samples. PATIENTS AND METHODS: In 137 breast cancer patients, three TGFB1 SNPs were assessed based on archival histological specimens...... precipitation). RESULTS: Assessment of SNPs based on archival histological material is encumbered by a number of obstacles and pitfalls. However, these can be widely overcome by careful optimisation of the methods used for sample selection, DNA extraction and PCR. Within 130 samples that fulfil the criteria...
Han, Lee [University of Tennessee, Knoxville (UTK); Chin, Shih-Miao [ORNL; Hwang, Ho-Ling [ORNL
Along with the rapid development of Intelligent Transportation Systems (ITS), traffic data collection technologies have been evolving dramatically. The emergence of innovative data collection technologies such as Remote Traffic Microwave Sensor (RTMS), Bluetooth sensor, GPS-based Floating Car method, automated license plate recognition (ALPR) (1), etc., creates an explosion of traffic data, which brings transportation engineering into the new era of Big Data. However, despite the advance of technologies, the missing data issue is still inevitable and has posed great challenges for research such as traffic forecasting, real-time incident detection and management, dynamic route guidance, and massive evacuation optimization, because the degree of success of these endeavors depends on the timely availability of relatively complete and reasonably accurate traffic data. A thorough literature review suggests most current imputation models, if not all, focus largely on the temporal nature of the traffic data and fail to consider the fact that traffic stream characteristics at a certain location are closely related to those at neighboring locations and utilize these correlations for data imputation. To this end, this paper presents a Kriging based spatiotemporal data imputation approach that is able to fully utilize the spatiotemporal information underlying in traffic data. Imputation performance of the proposed approach was tested using simulated scenarios and achieved stable imputation accuracy. Moreover, the proposed Kriging imputation model is more flexible compared to current models.
In this paper, we present WebPut, a prototype system that adopts a novel web-based approach to the data imputation problem. Towards this, WebPut utilizes the available information in an incomplete database in conjunction with the data consistency principle. Moreover, WebPut extends effective Information Extraction (IE) methods for the purpose of formulating web search queries that are capable of effectively retrieving missing values with high accuracy. WebPut employs a confidence-based scheme that efficiently leverages our suite of data imputation queries to automatically select the most effective imputation query for each missing value. A greedy iterative algorithm is proposed to schedule the imputation order of the different missing values in a database, and in turn the issuing of their corresponding imputation queries, for improving the accuracy and efficiency of WebPut. Moreover, several optimization techniques are also proposed to reduce the cost of estimating the confidence of imputation queries at both the tuple-level and the database-level. Experiments based on several real-world data collections demonstrate not only the effectiveness of WebPut compared to existing approaches, but also the efficiency of our proposed algorithms and optimization techniques. © 2013 Springer Science+Business Media New York.
Masoodi, Tariq Ahmad; Al Shammari, Sulaiman A; Al-Muammar, May N; Alhamdan, Adel A
Introduction. Apolipoprotein E (APOE) is an important risk factor for Alzheimer's disease (AD) and is present in 30-50% of patients who develop late-onset AD. Several single-nucleotide polymorphisms (SNPs) are present in the APOE gene and act as biomarkers for exploring the genetic basis of this disease. The objective of this study is to identify deleterious nsSNPs associated with the APOE gene. Methods. The SNPs were retrieved from dbSNP. Using I-Mutant, protein stability change was calculated. The potentially functional nonsynonymous (ns) SNPs and their effect on protein were predicted by PolyPhen and SIFT, respectively. FASTSNP was used for functional analysis and estimation of risk score. The functional impact on the APOE protein was evaluated by using Swiss PDB viewer and NOMAD-Ref server. Results. Six nsSNPs were found to be least stable by I-Mutant 2.0 with DDG value of >-1.0. Four nsSNPs showed a highly deleterious tolerance index score of 0.00. Nine nsSNPs were found to be probably damaging with position-specific independent counts (PSICs) score of ≥2.0. Seven nsSNPs were found to be highly polymorphic with a risk score of 3-4. The total energies and root-mean-square deviation (RMSD) values were higher for three mutant-type structures compared to the native modeled structure. Conclusion. We concluded that three nsSNPs, namely rs11542041, rs11542040, and rs11542034, are potentially functional polymorphisms.
Kullo Iftikhar J
Background: We hypothesized that the frequencies of risk alleles of SNPs mediating susceptibility to cardiovascular diseases differ among populations of varying geographic origin and that population-specific selection has operated on some of these variants. Methods: From the database of genome-wide association studies (GWAS), we selected 36 cardiovascular phenotypes including coronary heart disease, hypertension, and stroke, as well as related quantitative traits (e.g., body mass index and plasma lipid levels). We identified 292 SNPs in 270 genes associated with a disease or trait at P -8. As part of the Human Genome-Diversity Project (HGDP), 158 (54.1%) of these SNPs have been genotyped in 938 individuals belonging to 52 populations from seven geographic areas. A measure of population differentiation, FST, was calculated to quantify differences in risk allele frequencies (RAFs) among populations and geographic areas. Results: Large differences in RAFs were noted in populations of Africa, East Asia, America and Oceania, when compared with other geographic regions. The mean global FST (0.1042) for 158 SNPs among the populations was not significantly higher than the mean global FST of 158 autosomal SNPs randomly sampled from the HGDP database. Significantly higher global FST (P FST of 2036 putatively neutral SNPs. For four of these SNPs, additional evidence of selection was noted based on the integrated Haplotype Score. Conclusion: Large differences in RAFs for a set of common SNPs that influence risk of cardiovascular disease were noted between the major world populations. Pairwise comparisons revealed RAF differences for at least eight SNPs that might be due to population-specific selection or demographic factors. These findings are relevant to a better understanding of geographic variation in the prevalence of cardiovascular disease.
Ondeck, Nathaniel T; Fu, Michael C; Skrip, Laura A; McLynn, Ryan P; Cui, Jonathan J; Basques, Bryce A; Albert, Todd J; Grauer, Jonathan N
The presence of missing data is a limitation of large datasets, including the National Surgical Quality Improvement Program (NSQIP). In addressing this issue, most studies utilize complete case analysis, which excludes cases with missing data, thus potentially introducing selection bias. Multiple imputation, a statistically rigorous approach that approximates missing data and preserves sample size, may be an improvement over complete case analysis. To evaluate the impact of using multiple imputation in comparison to complete case analysis for assessing the associations between preoperative laboratory values and adverse outcomes following anterior cervical discectomy and fusion (ACDF) procedures. Retrospective review of prospectively collected data. PATIENT SAMPLE: Patients undergoing one-level ACDF were identified in NSQIP 2012-2015. Perioperative adverse outcome variables assessed included the occurrence of any adverse event, severe adverse events, and hospital readmission. Missing preoperative albumin and hematocrit values were handled using complete case analysis and multiple imputation. These preoperative laboratory levels were then tested for associations with 30-day postoperative outcomes using logistic regression. A total of 11,999 patients were included. Of this cohort, 63.5% of patients were missing preoperative albumin and 9.9% were missing preoperative hematocrit. When utilizing complete case analysis, only 4,311 patients were studied. The removed patients were significantly younger, healthier, of a common BMI and male. Logistic regression analysis failed to identify either preoperative hypoalbuminemia or preoperative anemia as significantly associated with adverse outcomes. When employing multiple imputation, all 11,999 patients were included. Preoperative hypoalbuminemia was significantly associated with the occurrence of any adverse event and severe adverse events. Preoperative anemia was significantly associated with the occurrence of any adverse
Wahl, Simone; Boulesteix, Anne-Laure; Zierer, Astrid; Thorand, Barbara; van de Wiel, Mark A
Missing values are a frequent issue in human studies. In many situations, multiple imputation (MI) is an appropriate missing data handling strategy, whereby missing values are imputed multiple times, the analysis is performed in every imputed data set, and the obtained estimates are pooled. If the aim is to estimate (added) predictive performance measures, such as (change in) the area under the receiver-operating characteristic curve (AUC), internal validation strategies become desirable in order to correct for optimism. It is not fully understood how internal validation should be combined with multiple imputation. In a comprehensive simulation study and in a real data set based on blood markers as predictors for mortality, we compare three combination strategies: Val-MI, internal validation followed by MI on the training and test parts separately, MI-Val, MI on the full data set followed by internal validation, and MI(-y)-Val, MI on the full data set omitting the outcome followed by internal validation. Different validation strategies, including bootstrap and cross-validation, different (added) performance measures, and various data characteristics are considered, and the strategies are evaluated with regard to bias and mean squared error of the obtained performance estimates. In addition, we elaborate on the number of resamples and imputations to be used, and adapt a strategy for confidence interval construction to incomplete data. Internal validation is essential in order to avoid optimism, with the bootstrap 0.632+ estimate representing a reliable method to correct for optimism. While estimates obtained by MI-Val are optimistically biased, those obtained by MI(-y)-Val tend to be pessimistic in the presence of a true underlying effect. Val-MI provides largely unbiased estimates, with a slight pessimistic bias with increasing true effect size, number of covariates and decreasing sample size. In Val-MI, accuracy of the estimate is more strongly improved by
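The "impute, analyse, pool" cycle described at the start of this abstract is commonly completed with Rubin's rules, which combine the per-imputation estimates and their variances. The abstract does not state which pooling rule the authors used, so the following is only a sketch of that common choice; the function name and the three example estimates are hypothetical, and pooling a bounded measure such as the AUC on its raw scale is a simplification.

```python
from statistics import mean, variance
from typing import Sequence, Tuple


def pool_rubin(
    estimates: Sequence[float], variances: Sequence[float]
) -> Tuple[float, float]:
    """Pool m per-imputation estimates with Rubin's rules.

    Returns the pooled estimate and its total variance
    T = W + (1 + 1/m) * B, where W is the mean within-imputation
    variance and B the between-imputation sample variance.
    Requires at least two imputations.
    """
    m = len(estimates)
    pooled = mean(estimates)
    within = mean(variances)
    between = variance(estimates)  # sample variance, divisor m - 1
    total = within + (1 + 1 / m) * between
    return pooled, total


# Hypothetical estimates from three imputed data sets.
pooled_estimate, total_variance = pool_rubin(
    estimates=[0.70, 0.74, 0.72],
    variances=[0.0004, 0.0004, 0.0004],
)
print(pooled_estimate, total_variance)
```

The between-imputation term is what distinguishes MI from single imputation: it widens the pooled variance to reflect uncertainty about the missing values themselves.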
Luo, Qingwei; Egger, Sam; Yu, Xue Qin; Smith, David P; O'Connell, Dianne L
The multiple imputation approach to missing data has been validated by a number of simulation studies by artificially inducing missingness on fully observed stage data under a pre-specified missing data mechanism. However, the validity of multiple imputation has not yet been assessed using real data. The objective of this study was to assess the validity of using multiple imputation for "unknown" prostate cancer stage recorded in the New South Wales Cancer Registry (NSWCR) in real-world conditions. Data from the population-based cohort study NSW Prostate Cancer Care and Outcomes Study (PCOS) were linked to 2000-2002 NSWCR data. For cases with "unknown" NSWCR stage, PCOS-stage was extracted from clinical notes. Logistic regression was used to evaluate the missing at random assumption adjusted for variables from two imputation models: a basic model including NSWCR variables only and an enhanced model including the same NSWCR variables together with PCOS primary treatment. Cox regression was used to evaluate the performance of MI. Of the 1864 prostate cancer cases 32.7% were recorded as having "unknown" NSWCR stage. The missing at random assumption was satisfied when the logistic regression included the variables included in the enhanced model, but not those in the basic model only. The Cox models using data with imputed stage from either imputation model provided generally similar estimated hazard ratios but with wider confidence intervals compared with those derived from analysis of the data with PCOS-stage. However, the complete-case analysis of the data provided a considerably higher estimated hazard ratio for the low socio-economic status group and rural areas in comparison with those obtained from all other datasets. Using MI to deal with "unknown" stage data recorded in a population-based cancer registry appears to provide valid estimates. We would recommend a cautious approach to the use of this method elsewhere.
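The wider confidence intervals reported for the imputed-stage analyses are what the standard pooling step of MI (Rubin's rules) produces, because the spread of the estimates across imputations is added to the average within-imputation variance. The following is a minimal sketch of that pooling step; the function name `pool_rubin` and the numbers are hypothetical illustrations, not values from the study.

```python
import math
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool m point estimates and their squared standard errors with Rubin's rules."""
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations with matching variances")
    q_bar = mean(estimates)      # pooled point estimate
    w = mean(variances)          # average within-imputation variance
    b = variance(estimates)      # between-imputation variance (divisor m - 1)
    t = w + (1 + 1 / m) * b      # total variance
    return q_bar, math.sqrt(t)


# Hypothetical log hazard ratios and squared standard errors from m = 5 imputed data sets.
log_hr = [0.40, 0.45, 0.38, 0.50, 0.42]
se2 = [0.010, 0.011, 0.009, 0.012, 0.010]
pooled, se = pool_rubin(log_hr, se2)
```

The pooled standard error is never smaller than the square root of the average within-imputation variance, which is one way to see why imputed analyses carry wider intervals than an analysis of fully observed data.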
Yoke, Chin Wan; Khalid, Zarina Mohd
In any continual data-collection process, missing records are a major problem in real applications. They often arise from carelessness on the part of the recorder or from a lack of awareness of the importance of data documentation. In this study, a random-effects analysis with a missing covariate is presented, using data simulated from a proposed algorithm. The improved simulation method uses a first-order autoregressive (AR(1)) process to model the correlation between a subject's measurements across two time sequences. Complete-case analysis and multiple imputation are compared as estimation procedures. The study shows that multiple imputation yields estimates that fit the data well not only when the data are missing completely at random (MCAR) but also when they are missing at random (MAR). Complete-case analysis, by contrast, yields estimators that fit well only when the data are MCAR.
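A simulation of this kind can be sketched as follows. This is an illustration under stated assumptions, not the authors' algorithm: the function names, the parameter values, and the specific MAR rule (missingness of the covariate depending on the first observed response) are all hypothetical.

```python
import math
import random


def simulate_subject(rng, n_times=2, rho=0.6, sigma_b=1.0, sigma_e=1.0, beta=0.5):
    """One subject: covariate x, random intercept b, AR(1) within-subject errors."""
    x = rng.gauss(0.0, 1.0)
    b = rng.gauss(0.0, sigma_b)
    e = rng.gauss(0.0, sigma_e)  # start the error process in its stationary distribution
    ys = []
    for _ in range(n_times):
        ys.append(beta * x + b + e)
        # AR(1) update; the innovation is scaled so the error variance stays sigma_e ** 2
        e = rho * e + rng.gauss(0.0, sigma_e * math.sqrt(1.0 - rho ** 2))
    return x, ys


def delete_covariate(rng, subjects, mechanism, rate=0.3):
    """Set x to None under MCAR (constant probability) or MAR (depends on observed y at time 1)."""
    intercept = math.log(rate / (1.0 - rate))
    out = []
    for x, ys in subjects:
        if mechanism == "MCAR":
            p = rate
        elif mechanism == "MAR":
            p = 1.0 / (1.0 + math.exp(-(intercept + ys[0])))
        else:
            raise ValueError("mechanism must be 'MCAR' or 'MAR'")
        out.append((None if rng.random() < p else x, ys))
    return out


rng = random.Random(1)
subjects = [simulate_subject(rng) for _ in range(2000)]
mcar_data = delete_covariate(rng, subjects, "MCAR")
mar_data = delete_covariate(rng, subjects, "MAR")
```

Under the MAR rule the subjects who lose their covariate have systematically higher first responses, so the complete cases are no longer a random subsample. That is the situation in which complete-case analysis can be biased while a suitable imputation model remains valid.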
Genomic prediction has been widely used in dairy cattle breeding. Genotype imputation is a key procedure to efficiently utilize marker data from different chips and obtain high density marker data at minimal cost. This thesis investigated methods and strategies to genotype imputation...... for improving genomic prediction. The results indicate that IMPUTE2 and Beagle are accurate imputation methods, while Fimpute is a good alternative for routine imputation with large data sets. Genotypes of non-genotyped animals can be accurately imputed if they have genotyped progenies. A combined reference...
Gláucia Tatiana Ferrari
Time series from weather stations in Brazil contain missing data, outliers and spurious zeroes. In order to use this dataset in risk and meteorological studies, one should take into account alternative methodologies to deal with these problems. This article describes the statistical imputation and quality control procedures applied to a database of daily precipitation from meteorological stations located in the State of Parana, Brazil. After imputation, the data went through a process of quality control to identify possible errors, such as identical precipitation over seven consecutive days and precipitation values that differ significantly from the values in neighboring weather stations. Next, we used the extreme value theory to model agricultural drought, considering the maximum number of consecutive days with precipitation below 7 mm for the period between January and February, in the main soybean agricultural regions in the State of Parana.
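Two of the quantities described above are simple to compute from a daily series: runs of identical precipitation lasting seven days, and the longest spell of days below 7 mm. The sketch below is illustrative only. The function names are hypothetical, and excluding runs of zeros from the identical-value check is an assumption (dry spells are plausible), not something the abstract states.

```python
def flag_identical_runs(precip, run_length=7):
    """Return start indices of runs of at least run_length identical nonzero daily values."""
    flagged = []
    start = 0
    for i in range(1, len(precip) + 1):
        # A run ends at the end of the series or when the value changes.
        if i == len(precip) or precip[i] != precip[start]:
            if i - start >= run_length and precip[start] > 0:
                flagged.append(start)
            start = i
    return flagged


def longest_dry_spell(precip, threshold=7.0):
    """Maximum number of consecutive days with precipitation below threshold (in mm)."""
    longest = current = 0
    for value in precip:
        current = current + 1 if value < threshold else 0
        longest = max(longest, current)
    return longest


# Hypothetical daily series in mm: seven identical values of 3.4 start at index 2.
series = [0.0, 12.0] + [3.4] * 7 + [20.0, 1.0, 0.0, 8.0]
suspect_runs = flag_identical_runs(series)
dry_days = longest_dry_spell(series)
```

The per-season maxima returned by `longest_dry_spell` are the kind of block maxima that an extreme value model would then be fitted to.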
Background: Polymorphisms in genes of regulating enzymes, transporters and receptors of the neurotransmitters of the central nervous system have been associated with altered behaviour, and single nucleotide polymorphisms (SNPs) represent the most frequent type of genetic variation. The serotonin and dopamine signalling systems have a central influence on different behavioural phenotypes, both of invertebrates and vertebrates, and this study was undertaken in order to explore genetic variation that may be associated with variation in behaviour. Results: Single nucleotide polymorphisms in canine genes related to behaviour were identified by individually sequencing eight dogs (Canis familiaris) of different breeds. Eighteen genes from the dopamine and the serotonin systems were screened, revealing 34 SNPs distributed in 14 of the 18 selected genes. A total of 24,895 bp coding sequence was sequenced yielding an average frequency of one SNP per 732 bp (1/732). A total of 11 non-synonymous SNPs (nsSNPs), which may be involved in alteration of protein function, were detected. Of these 11 nsSNPs, six resulted in a substitution of amino acid residue with concomitant change in structural parameters. Conclusion: We have identified a number of coding SNPs in behaviour-related genes, several of which change the amino acids of the proteins. Some of the canine SNPs exist in codons that are evolutionary conserved between five compared species, and predictions indicate that they may have a functional effect on the protein. The reported coding SNP frequency of the studied genes falls within the range of SNP frequencies reported earlier in the dog and other mammalian species. Novel SNPs are presented and the results show a significant genetic variation in expressed sequences in this group of genes. The results can contribute to an improved understanding of the genetics of behaviour.
Robbins Michael W.
Missing values present a prevalent problem in the analysis of establishment survey data. Multivariate imputation algorithms (which are used to fill in missing observations) tend to have the common limitation that imputations for continuous variables are sampled from Gaussian distributions. This limitation is addressed here through the use of robust marginal transformations. Specifically, kernel-density and empirical distribution-type transformations are discussed and are shown to have favorable properties when used for imputation of complex survey data. Although such techniques have wide applicability (i.e., they may be easily applied in conjunction with a wide array of imputation techniques), the proposed methodology is applied here with an algorithm for imputation in the USDA’s Agricultural Resource Management Survey. Data analysis and simulation results are used to illustrate the specific advantages of the robust methods when compared to the fully parametric techniques and to other relevant techniques such as predictive mean matching. To summarize, transformations based upon parametric densities are shown to distort several data characteristics in circumstances where the parametric model is ill fit; however, no circumstances are found in which the transformations based upon parametric models outperform the nonparametric transformations. As a result, the transformation based upon the empirical distribution (which is the most computationally efficient) is recommended over the other transformation procedures in practice.
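The idea of an empirical-distribution transformation can be illustrated with a generic rank-based normal-score transform: observed values are mapped to a Gaussian scale before imputation, and imputed Gaussian values are mapped back through the empirical quantile function. The sketch below is an assumption about the general technique, not the paper's exact procedure; the function names and the `rank / (n + 1)` plotting position are hypothetical choices.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def to_normal_scores(observed):
    """Map observed values to Gaussian scores via their empirical CDF ranks."""
    n = len(observed)
    order = sorted(range(n), key=lambda i: observed[i])
    scores = [0.0] * n
    for rank, i in enumerate(order, start=1):
        # rank / (n + 1) stays strictly inside (0, 1), so inv_cdf is always defined
        scores[i] = _STD_NORMAL.inv_cdf(rank / (n + 1))
    return scores


def from_normal_score(z, observed):
    """Back-transform a Gaussian-scale value by interpolating the empirical quantile function."""
    xs = sorted(observed)
    n = len(xs)
    pos = _STD_NORMAL.cdf(z) * (n + 1) - 1  # fractional index into the sorted data
    if pos <= 0:
        return xs[0]
    if pos >= n - 1:
        return xs[-1]
    lo = int(pos)
    return xs[lo] + (pos - lo) * (xs[lo + 1] - xs[lo])


# Hypothetical skewed survey variable.
data = [0.1, 0.5, 0.2, 9.0, 1.3]
scores = to_normal_scores(data)
back = [from_normal_score(z, data) for z in scores]
```

Because the back-transform only interpolates between observed values, imputations drawn on the Gaussian scale cannot land outside the observed range, which is one reason such transformations preserve skewed marginal distributions better than sampling directly from a Gaussian.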
Hsin Y. Tsai
Understanding the relationship between genetic variants and traits of economic importance in aquaculture species is pertinent to selective breeding programmes. High-throughput sequencing technologies have enabled the discovery of large numbers of SNPs in Atlantic salmon, and high density SNP arrays now exist. A previous genome-wide association study (GWAS) using a high density SNP array (132K SNPs) has revealed the polygenic nature of early growth traits in salmon, but has also identified candidate SNPs showing suggestive associations with these traits. The aim of this study was to test the association of the candidate growth-associated SNPs in a separate population of farmed Atlantic salmon to verify their effects. Identifying SNP-trait associations in two populations provides evidence that the associations are true and robust. Using a large cohort (N = 1152), we successfully genotyped eight candidate SNPs from the previous GWAS, two of which were significantly associated with several growth and fillet traits measured at harvest. The genes proximal to these SNPs were identified by alignment to the salmon reference genome and are discussed in the context of their potential role in underpinning genetic variation in salmon growth.
Kim, Kyoung-Nam; Lee, Mee-Ri; Lim, Youn-Hee; Hong, Yun-Chul
Homocysteine has been causally associated with various adverse health outcomes. Evidence supporting the relationship between lead and homocysteine levels has been accumulating, but most prior studies have not focused on the interaction with genetic polymorphisms. From a community-based prospective cohort, we analysed 386 participants (aged 41-71 years) with information regarding blood lead and plasma homocysteine levels. Blood lead levels were measured between 2001 and 2003, and plasma homocysteine levels were measured in 2007. Interactions of lead levels with 42 genotyped single-nucleotide polymorphisms (SNPs) in five genes (TF, HFE, CBS, BHMT and MTR) were assessed via a 2-degree of freedom (df) joint test and a 1-df interaction test. In secondary analyses using imputation, we further assessed 58 imputed SNPs in the TF and MTHFR genes. Blood lead concentrations were positively associated with plasma homocysteine levels (p=0.0276). Six SNPs in the TF and MTR genes were screened using the 2-df joint test, and among them, three SNPs in the TF gene showed interactions with lead with respect to homocysteine levels through the 1-df interaction test (phomocysteine levels at an α-level of 0.05, but the associations did not persist after Bonferroni correction. These SNPs did not show interactions with lead levels. Blood lead levels were positively associated with plasma homocysteine levels measured 4-6 years later, and three SNPs in the TF gene modified the association. © Article author(s) (or their employer(s) unless otherwise stated in the text of the article) 2017. All rights reserved. No commercial use is permitted unless otherwise expressly granted.
We aimed to assess whether pri-miRNA SNPs (miSNPs) could influence monocyte gene expression, either through marginal association or by interacting with polymorphisms located in 3'UTR regions (3utrSNPs). We then conducted a genome-wide search for marginal miSNPs effects and pairwise miSNPs × 3utrSNPs interactions in a sample of 1,467 individuals for which genome-wide monocyte expression and genotype data were available. Statistical associations that survived multiple testing correction were tested for replication in an independent sample of 758 individuals with both monocyte gene expression and genotype data. In both studies, the hsa-mir-1279 rs1463335 was found to modulate in cis the expression of LYZ and in trans the expression of CNTN6, CTRC, COPZ2, KRT9, LRRFIP1, NOD1, PCDHA6, ST5 and TRAF3IP2 genes, supporting the role of hsa-mir-1279 as a regulator of several genes in monocytes. In addition, we identified two robust miSNPs × 3utrSNPs interactions, one involving HLA-DPB1 rs1042448 and hsa-mir-219-1 rs107822, the second the H1F0 rs1894644 and hsa-mir-659 rs5750504, modulating the expression of the associated genes. As some of the aforementioned genes have previously been reported to reside at disease-associated loci, our findings provide novel arguments supporting the hypothesis that the genetic variability of miRNAs could also contribute to the susceptibility to human diseases.
Ramus, S.J.; Vierkant, R.A.; Johnatty, S.E.
. A marginally significant association was found for RB1 when all studies were included [ordinal odds ratio (OR) 0.88 (95% confidence interval (CI) 0.79-1.00) p = 0.041 and dominant OR 0.87 (95% CI 0.76-0.98) p = 0.025]; when the studies that originally suggested an association were excluded, the result......The Ovarian Cancer Association Consortium selected 7 candidate single nucleotide polymorphisms (SNPs), for which there is evidence from previous studies of an association with variation in ovarian cancer or breast cancer risks. The SNPs selected for analysis were F31I (rs2273535) in AURKA, N372H...... (rs144848) in BRCA2, rs2854344 in intron 17 of RB1, rs2811712 5' flanking CDKN2A, rs523349 in the 3' UTR of SRD5A2, D302H (rs1045485) in CASP8 and L10P (rs1982073) in TGFB1. Fourteen studies genotyped 4,624 invasive epithelial ovarian cancer cases and 8,113 controls of white non-Hispanic origin...
Frederiksen, Brittni N.; Steck, Andrea K.; Lamb, Molly M.; Rewers, Marian; Norris, Jill M.
Previously, we examined 20 non-HLA SNPs for association with islet autoimmunity (IA) and/or progression to type 1 diabetes (T1D). Our objective was to investigate fourteen additional non-HLA T1D candidate SNPs for stage- and age-related heterogeneity in the etiology of T1D. Of 1634 non-Hispanic white DAISY children genotyped, 132 developed IA (positive for GAD, insulin, or IA-2 autoantibodies at two or more consecutive visits); 50 IA positive children progressed to T1D. Cox regression was used to analyze risk of IA and progression to T1D in IA positive children. Restricted cubic splines were used to model SNPs when there was evidence that risk was not constant with age. C1QTNF6 (rs229541) predicted increased IA risk (HR: 1.57, CI: 1.20–2.05) but not progression to T1D (HR: 1.13, CI: 0.75–1.71). SNP (rs10517086) appears to exhibit an age-related effect on risk of IA, with increased risk before age 2 years (age 2 HR: 1.67, CI: 1.08–2.56) but not older ages (age 4 HR: 0.84, CI: 0.43–1.62). C1QTNF6 (rs229541), SNP (rs10517086), and UBASH3A (rs3788013) were associated with development of T1D. This prospective investigation of non-HLA T1D candidate loci shows that some SNPs may exhibit stage- and age-related heterogeneity in the etiology of T1D. PMID:24367383
Norman E Buroker,1 Xue-Han Ning,1,2,† Zhao-Nian Zhou,3 Kui Li,4 Wei-Jun Cen,4 Xiu-Feng Wu,3 Wei-Zhong Zhu,5 C Ronald Scott,1 Shi-Han Chen1 1Department of Pediatrics, University of Washington, 2Division of Cardiology, Seattle Children’s Hospital Research Foundation, Seattle, WA, USA; 3Laboratory of Hypoxia Physiology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China; 4Lhasa People Hospital, Lhasa, Tibet; 5Center for Cardiovascular Biology and Regenerative Medicine, University of Washington, Seattle, WA, USA †Xue-Han Ning passed away on April 20, 2015
Chronic mountain sickness (CMS) is estimated at 1.2% in Tibetans living at the Qinghai–Tibetan Plateau. Eighteen single-nucleotide polymorphisms (SNPs) from nine nuclear genes that have an association with CMS in Tibetans have been analyzed by using pairwise linkage disequilibrium (LD). The SNPs included are the angiotensin-converting enzyme (rs4340), the angiotensinogen (rs699), and the angiotensin II type 1 receptor (AGTR1) (rs5186) from the renin–angiotensin system. A low-density lipoprotein apolipoprotein B (rs693) SNP was also included. From the hypoxia-inducible factor oxygen signaling pathway, the endothelial Per-Arnt-Sim domain protein 1 (EPAS1) and the egl nine homolog 1 (EGLN1) (rs480902) SNPs were included in the study. SNPs from the vascular endothelial growth factor (VEGF) signaling pathway included are the v-akt murine thymoma viral oncogene homolog 3 (rs4590656 and rs2291409), the endothelial cell nitric oxide synthase 3 (rs1007311 and rs1799983), and the VEGFA (rs699947, rs34357231, rs79469752, rs13207351, rs28357093, rs1570360, rs2010963, and rs3025039). An increase in LD occurred in 40 pairwise comparisons, whereas a decrease in LD was found in 55 pairwise comparisons between the controls and CMS patients. These changes were found to occur within and between signaling pathways, which suggests that there is an interaction between SNP
Minica, C.C.; Dolan, C.V.; Willemsen, G.; Vink, J.M.; Boomsma, D.I.
When phenotypic, but no genotypic data are available for relatives of participants in genetic association studies, previous research has shown that family-based imputed genotypes can boost the statistical power when included in such studies. Here, using simulations, we compared the performance of
Ratcliffe, B; El-Dien, O G; Klápště, J; Porth, I; Chen, C; Jaquish, B; El-Kassaby, Y A
Genomic selection (GS) potentially offers an unparalleled advantage over traditional pedigree-based selection (TS) methods by reducing the time commitment required to carry out a single cycle of tree improvement. This quality is particularly appealing to tree breeders, where lengthy improvement cycles are the norm. We explored the prospect of implementing GS for interior spruce (Picea engelmannii × glauca) utilizing a genotyped population of 769 trees belonging to 25 open-pollinated families. A series of repeated tree height measurements through ages 3-40 years permitted the testing of GS methods temporally. The genotyping-by-sequencing (GBS) platform was used for single nucleotide polymorphism (SNP) discovery in conjunction with three unordered imputation methods applied to a data set with 60% missing information. Further, three diverse GS models were evaluated based on predictive accuracy (PA), and their marker effects. Moderate levels of PA (0.31-0.55) were observed and were of sufficient capacity to deliver improved selection response over TS. Additionally, PA varied substantially through time accordingly with spatial competition among trees. As expected, temporal PA was well correlated with age-age genetic correlation (r=0.99), and decreased substantially with increasing difference in age between the training and validation populations (0.04-0.47). Moreover, our imputation comparisons indicate that k-nearest neighbor and singular value decomposition yielded a greater number of SNPs and gave higher predictive accuracies than imputing with the mean. Furthermore, the ridge regression (rrBLUP) and BayesCπ (BCπ) models both yielded equal, and better PA than the generalized ridge regression heteroscedastic effect model for the traits evaluated.
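The contrast between mean imputation and k-nearest-neighbor imputation of a sparse SNP matrix can be illustrated with a small sketch. It is a simplified illustration, not the study's pipeline: the function names, the 0/1/2 genotype coding, the mean-squared-difference distance, and the toy matrix are all assumptions.

```python
def impute_mean(matrix):
    """Replace None in each SNP column with that column's observed mean."""
    n_cols = len(matrix[0])
    means = []
    for j in range(n_cols):
        seen = [row[j] for row in matrix if row[j] is not None]
        means.append(sum(seen) / len(seen) if seen else 0.0)
    return [[means[j] if v is None else v for j, v in enumerate(row)] for row in matrix]


def impute_knn(matrix, k=2):
    """Replace None with the average genotype of the k nearest samples that have the SNP observed."""
    def distance(a, b):
        # Mean squared difference over SNPs observed in both samples.
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return float("inf")
        return sum((x - y) ** 2 for x, y in shared) / len(shared)

    fallback = impute_mean(matrix)
    result = [list(row) for row in matrix]
    for i, row in enumerate(matrix):
        for j, v in enumerate(row):
            if v is not None:
                continue
            donors = sorted(
                (distance(row, other), other[j])
                for h, other in enumerate(matrix)
                if h != i and other[j] is not None
            )
            donors = [d for d in donors if d[0] != float("inf")][:k]
            # Fall back to the column mean when no comparable donor exists.
            result[i][j] = sum(g for _, g in donors) / len(donors) if donors else fallback[i][j]
    return result


# Hypothetical genotypes (rows = trees, columns = SNPs); tree 1 is missing the last SNP.
genotypes = [
    [0, 0, 0, 0],
    [0, 0, 0, None],
    [2, 2, 2, 2],
    [2, 2, 2, 2],
    [0, 0, 1, 0],
]
by_mean = impute_mean(genotypes)[1][3]
by_knn = impute_knn(genotypes, k=2)[1][3]
```

In this toy matrix the column mean is pulled toward the two dissimilar trees, whereas the nearest neighbors of tree 1 both carry genotype 0. That is the intuition for why neighbor-based imputation can retain more signal than the mean when samples cluster by family.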
Data imputation aims at filling in missing attribute values in databases. Existing imputation approaches to nonquantitative string data can be roughly put into two categories: (1) inferring-based approaches, and (2) retrieving-based approaches. Specifically, the inferring-based approaches find substitutes or estimations for the missing ones from the complete part of the data set. However, they typically fall short in filling in unique missing attribute values which do not exist in the complete part of the data set. The retrieving-based approaches resort to external resources for help by formulating proper web search queries to retrieve web pages containing the missing values from the Web, and then extracting the missing values from the retrieved web pages. This web-based retrieving approach reaches a high imputation precision and recall, but on the other hand, issues a large number of web search queries, which brings a large overhead. © 2016 IEEE.
Siddique, Juned; Harel, Ofer; Crespi, Catherine M.
We present a framework for generating multiple imputations for continuous data when the missing data mechanism is unknown. Imputations are generated from more than one imputation model in order to incorporate uncertainty regarding the missing data mechanism. Parameter estimates based on the different imputation models are combined using rules for nested multiple imputation. Through the use of simulation, we investigate the impact of missing data mechanism uncertainty on post-imputation infere...
Wallace, Meredith L; Anderson, Stewart J; Mazumdar, Sati
Missing covariate data present a challenge to tree-structured methodology due to the fact that a single tree model, as opposed to an estimated parameter value, may be desired for use in a clinical setting. To address this problem, we suggest a multiple imputation algorithm that adds draws of stochastic error to a tree-based single imputation method presented by Conversano and Siciliano (Technical Report, University of Naples, 2003). Unlike previously proposed techniques for accommodating missing covariate data in tree-structured analyses, our methodology allows the modeling of complex and nonlinear covariate structures while still resulting in a single tree model. We perform a simulation study to evaluate our stochastic multiple imputation algorithm when covariate data are missing at random and compare it to other currently used methods. Our algorithm is advantageous for identifying the true underlying covariate structure when complex data and larger percentages of missing covariate observations are present. It is competitive with other current methods with respect to prediction accuracy. To illustrate our algorithm, we create a tree-structured survival model for predicting time to treatment response in older, depressed adults. Copyright © 2010 John Wiley & Sons, Ltd.
Tilling, Kate; Williamson, Elizabeth J; Spratt, Michael; Sterne, Jonathan A C; Carpenter, James R
Missing data are a pervasive problem, often leading to bias in complete records analysis (CRA). Multiple imputation (MI) via chained equations is one solution, but its use in the presence of interactions is not straightforward. We simulated data with outcome Y dependent on binary explanatory variables X and Z and their interaction XZ. Six scenarios were simulated (Y continuous and binary, each with no interaction, a weak and a strong interaction), under five missing data mechanisms. We use directed acyclic graphs to identify when CRA and MI would each be unbiased. We evaluate the performance of CRA, MI without interactions, MI including all interactions, and stratified imputation. We also illustrated these methods using a simple example from the National Child Development Study (NCDS). MI excluding interactions is invalid and resulted in biased estimates and low coverage. When XZ was zero, MI excluding interactions gave unbiased estimates but overcoverage. MI including interactions and stratified MI gave equivalent, valid inference in all cases. In the NCDS example, MI excluding interactions incorrectly concluded there was no evidence for an important interaction. Epidemiologists carrying out MI should ensure that their imputation model(s) are compatible with their analysis model. Copyright © 2016 The Author(s). Published by Elsevier Inc. All rights reserved.
...) General Principles Relating to Suspension and Debarment Actions § 919.630 May the OPM impute conduct of one person to another? For purposes of actions taken under this rule, we may impute conduct as follows...
This report presents findings from the 2006 National Census of Ferry Operators (NCFO) augmented with imputed values for passengers and passenger miles. Due to the imputation procedures used to calculate missing data, totals in Table 1 may not corresp...
Background: Single nucleotide polymorphisms (SNPs) are the most common source of genetic variation in eukaryotic species and have become an important marker for genetic studies. The mosquito Anopheles funestus is one of the major malaria vectors in Africa and yet, prior to this study, no SNPs have been described for this species. Here we report a genome-wide set of SNP markers for use in genetic studies on this important human disease vector. Results: DNA fragments from 50 genes were amplified and sequenced from 21 specimens of An. funestus. A third of specimens were field collected in Malawi, a third from a colony of Mozambican origin and a third from a colony of Angolan origin. A total of 494 SNPs including 303 within the coding regions of genes and 5 indels were identified. The physical positions of these SNPs in the genome are known. There were on average 7 SNPs per kilobase, similar to that observed in An. gambiae and Drosophila melanogaster. Transitions outnumbered transversions, at a ratio of 2:1. The increased frequency of transition substitutions in coding regions is likely due to the structure of the genetic code and selective constraints. Synonymous sites within coding regions showed a higher polymorphism rate than non-coding introns or 3' and 5' flanking DNA, with most of the substitutions in coding regions being observed at the 3rd codon position. A positive correlation in the level of polymorphism was observed between coding and non-coding regions within a gene. By genotyping a subset of 30 SNPs, we confirmed the validity of the SNPs identified during this study. Conclusion: This set of SNP markers represents a useful tool for genetic studies in An. funestus, and will be useful in identifying candidate genes that affect diverse ranges of phenotypes that impact on vector control, such as insecticide resistance, mosquito behavior and vector competence.
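A transition-to-transversion ratio such as the 2:1 reported above is obtained by classifying each biallelic SNP: a transition swaps two purines (A, G) or two pyrimidines (C, T), and a transversion swaps a purine for a pyrimidine. The sketch below shows that classification; the function names and the example allele pairs are hypothetical and are not data from the study.

```python
from collections import Counter

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_snp(ref, alt):
    """Classify a biallelic SNP as a 'transition' or a 'transversion'."""
    ref, alt = ref.upper(), alt.upper()
    if ref == alt or not {ref, alt} <= PURINES | PYRIMIDINES:
        raise ValueError("expected two different bases from A, C, G, T")
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def ts_tv_ratio(snps):
    """Ratio of transitions to transversions over (ref, alt) pairs."""
    counts = Counter(classify_snp(ref, alt) for ref, alt in snps)
    if counts["transversion"] == 0:
        raise ValueError("no transversions observed; ratio undefined")
    return counts["transition"] / counts["transversion"]


# Hypothetical allele pairs: four transitions and two transversions.
example = [("A", "G"), ("C", "T"), ("A", "T"), ("G", "A"), ("T", "C"), ("G", "C")]
ratio = ts_tv_ratio(example)
```

There are twice as many possible transversions as transitions, so a 2:1 ratio in favor of transitions reflects a strong substitution bias rather than chance.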
Eekhout, Iris; de Vet, Henrica C. W.; Twisk, Jos W. R.; Brand, Jaap P. L.; de Boer, Michiel R.; Heymans, Martijn W.
Objectives: Regardless of the proportion of missing values, complete-case analysis is most frequently applied, although advanced techniques such as multiple imputation (MI) are available. The objective of this study was to explore the performance of simple and more advanced methods for handling
Salas, Antonio; Amigo, Jorge
The high levels of variation characterising the mitochondrial DNA (mtDNA) molecule are due ultimately to its high average mutation rate; moreover, mtDNA variation is deeply structured in different populations and ethnic groups. There is growing interest in selecting a reduced number of mtDNA single nucleotide polymorphisms (mtSNPs) that account for the maximum level of discrimination power in a given population. Applications of the selected mtSNP panel range from anthropologic and medical studies to forensic genetic casework. This study proposes a new simulation-based method that explores the ability of different mtSNP panels to yield the maximum levels of discrimination power. The method explores subsets of mtSNPs of different sizes randomly chosen from a preselected panel of mtSNPs based on frequency. More than 2,000 complete genomes representing three main continental human population groups (Africa, Europe, and Asia) and two admixed populations ("African-Americans" and "Hispanics") were collected from GenBank and the literature, and were used as training sets. Haplotype diversity was measured for each combination of mtSNP and compared with existing mtSNP panels available in the literature. The data indicates that only a reduced number of mtSNPs ranging from six to 22 are needed to account for 95% of the maximum haplotype diversity of a given population sample. However, only a small proportion of the best mtSNPs are shared between populations, indicating that there is not a perfect set of "universal" mtSNPs suitable for all population contexts. The discrimination power provided by these mtSNPs is much higher than the power of the mtSNP panels proposed in the literature to date. Some mtSNP combinations also yield high diversity values in admixed populations. The proposed computational approach for exploring combinations of mtSNPs that optimise the discrimination power of a given set of mtSNPs is more efficient than previous empirical approaches. In contrast to
Brand, Jaap P.L.; van Buuren, Stef; Groothuis-Oudshoorn, Karin; Gelsema, Edzard S.
This paper outlines a strategy to validate multiple imputation methods. Rubin's criteria for proper multiple imputation are the point of departure. We describe a simulation method that yields insight into various aspects of bias and efficiency of the imputation process. We propose a new method for
van Ginkel, Joost R.; van der Ark, L. Andries; Sijtsma, Klaas
The performance of five simple multiple imputation methods for dealing with missing data was compared. In addition, random imputation and multivariate normal imputation were used as lower and upper benchmarks, respectively. Test data were simulated and item scores were deleted such that they were either missing completely at random, missing at…
Gottschall, Amanda C.; West, Stephen G.; Enders, Craig K.
Behavioral science researchers routinely use scale scores that sum or average a set of questionnaire items to address their substantive questions. A researcher applying multiple imputation to incomplete questionnaire data can either impute the incomplete items prior to computing scale scores or impute the scale scores directly from other scale…
Stanley J. Zarnoch; H. Ken Cordell; Carter J. Betz; John C. Bergstrom
Multiple imputation is used to create values for missing family income data in the National Survey on Recreation and the Environment. We present an overview of the survey and a description of the missingness pattern for family income and other key variables. We create a logistic model for the multiple imputation process and use it to impute data sets for family income. We...
Background Multiple imputation is becoming increasingly popular. Theoretical considerations as well as simulation studies have shown that the inclusion of auxiliary variables is generally of benefit. Methods A simulation study of a linear regression with a response Y and two predictors X1 and X2 was performed on data with n = 50, 100 and 200 using complete cases or multiple imputation with 0, 10, 20, 40 and 80 auxiliary variables. Mechanisms of missingness were either 100% MCAR or 50% MAR + 50% MCAR. Auxiliary variables had low (r = .10) vs. moderate (r = .50) correlations with the X's and Y. Results The inclusion of auxiliary variables can improve a multiple imputation model. However, inclusion of too many variables leads to downward bias of regression coefficients and decreases precision. When the correlations are low, inclusion of auxiliary variables is not useful. Conclusion More research on auxiliary variables in multiple imputation should be performed. A preliminary rule of thumb could be that the ratio of variables to cases with complete data should not go below 1 : 3.
Hara, Kazuo; Fujita, Hayato; Johnson, Todd A
Although over 60 loci for type 2 diabetes (T2D) have been identified, there still remains a large genetic component to be clarified. To explore unidentified loci for T2D, we performed a genome-wide association study (GWAS) of 6 209 637 single-nucleotide polymorphisms (SNPs), which were directly genotyped or imputed using East Asian references from the 1000 Genomes Project (June 2011 release) in 5976 Japanese patients with T2D and 20 829 nondiabetic individuals. Nineteen unreported loci were selected and taken forward to follow-up analyses. Combined discovery and follow-up analyses (30 392 cases... (rs312457; risk allele = G; RAF = 0.078; P = 7.69 × 10⁻¹³; OR = 1.20). This study demonstrates that GWASs based on the imputation of genotypes using modern reference haplotypes such as those from the 1000 Genomes Project data can assist in identification of new loci for common diseases.
Enders, Craig K.; Gottschall, Amanda C.
Although structural equation modeling software packages use maximum likelihood estimation by default, there are situations where one might prefer to use multiple imputation to handle missing data rather than maximum likelihood estimation (e.g., when incorporating auxiliary variables). The selection of variables is one of the nuances associated…
Storey, Philip; Murchison, Ann P; Dai, Yang; Hark, Lisa; Pizzi, Laura T; Leiby, Benjamin E; Haller, Julia A
To compare methodologies for imputing ethnicity in an urban ophthalmology clinic. Using data from 19,165 patients with self-reported ethnicity, surname, and home address, we compared the accuracy of three methodologies for imputing ethnicity: (1) a surname method based on tabulation from the 2000 US Census; (2) a geocoding method based on tract data from the 2010 US Census; and (3) a combined surname geocoding method using Bayes' theorem. The combined surname geocoding model had the highest accuracy of the three methodologies, imputing black ethnicity with a sensitivity of 84% and positive predictive value (PPV) of 94%, white ethnicity with a sensitivity of 92% and PPV of 82%, Hispanic ethnicity with a sensitivity of 77% and PPV of 71%, and Asian ethnicity with a sensitivity of 83% and PPV of 79%. Overall agreement of imputed and self-reported ethnicity was fair for the surname method (κ 0.23), moderate for the geocoding method (κ 0.58), and strong for the combined method (κ 0.76). A methodology combining surname analysis and Census tract data using Bayes' theorem to determine ethnicity is superior to other methods tested and is ideally suited for research purposes of clinical and administrative data.
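The combination step in method (3) can be sketched as below. This is only an illustration under stated assumptions: it takes surname and Census tract to be conditionally independent given ethnicity, so that P(ethnicity | surname, tract) ∝ P(ethnicity | surname) × P(tract | ethnicity). The abstract does not spell out the exact formula, and the function name and all probabilities here are hypothetical rather than Census values.

```python
def combine_surname_geocode(p_group_given_surname, p_tract_given_group):
    """Posterior P(group | surname, tract), assuming conditional independence."""
    unnormalized = {
        group: p_group_given_surname[group] * p_tract_given_group[group]
        for group in p_group_given_surname
    }
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("no group has non-zero probability")
    # Normalize so the posterior probabilities sum to 1.
    return {group: value / total for group, value in unnormalized.items()}

# Hypothetical inputs, for illustration only (not Census values).
surname_prior = {"black": 0.30, "white": 0.60, "hispanic": 0.05, "asian": 0.05}
tract_likelihood = {"black": 0.020, "white": 0.002, "hispanic": 0.004, "asian": 0.001}

posterior = combine_surname_geocode(surname_prior, tract_likelihood)
most_likely = max(posterior, key=posterior.get)
print(most_likely, round(posterior[most_likely], 3))
```

In this toy case the surname alone favors "white", but the tract information shifts the posterior toward "black", which shows how the two sources can correct each other.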
Background Standard mean imputation for missing values in the Western Ontario and McMaster (WOMAC) Osteoarthritis Index limits the use of collected data and may lead to bias. Probability model-based imputation methods overcome such limitations but were never before applied to the WOMAC. In this study, we compare imputation results for the Expectation Maximization (EM) method and the mean imputation method for WOMAC in a cohort of total hip replacement patients. Methods WOMAC data on a consecutive cohort of 2062 patients scheduled for surgery were analyzed. Rates of missing values in each of the WOMAC items from this large cohort were used to create missing patterns in the subset of patients with complete data. EM and the WOMAC's method of imputation were then applied to fill the missing values. Summary score statistics for both methods were then described through box-plots and contrasted with the complete case (CC) analysis and the true score (TS). This process was repeated using a smaller sample size of 200 randomly drawn patients with a higher missing rate (5 times the rates of missing values observed in the 2062 patients, capped at 45%). Results The rate of missing values per item ranged from 2.9% to 14.5% and 1339 patients had complete data. Probability model-based EM imputed a score for all subjects while WOMAC's imputation method did not. Mean subscale scores were very similar for both imputation methods and were similar to the true score; however, the EM method results were more consistent with the TS after simulation. This difference became more pronounced as the number of items in a subscale increased and the sample size decreased. Conclusions The EM method provides a better alternative to the WOMAC imputation method. The EM method is more accurate and imputes data to create a complete data set. These features are very valuable for patient-reported outcomes research in which resources are limited and the WOMAC score is used in a multivariate
Ghodke, Yogita; Chopra, Arvind; Shintre, Pooja; Puranik, Amrutesh; Joshi, Kalpana; Patwardhan, Bhushan
Many pharmacologically relevant polymorphisms show variability among different populations. Though limited, data from Caucasian subjects have reported several single nucleotide polymorphisms (SNPs) in the folate biosynthetic pathway. These SNPs may be subject to racial and ethnic differences. We carried out a study to determine the allelic frequencies of these SNPs in an Indian ethnic population. Whole blood samples were withdrawn from 144 unrelated healthy subjects from west India. DNA was extracted and genotyping was performed using PCR-RFLP and real-time TaqMan allelic discrimination for 12 polymorphisms in 9 genes of folate-methotrexate (MTX) metabolism. Allele frequencies were obtained for MTHFR 677T (10%) and 1298 C (30%), TS 3UTR 0bp (46%), MDR1 3435T and 1236T (62%), RFC1 80A (57%), GGH 401T (61%), MS 2756G (34%), ATIC 347G (52%) and SHMT1 1420T (80%) in healthy subjects (frequencies of underlined SNPs were different from published study data of European and African populations). The current study describes the distribution of folate biosynthetic pathway SNPs in healthy Indians and validates the previous finding of differences due to race and ethnicity. Our results pave the way to study the pharmacogenomics of MTX in the Indian population.
Sarkar Roy, N; Farheen, S; Roy, N; Sengupta, S; Majumder, P P
Isolated population groups are useful in conducting association studies of complex diseases to avoid various pitfalls, including those arising from population stratification. Since DNA resequencing is expensive, it is recommended that genotyping be carried out at tagSNP (tSNP) loci. For this, tSNPs identified in one isolated population need to be used in another. Unless tSNPs are highly portable across populations this strategy may result in loss of information in association studies. We examined the issue of tSNP portability by sampling individuals from 10 isolated ethnic groups from India. We generated DNA resequencing data pertaining to 3 genomic regions and identified tSNPs in each population. We defined an index of tSNP portability and showed that portability is low across isolated Indian ethnic groups. The extent of portability did not significantly correlate with genetic similarity among the populations studied here. We also analyzed our data with sequence data from individuals of African and European descent. Our results indicated that it may be necessary to carry out resequencing in a small number of individuals to discover SNPs and identify tSNPs in the specific isolated population in which a disease association study is to be conducted.
Tomas, Carmen; Sanchez, Juan Jose; Castro, J.A.
(SNPs) in relationship testing have been published. We selected 25 highly polymorphic biallelic SNPs distributed through the human X-chromosome. One 25-plex PCR reaction and one 25-plex single base extension (SNaPshot) reaction were developed. The maximum size of the PCR products was 120 bp and the size...
In order to reveal the single nucleotide polymorphisms (SNPs), genotypes and allelic frequencies of each mutation site of TLR7 gene in Chinese native duck breeds, SNPs of duck TLR7 gene were detected by DNA sequencing. The genotypes of 465 native ducks from eight key protected duck breeds were determined by ...
Liu, Xiaonan; Zhang, Chao; Liu, Kewu; Wang, Han; Lu, Chaoxia; Li, Hang; Hua, Kai; Zhu, Juanli; Hui, Wenli; Cui, Yali; Zhang, Xue
Single nucleotide polymorphisms (SNPs) are closely related to genetic diseases, but current SNP detection methods, such as DNA microarrays that involve tedious procedures and expensive, sophisticated instruments, are unable to perform rapid SNP detection in clinical practice, especially for multiple SNPs related to genetic diseases. In this study, we report a sensitive, low-cost, and easy-to-use point-of-care testing (POCT) system formed by combining amplification refractory mutation system (ARMS) polymerase chain reaction with gold magnetic nanoparticles (GMNPs) and lateral flow assay (LFA), denoted the ARMS-LFA system, which allows us to use a uniform condition for the simultaneous detection of multiple SNPs. The genotyping results can be read automatically by a magnetic reader or interpreted visually according to the captured GMNP probes on the test and control lines of the LFA device. The high sensitivity (a detection limit of 0.04 pg/μL with plasmid) and specificity of this testing system were established by genotyping seven pathogenic SNPs in the phenylalanine hydroxylase gene (PAH, the etiological factor of phenylketonuria). This system can also be applied in DNA quantification with a linear range from 0.02 to 2 pg/μL of plasmid. Furthermore, this ARMS-LFA system was applied to clinical trials for screening the seven pathogenic SNPs in PAH of 23 families including 69 individuals. The concordance rate of the genotyping results detected by the ARMS-LFA system was up to 97.8% compared with the DNA sequencing results. This method is a very promising POCT for the detection of multiple SNPs related to genetic diseases.
Eekhout, Iris; van de Wiel, Mark A; Heymans, Martijn W
Multiple imputation is a recommended method to handle missing data. For significance testing after multiple imputation, Rubin's Rules (RR) are easily applied to pool parameter estimates. In a logistic regression model, different methods are available to consider whether a categorical covariate with more than two levels significantly contributes to the model. Examples include pooling chi-square tests with multiple degrees of freedom, pooling likelihood ratio test statistics, and pooling based on the covariance matrix of the regression model. These methods are more complex than RR and are not available in all mainstream statistical software packages. In addition, they do not always obtain optimal power levels. We argue that the median of the p-values from the overall significance tests from the analyses on the imputed datasets can be used as an alternative pooling rule for categorical variables. The aim of the current study is to compare different methods to test a categorical variable for significance after multiple imputation on applicability and power. In a large simulation study, we demonstrated the control of the type I error and power levels of different pooling methods for categorical variables. This simulation study showed that for non-significant categorical covariates the type I error is controlled and the statistical power of the median pooling rule was at least equal to current multiple parameter tests. An empirical data example showed similar results. It can therefore be concluded that using the median of the p-values from the imputed data analyses is an attractive and easy-to-use alternative method for significance testing of categorical variables.
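The median pooling rule proposed in this abstract is simple enough to sketch directly. The snippet below is an illustration only: the function name is hypothetical, and the p-values stand in for the overall significance test of a categorical covariate fitted separately on each imputed dataset.

```python
from statistics import median

def median_p_pool(p_values):
    """Pool overall-test p-values across imputed datasets by taking the median."""
    if not p_values:
        raise ValueError("need at least one p-value")
    if any(not 0.0 <= p <= 1.0 for p in p_values):
        raise ValueError("p-values must lie in [0, 1]")
    return median(p_values)

# Hypothetical p-values from the same overall test on m = 5 imputed datasets.
p_per_imputation = [0.031, 0.048, 0.027, 0.062, 0.040]
pooled_p = median_p_pool(p_per_imputation)
print(pooled_p)
```

The appeal of the rule is that it needs only the per-dataset p-values, not the covariance matrices or likelihoods required by the multi-parameter pooling methods.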
Missing genotypes are a common feature of high density SNP datasets obtained using SNP chip technology and this is likely to decrease the accuracy of genomic selection. This problem can be circumvented by imputing the missing genotypes with estimated genotypes. When implementing imputation, the criteria used for SNP data quality control and whether to perform imputation before or after data quality control need to be considered. In this paper, we compared six strategies of imputation and quality control using different imputation methods, different quality control criteria and by changing the order of imputation and quality control, against a real dataset of milk production traits in Chinese Holstein cattle. The results demonstrated that, no matter what imputation method and quality control criteria were used, strategies with imputation before quality control performed better than strategies with imputation after quality control in terms of accuracy of genomic selection. The different imputation methods and quality control criteria did not significantly influence the accuracy of genomic selection. We concluded that performing imputation before quality control could increase the accuracy of genomic selection, especially when the rate of missing genotypes is high and the reference population is small.
Control-based pattern mixture models (PMM) and delta-adjusted PMMs are commonly used as sensitivity analyses in clinical trials with non-ignorable dropout. These PMMs assume that the statistical behavior of outcomes varies by pattern in the experimental arm in the imputation procedure, but the imputed data are typically analyzed by a standard method such as the primary analysis model. In the multiple imputation (MI) inference, Rubin's variance estimator is generally biased when the imputation and analysis models are uncongenial. One objective of the article is to quantify the bias of Rubin's variance estimator in the control-based and delta-adjusted PMMs for longitudinal continuous outcomes. These PMMs assume the same observed data distribution as the mixed effects model for repeated measures (MMRM). We derive analytic expressions for the MI treatment effect estimator and the associated Rubin's variance in these PMMs and MMRM as functions of the maximum likelihood estimator from the MMRM analysis and the observed proportion of subjects in each dropout pattern when the number of imputations is infinite. The asymptotic bias is generally small or negligible in the delta-adjusted PMM, but can be sizable in the control-based PMM. This indicates that the inference based on Rubin's rule is approximately valid in the delta-adjusted PMM. A simple variance estimator is proposed to ensure asymptotically valid MI inferences in these PMMs, and compared with the bootstrap variance. The proposed method is illustrated by the analysis of an antidepressant trial, and its performance is further evaluated via a simulation study. © 2017, The International Biometric Society.
Tsui, Circe; Coleman, Laura E.; Griffith, Jacqulyn L.; Bennett, E. Andrew; Goodson, Summer G.; Scott, Jason D.; Pittard, W. Stephen; Devine, Scott E.
An international effort is underway to generate a comprehensive haplotype map (HapMap) of the human genome represented by an estimated 300 000 to 1 million ‘tag’ single nucleotide polymorphisms (SNPs). Our analysis indicates that the current human SNP map is not sufficiently dense to support the HapMap project. For example, 24.6% of the genome currently lacks SNPs at the minimal density and spacing that would be required to construct even a conservative tag SNP map containing 300 000 SNPs. In an effort to improve the human SNP map, we identified 140 696 additional SNP candidates using a new bioinformatics pipeline. Over 51 000 of these SNPs mapped to the largest gaps in the human SNP map, leading to significant improvements in these regions. Our SNPs will be immediately useful for the HapMap project, and will allow for the inclusion of many additional genomic intervals in the final HapMap. Nevertheless, our results also indicate that additional SNP discovery projects will be required both to define the haplotype architecture of the human genome and to construct comprehensive tag SNP maps that will be useful for genetic linkage studies in humans. PMID:12907734
Jonathan La Mantia
Populus species are currently being domesticated through intensive time- and resource-dependent programs for utilization in phytoremediation, wood and paper products, and conversion to biofuels. Poplar leaf rust disease can greatly reduce wood volume. Genetic resistance is effective in reducing economic losses but major resistance loci have been race-specific and can be readily defeated by the pathogen. Developing durable disease resistance requires the identification of non-race-specific loci. In the presented study, area under the disease progress curve was calculated from natural infection of Melampsora ×columbiana in three consecutive years. Association analysis was performed using 412 P. trichocarpa clones genotyped with 29,355 SNPs covering 3,543 genes. We found 40 SNPs within 26 unique genes significantly associated (permuted P<0.05) with poplar rust severity. Moreover, two SNPs were repeated in all three years, suggesting non-race-specificity, and three additional SNPs were differentially expressed in other poplar rust interactions. These five SNPs were found in genes that have orthologs in Arabidopsis with functionality in pathogen-induced transcriptome reprogramming, Ca²⁺/calmodulin and salicylic acid signaling, and tolerance to reactive oxygen species. The additive effect of non-R gene functional variants may constitute high levels of durable poplar leaf rust resistance. Therefore, these findings are of significance for speeding the genetic improvement of this long-lived, economically important organism.
Koefoed, Pernille; Andreassen, Ole A; Bennike, Bente
of complex diseases, it may be useful to look at combinations of genotypes. Genes related to signal transmission, e.g., ion channel genes, may be of interest in this respect in the context of bipolar disorder. In the present study, we analysed 803 SNPs in 55 genes related to aspects of signal transmission... and calculated all combinations of three genotypes from the 3×803 SNP genotypes for 1355 controls and 607 patients with bipolar disorder. Four clusters of patient-specific combinations were identified. Permutation tests indicated that some of these combinations might be related to bipolar disorder. The WTCCC... in the clusters in the two datasets. The present analyses of the combinations of SNP genotypes support a role for both genetic heterogeneity and interactions in the genetic architecture of bipolar disorder.
The aim of this study was to identify SNPs in the leptin (LEP), leptin receptor (LEPR), growth hormone (GH) and specific pituitary transcription factor (Pit-1) genes in order to analyze the genetic structure of a population of Charolais bulls. Genomic DNA samples were taken from 52 breeding bulls and analyzed by the PCR-RFLP method. After digestion with restriction enzymes, the following allele frequencies were detected in the bull population: LEP/Sau3AI A 0.83 and B 0.17 (±0.037); LEPR/BseGI C 0.95 and T 0.05 (±0.021); GH/AluI L 0.62 and V 0.38 (±0.048); and Pit1/HinfI A 0.40 and B 0.60 (±0.048). Based on observed vs. expected genotype frequencies, the population was in Hardy-Weinberg equilibrium (P>0.05) at all loci except Pit-1, where disequilibrium was detected. The predominant genotypes in the analyzed breeding bulls were LEP/Sau3AI AA (0.69), LEPR/T945M CC (0.90), GH/AluI LL (0.43) and Pit-1/HinfI AB (0.65). The observed heterozygosity of the SNPs was reflected in low (LEP/Sau3AI 0.248 and LEPR/T945M 0.088) or medium (GH/AluI 0.366 and Pit-1/HinfI 0.370) polymorphic information content. In estimating genetic variability, negative (LEPR/T945M and Pit-1/HinfI) and positive (LEP/Sau3AI and GH/AluI) values of the fixation index FIS were observed, indicating a slight heterozygote excess or deficiency, respectively, depending on the genetic marker analyzed.
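The quantities reported in this abstract (allele frequency with its standard error, a Hardy-Weinberg test, and polymorphic information content) can be computed from genotype counts as sketched below. This is an illustrative sketch, not the authors' analysis: the abstract gives frequencies rather than counts, so the genotype counts are hypothetical values chosen to resemble the Pit-1/HinfI locus in 52 animals, and the standard error sqrt(pq/2n), the one-degree-of-freedom chi-square test, and the biallelic PIC formula are assumed to be the methods used.

```python
import math

def allele_stats(n_aa, n_ab, n_bb):
    """Allele frequencies (p for A, q for B) and the binomial standard error of p."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)
    q = 1.0 - p
    se = math.sqrt(p * q / (2 * n))
    return p, q, se

def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square goodness-of-fit test against Hardy-Weinberg proportions (1 df)."""
    n = n_aa + n_ab + n_bb
    p, q, _ = allele_stats(n_aa, n_ab, n_bb)
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

def pic_biallelic(p):
    """Polymorphic information content for a biallelic marker."""
    q = 1.0 - p
    return 1.0 - (p * p + q * q) - 2.0 * p * p * q * q

# Hypothetical genotype counts (AA, AB, BB) for 52 animals; not taken from the paper.
counts = (4, 34, 14)
p, q, se = allele_stats(*counts)
chi2, p_value = hwe_chi_square(*counts)
pic = pic_biallelic(p)
print(f"p={p:.2f} q={q:.2f} se={se:.3f} chi2={chi2:.2f} P={p_value:.4f} PIC={pic:.3f}")
```

With an excess of heterozygotes such as this, the test rejects Hardy-Weinberg equilibrium at the 0.05 level, which is the pattern the abstract describes for the Pit-1 locus.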
Background Genotyping technologies enable us to genotype multiple Single Nucleotide Polymorphisms (SNPs) within selected genes/regions, providing data for haplotype association analysis. While haplotype-based association analysis is powerful for detecting untyped causal alleles in linkage disequilibrium (LD) with neighboring SNPs/haplotypes, the inclusion of extraneous SNPs could reduce its power by increasing the number of haplotypes with each additional SNP. Methods Here, we propose a haplotype-based stepwise procedure (HBSP) to eliminate extraneous SNPs. To evaluate its properties, we applied HBSP to both simulated and real data, generated from a study of genetic associations of the bactericidal/permeability-increasing (BPI) gene with pulmonary function in a cohort of patients following bone marrow transplantation. Results Under the null hypothesis, use of the HBSP gave results that retained the desired false positive error rates when multiple comparisons were considered. Under various alternative hypotheses, HBSP had adequate power to detect modest genetic associations in case-control studies with 500, 1,000 or 2,000 subjects. In the current application, HBSP led to the identification of two specific SNPs with a positive validation. Conclusion These results demonstrate that HBSP retains the essence of haplotype-based association analysis while improving analytic power by excluding extraneous SNPs. Minimizing the number of SNPs also enables simpler interpretation and more cost-effective applications.
Watson, Corey T.; Disanto, Giulio; Breden, Felix; Giovannoni, Gavin; Ramagopalan, Sreeram V.
Multiple sclerosis (MS) is a complex disease with underlying genetic and environmental factors. Although alleles within the major histocompatibility complex (MHC) are known to exert strong effects on MS risk, much remains to be learned about the contributions of loci with more modest effects identified by genome-wide association studies (GWASs), as well as loci that remain undiscovered. We use a recently developed method to estimate the proportion of variance in disease liability explained by 475,806 single nucleotide polymorphisms (SNPs) genotyped in 1,854 MS cases and 5,164 controls. We reveal that ~30% of MS genetic liability is explained by SNPs in this dataset, the majority of which is accounted for by common variants. These results suggest that the unaccounted-for proportion could be explained by variants that are in imperfect linkage disequilibrium with common GWAS SNPs, highlighting the potential importance of rare variants in the susceptibility to MS.
Yucesoy, Berran; Talzhanov, Yerkebulan; Michael Barmada, M; Johnson, Victor J; Kashon, Michael L; Baron, Elma; Wilson, Nevin W; Frye, Bonnie; Wang, Wei; Fluharty, Kara; Gharib, Rola; Meade, Jean; Germolec, Dori; Luster, Michael I; Nedorost, Susan
Irritant contact dermatitis is the most common work-related skin disease, especially affecting workers in "wet-work" occupations. This study was conducted to investigate the association between single nucleotide polymorphisms (SNPs) within the major histocompatibility complex (MHC) and skin irritant response in a group of healthcare workers. 585 volunteer healthcare workers were genotyped for MHC SNPs and patch tested with three different irritants: sodium lauryl sulfate (SLS), sodium hydroxide (NaOH) and benzalkonium chloride (BKC). Genotyping was performed using Illumina Goldengate MHC panels. A number of SNPs within the MHC Class I (OR2B3, TRIM31, TRIM10, TRIM40 and IER3), Class II (HLA-DPA1, HLA-DPB1) and Class III (C2) genes were associated (p MHC can influence chemical-induced skin irritation and may explain the connection between inflamed skin and propensity to subsequent allergic contact sensitization.
Ni, Jiayi; Leong, Aaron; Dasgupta, Kaberi; Rahme, Elham
Outcome misclassification may occur in observational studies using administrative databases. We evaluated a two-step multiple imputation approach based on complementary internal validation data obtained from two subsamples of study participants to reduce bias in hazard ratio (HR) estimates in Cox regressions. We illustrated this approach using data from a surveyed sample of 6247 individuals in a study of statin-diabetes association in Quebec. We corrected diabetes status and onset assessed from health administrative data against self-reported diabetes and/or elevated fasting blood glucose (FBG) assessed in subsamples. The association between statin use and new onset diabetes was evaluated using administrative data and the corrected data. By simulation, we assessed the performance of this method varying the true HR, sensitivity, specificity, and the size of validation subsamples. The adjusted HR of new onset diabetes among statin users versus non-users was 1.61 (95% confidence interval: 1.09-2.38) using administrative data only, 1.49 (0.95-2.34) when diabetes status and onset were corrected based on self-report and undiagnosed diabetes (FBG ≥ 7 mmol/L), and 1.36 (0.92-2.01) when corrected for self-report and undiagnosed diabetes/impaired FBG (≥ 6 mmol/L). In simulations, the multiple imputation approach yielded less biased HR estimates and appropriate coverage for both non-differential and differential misclassification. Large variations in the corrected HR estimates were observed using validation subsamples with low participation proportion. The bias correction was sometimes outweighed by the uncertainty introduced by the unknown time of event occurrence. Multiple imputation is useful to correct for outcome misclassification in time-to-event analyses if complementary validation data are available from subsamples. Copyright © 2017 John Wiley & Sons, Ltd.
Incomplete unemployment data is a fundamental problem when evaluating labour market policies in several countries. Many unemployment spells end for unknown reasons; in the Swedish Public Employment Service's register, as many as 20 percent do. This leads to ambiguity regarding destination states (employment, unemployment, retired, etc.). According to complete combined administrative data, the employment rate among dropouts was close to 50 percent for the years 1992 to 2006, but from 2007 the employment rate has dropped to 40 percent or less. This article explores an imputation approach. We investigate imputation models estimated both on survey data from 2005/2006 and on complete combined administrative data from 2005/2006 and 2011/2012. The models are evaluated in terms of their ability to make correct predictions. The models have relatively high predictive power.
The purpose of this study is to demonstrate a way of dealing with missing data in clustered randomized trials by doing multiple imputation (MI) with the PAN package in R through SAS. The procedure for doing MI with PAN through SAS is demonstrated in detail in order for researchers to be able to use this procedure with their own data. An illustration of the technique with empirical data was also included. In this illustration the PAN results were compared with pairwise deletion and three types of MI: (1) Normal Model (NM)-MI ignoring the cluster structure; (2) NM-MI with dummy-coded cluster variables (fixed cluster structure); and (3) a hybrid NM-MI which imputes half the time ignoring the cluster structure, and the other half including the dummy-coded cluster variables. The empirical analysis showed that using PAN and the other strategies produced comparable parameter estimates. However, the dummy-coded MI overestimated the intraclass correlation, whereas MI ignoring the cluster structure and the hybrid MI underestimated the intraclass correlation. When compared with PAN, the p-value and standard error for the treatment effect were higher with dummy-coded MI, and lower with MI ignoring the cluster structure, the hybrid MI approach, and pairwise deletion. Previous studies have shown that NM-MI is not appropriate for handling missing data in clustered randomized trials. This approach, in addition to the pairwise deletion approach, leads to a biased intraclass correlation and faulty statistical conclusions. Imputation in clustered randomized trials should be performed with PAN. We have demonstrated an easy way for using PAN through SAS.
Multiple imputation is a popular approach to handling missing data. Although it was originally motivated by survey nonresponse problems, it has been readily applied to other data settings. However, its general behavior still remains unclear when applied to survey data with complex sample designs, including clustering. Recently, Lewis et al. (2014) compared single- and multiple-imputation analyses for certain incomplete variables in the 2008 National Ambulatory Medical Care Survey, which has a nationally representative, multistage, and clustered sampling design. Their study results suggested that the increase of the variance estimate due to multiple imputation compared with single imputation largely disappears for estimates with large design effects. We complement their empirical research by providing some theoretical reasoning. We consider data sampled from an equally weighted, single-stage cluster design and characterize the process using a balanced, one-way normal random-effects model. Assuming that the missingness is completely at random, we derive analytic expressions for the within- and between-multiple-imputation variance estimators for the mean estimator, and thus conveniently reveal the impact of design effects on these variance estimators. We propose approximations for the fraction of missing information in clustered samples, extending previous results for simple random samples. We discuss some generalizations of this research and its practical implications for data release by statistical agencies.
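The within- and between-imputation variance components discussed in this abstract are the building blocks of Rubin's rules, which can be sketched as follows. This is the generic textbook combination, not the clustered-design expressions derived in the paper; the function name and the input values are hypothetical, and the last quantity returned is the simple large-sample approximation to the fraction of missing information.

```python
from statistics import mean, variance

def rubin_pool(estimates, variances):
    """Combine m completed-data estimates and their variances with Rubin's rules."""
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need m >= 2 estimates with matching variances")
    q_bar = mean(estimates)        # pooled point estimate
    within = mean(variances)       # average within-imputation variance
    between = variance(estimates)  # between-imputation variance (m - 1 denominator)
    total = within + (1.0 + 1.0 / m) * between
    # Share of total variance due to missingness (approximate fraction of missing information).
    lam = (1.0 + 1.0 / m) * between / total
    return q_bar, within, between, total, lam

# Hypothetical mean estimates and their variances from m = 5 imputed datasets.
est = [10.0, 10.4, 9.8, 10.2, 10.1]
var = [0.25, 0.24, 0.26, 0.25, 0.25]
q_bar, within, between, total, lam = rubin_pool(est, var)
print(f"estimate={q_bar:.2f} W={within:.3f} B={between:.3f} T={total:.3f} lambda={lam:.3f}")
```

When the within-imputation variance is large relative to the between-imputation variance, as happens with large design effects, the extra term (1 + 1/m)·B adds little to the total, which is consistent with the empirical observation the abstract sets out to explain.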
This study is aimed at variance computation techniques for estimates of population characteristics based on survey sampling and imputation. We use the superpopulation regression model, which means that the target variable values for each statistical unit are treated as random realizations of a linear regression model with weighted variance. We focus on regression models with one auxiliary variable and no intercept, which have many applications and a straightforward interpretation in business statistics. Furthermore, we deal with cases where the estimates are not independent and thus the covariance must be computed. We also consider chained regression models with auxiliary variables as random variables instead of constants.
Stevens, June; Ou, Fang-Shu; Truesdale, Kimberly P.; Zeng, Donglin; Vaughn, Amber E.; Pratt, Charlotte; Ward, Dianne S.
Background: Parent-reported 24-h diet recalls are an accepted method of estimating intake in young children. However, many children eat while at childcare, making accurate proxy reports by parents difficult. Objective: The goal of this study was to demonstrate a method to impute missing weekday lunch and daytime snack nutrient data for daycare children and to explore the concurrent predictive and criterion validity of the method. Design: Data were from children aged 2-5 years in the My Parenting...
Pedersen, Alma B; Mikkelsen, Ellen M; Cronin-Fenton, Deirdre; Kristensen, Nickolaj R; Pham, Tra My; Pedersen, Lars; Petersen, Irene
Missing data are ubiquitous in clinical epidemiological research. Individuals with missing data may differ from those with no missing data in terms of the outcome of interest and prognosis in general. Missing data are often categorized into the following three types: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). In clinical epidemiological research, missing data are seldom MCAR. Missing data can constitute considerable challenges in the analyses and interpretation of results and can potentially weaken the validity of results and conclusions. A number of methods have been developed for dealing with missing data. These include complete-case analyses, missing indicator method, single value imputation, and sensitivity analyses incorporating worst-case and best-case scenarios. If applied under the MCAR assumption, some of these methods can provide unbiased but often less precise estimates. Multiple imputation is an alternative method to deal with missing data, which accounts for the uncertainty associated with missing data. Multiple imputation is implemented in most statistical software under the MAR assumption and provides unbiased and valid estimates of associations based on information from the available data. The method affects not only the coefficient estimates for variables with missing data but also the estimates for other variables with no missing data.
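The way multiple imputation carries the uncertainty from missing data into the final estimate can be illustrated with Rubin's rules for pooling results across imputed data sets. The following is a minimal sketch, not taken from the abstract above: the function name `pool_rubin` and the example numbers are hypothetical, and the sketch assumes that each of the m imputed data sets has already been analyzed to give one point estimate and one squared standard error.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool m per-imputation results with Rubin's rules.

    estimates: point estimates, one per imputed data set
    variances: squared standard errors, one per imputed data set
    Returns (pooled estimate, total variance).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations with matching lengths")
    q_bar = mean(estimates)        # pooled point estimate
    within = mean(variances)       # average within-imputation variance
    between = variance(estimates)  # between-imputation variance (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return q_bar, total


# Hypothetical results from three imputed data sets
pooled, total_var = pool_rubin([1.0, 1.2, 0.8], [0.04, 0.05, 0.03])
print(pooled, total_var ** 0.5)  # pooled estimate and its standard error
```

The between-imputation term is what a single imputation omits, which is why single-value imputation tends to understate standard errors.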
Dunning, Alison M; Healey, Catherine S; Baynes, Caroline
We have conducted a three-stage, comprehensive single nucleotide polymorphism (SNP)-tagging association study of ESR1 gene variants (SNPs) in more than 55,000 breast cancer cases and controls from studies within the Breast Cancer Association Consortium (BCAC). No large risks or highly significant...
Mar 7, 2011 ... Mastitis is a major cause of economic loss in dairy cattle. In this study, the bovine CACNA2D1 gene was taken as a candidate gene for mastitis resistance. The objective of this study was to identify single nucleotide polymorphisms (SNPs) in the bovine CACNA2D1 gene and evaluate the association of these.
The prognostic improvement attributed to genetic markers over the current prognostic system has not been well studied for melanoma. The goal of this study is to evaluate the added prognostic value of Vitamin D Pathway (VitD) SNPs to currently known clinical and demographic factors such as age, sex, Breslow thickness, mitosis and ulceration (CDF). We utilized two large independent well-characterized melanoma studies, the Genes, Environment, and Melanoma (GEM) and MD Anderson studies, and performed variable selection of VitD pathway SNPs and CDF using the Random Survival Forest (RSF) method in addition to Cox proportional hazards models. Harrell's C-index was used to compare the predictive performance of the models. The population-based GEM study enrolled 3,578 incident cases of cutaneous melanoma (CM), and the hospital-based MD Anderson study consisted of 1,804 CM patients. Including both VitD SNPs and CDF yielded a C-index of 0.85, which provided a slight but not significant improvement over CDF alone (C-index = 0.83) in the GEM study. Similar results were observed in the independent MD Anderson study (C-index = 0.84 and 0.83, respectively). The Cox model identified no significant associations after adjusting for multiplicity. Our results do not support clinically significant prognostic improvements attributable to VitD pathway SNPs over the current prognostic system for melanoma survival.
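For readers unfamiliar with the C-index used above to compare models, the following is a minimal sketch of Harrell's concordance index for right-censored survival data. It is an illustration only, not the authors' code: the function name `harrell_c` and the toy data are hypothetical, and pairs with tied survival times are simply skipped, which is a simplification.

```python
def harrell_c(times, events, risk_scores):
    """Harrell's C: fraction of comparable pairs ordered correctly.

    times: observed follow-up times
    events: 1 if the event was observed, 0 if censored
    risk_scores: model output, higher meaning higher predicted risk
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when subject i had an observed event
            # strictly before subject j's follow-up time.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical data: four patients, the third one censored
print(harrell_c([2, 4, 6, 8], [1, 1, 0, 1], [0.9, 0.7, 0.8, 0.1]))
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, so the reported difference between 0.83 and 0.85 is small on this scale.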
Supplementary data, J. Genet. 87, xx-xx.
Table 2a. SNPs of PSORS1 (HLA-C; NCBI Build 127).
SNP          Heterozygosity (NCBI)   Nucleotide change   Location   Wild type base (HLA-C)   Wild type base (HLA-B)
rs7767581    0.407                   C>G                 exon 1     C                        G
rs2308538    0.442                   G>C/T               exon 2     G                        G
rs11547357   0.476                   A>C                 exon 2     A                        A
rs16895963   0.500
Eskelson, Bianca N.I.; Hagar, Joan; Temesgen, Hailemariam
of large snags than the RF imputation approach. Adjusting the decision threshold to account for unequal size for presence and absence classes is more straightforward for the logistic regression than for the RF imputation approach. Overall, model accuracies were poor in this study, which can be attributed to the poor predictive quality of the explanatory variables and the large range of forest types and geographic conditions observed in the data.
Table 1. List of SNPs within the ESR1 gene that were positively associated with diseases or quantitative traits in association studies. SNP ..... Al-Hendy A. and Salama S. A. 2006 Ethnic distribution of estrogen receptor-alpha polymorphism is associated with a higher prevalence of uterine leiomyomas in black Americans. Fertil.
Table 1. List of SNPs within the ESR1 gene that were positively associated with diseases or quantitative traits in association studies. SNP. Disease class. Normal variation. Orthopedic. Metabolic. Endocrinologic. CAD. Immune. Cancer. Neurologic. Psychologic. Reproduction. Gynecologic. Other. rs851993. Ichikawa et al. 2010.
Thomas J Hoffmann
An efficient approach to characterizing the disease burden of rare genetic variants is to impute them into large well-phenotyped cohorts with existing genome-wide genotype data using large sequenced reference panels. The success of this approach hinges on the accuracy of rare variant imputation, which remains controversial. For example, a recent study suggested that one cannot adequately impute the HOXB13 G84E mutation associated with prostate cancer risk (carrier frequency of 0.0034 in European ancestry participants in the 1000 Genomes Project). We show that by utilizing the 1000 Genomes Project data plus an enriched reference panel of mutation carriers we were able to accurately impute the G84E mutation into a large cohort of 83,285 non-Hispanic White participants from the Kaiser Permanente Research Program on Genes, Environment and Health Genetic Epidemiology Research on Adult Health and Aging cohort. Imputation authenticity was confirmed via a novel classification and regression tree method, and then empirically validated by analyzing a subset of these subjects plus an additional 1,789 men from Kaiser specifically genotyped for the G84E mutation (r² = 0.57, 95% CI = 0.37–0.77). We then show the value of this approach by using the imputed data to investigate the impact of the G84E mutation on age-specific prostate cancer risk and on risk of fourteen other cancers in the cohort. The age-specific risk of prostate cancer among G84E mutation carriers was higher than among non-carriers. Risk estimates from Kaplan-Meier curves were 36.7% versus 13.6% by age 72, and 64.2% versus 24.2% by age 80, for G84E mutation carriers and non-carriers, respectively (p = 3.4 × 10⁻¹²). The G84E mutation was also associated with an increase in risk for the fourteen other most common cancers considered collectively (p = 5.8 × 10⁻⁴) and more so in cases diagnosed with multiple cancer types, both those including and not including prostate cancer, strongly suggesting
Bush William S
Background: Gene-centric analysis tools for genome-wide association study data are being developed both to annotate single locus statistics and to prioritize or group single nucleotide polymorphisms (SNPs) prior to analysis. These approaches require knowledge about the relationships between SNPs on a genotyping platform and genes in the human genome. SNPs in the genome can represent broader genomic regions via linkage disequilibrium (LD), and population-specific patterns of LD can be exploited to generate a data-driven map of SNPs to genes. Methods: In this study, we implemented LD-Spline, a database routine that defines the genomic boundaries a particular SNP represents using linkage disequilibrium statistics from the International HapMap Project. We compared the LD-Spline haplotype block partitioning approach to that of the four-gamete rule and the Gabriel et al. approach using simulated data; in addition, we processed two commonly used genome-wide association study platforms. Results: We illustrate that LD-Spline performs comparably to the four-gamete rule and the Gabriel et al. approach; however, as a SNP-centric approach, LD-Spline has the added benefit of systematically identifying a genomic boundary for each SNP, where the global block partitioning approaches may falter due to sampling variation in LD statistics. Conclusion: LD-Spline is an integrated database routine that quickly and effectively defines the genomic region marked by a SNP using linkage disequilibrium, with a SNP-centric block definition algorithm.
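The four-gamete rule used as a comparator above can be sketched in a few lines. This is not LD-Spline and not the authors' implementation; it is only a minimal illustration of the test itself, assuming phased haplotypes at biallelic sites coded as strings of '0' and '1'. The function name `four_gamete_violation` and the example haplotypes are hypothetical.

```python
def four_gamete_violation(haplotypes, i, j):
    """Return True if all four two-locus gametes occur at sites i and j.

    Under an infinite-sites model without recurrent mutation, seeing all
    four combinations (00, 01, 10, 11) implies a historical recombination
    between the two sites, so they would not fall in the same block.
    """
    gametes = {(h[i], h[j]) for h in haplotypes}
    return len(gametes) == 4


# Hypothetical phased haplotypes over three SNPs
haps = ["000", "011", "101", "110", "001"]
print(four_gamete_violation(haps, 0, 1))  # sites 0 and 1
print(four_gamete_violation(haps, 1, 2))  # sites 1 and 2
```

Because the test depends on whether a rare fourth gamete happens to be sampled, block boundaries defined this way can shift between samples, which is the kind of sampling variation the abstract alludes to.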
Understanding the cell-specific binding patterns of transcription factors (TFs) is fundamental to studying gene regulatory networks in biological systems, for which ChIP-seq not only provides valuable data but is also considered the gold standard. Despite tremendous efforts from the scientific community to conduct TF ChIP-seq experiments, the available data represent only a limited percentage of ChIP-seq experiments, considering all possible combinations of TFs and cell lines. In this study, we demonstrate a method for accurately predicting cell-specific TF binding for TF-cell line combinations based on only a small fraction (4%) of the combinations using available ChIP-seq data. The proposed model, termed TFImpute, is based on a deep neural network with a multi-task learning setting to borrow information across transcription factors and cell lines. Compared with existing methods, TFImpute achieves comparable accuracy on TF-cell line combinations with ChIP-seq data; moreover, TFImpute achieves better accuracy on TF-cell line combinations without ChIP-seq data. This approach can predict cell line specific enhancer activities in K562 and HepG2 cell lines, as measured by massively parallel reporter assays, and predicts the impact of SNPs on TF binding.
Roth, Philip L; Le, Huy; Oh, In-Sue; Van Iddekinge, Chad H; Bobko, Philip
Meta-analysis has become a well-accepted method for synthesizing empirical research about a given phenomenon. Many meta-analyses focus on synthesizing correlations across primary studies, but some primary studies do not report correlations. Peterson and Brown (2005) suggested that researchers could use standardized regression weights (i.e., beta coefficients) to impute missing correlations. Indeed, their beta estimation procedures (BEPs) have been used in meta-analyses in a wide variety of fields. In this study, the authors evaluated the accuracy of BEPs in meta-analysis. We first examined how use of BEPs might affect results from a published meta-analysis. We then developed a series of Monte Carlo simulations that systematically compared the use of existing correlations (that were not missing) to data sets that incorporated BEPs (that impute missing correlations from corresponding beta coefficients). These simulations estimated ρ̄ (mean population correlation) and SDρ (true standard deviation) across a variety of meta-analytic conditions. Results from both the existing meta-analysis and the Monte Carlo simulations revealed that BEPs were associated with potentially large biases when estimating ρ̄ and even larger biases when estimating SDρ. Using only existing correlations often substantially outperformed use of BEPs and virtually never performed worse than BEPs. Overall, the authors urge a return to the standard practice of using only existing correlations in meta-analysis. (PsycINFO Database Record (c) 2018 APA, all rights reserved).
Agelink van Rentergem, Joost A; de Vent, Nathalie R; Schmand, Ben A; Murre, Jaap M J; Huizenga, Hilde M
Neuropsychologists administer neuropsychological tests to decide whether a patient is cognitively impaired. This clinical decision is made by comparing a patient's scores to those of healthy participants in a normative sample. In a multivariate normative comparison, a patient's entire profile of scores is compared to scores in a normative sample. Such a multivariate comparison has been shown to improve clinical decision making. However, it requires a multivariate normative data set, which often is unavailable. To obtain such a multivariate normative data set, the authors propose to aggregate healthy control group data from existing neuropsychological studies. As not all studies administered the same tests, this aggregated database will contain substantial amounts of missing data. The authors therefore propose two solutions: multiple imputation and factor modeling. Simulation studies show that factor modeling is preferred over multiple imputation, provided that the factor model is adequately specified. This factor modeling approach will therefore allow routine use of multivariate normative comparisons, enabling more accurate clinical decision making. (PsycINFO Database Record (c) 2017 APA, all rights reserved).
Sullivan, Thomas R; White, Ian R; Salter, Amy B; Ryan, Philip; Lee, Katherine J
The use of multiple imputation has increased markedly in recent years, and journal reviewers may expect to see multiple imputation used to handle missing data. However in randomized trials, where treatment group is always observed and independent of baseline covariates, other approaches may be preferable. Using data simulation we evaluated multiple imputation, performed both overall and separately by randomized group, across a range of commonly encountered scenarios. We considered both missing outcome and missing baseline data, with missing outcome data induced under missing at random mechanisms. Provided the analysis model was correctly specified, multiple imputation produced unbiased treatment effect estimates, but alternative unbiased approaches were often more efficient. When the analysis model overlooked an interaction effect involving randomized group, multiple imputation produced biased estimates of the average treatment effect when applied to missing outcome data, unless imputation was performed separately by randomized group. Based on these results, we conclude that multiple imputation should not be seen as the only acceptable way to handle missing data in randomized trials. In settings where multiple imputation is adopted, we recommend that imputation is carried out separately by randomized group.
Verma, Shefali S; de Andrade, Mariza; Tromp, Gerard; Kuivaniemi, Helena; Pugh, Elizabeth; Namjou-Khales, Bahram; Mukherjee, Shubhabrata; Jarvik, Gail P; Kottyan, Leah C; Burt, Amber; Bradford, Yuki; Armstrong, Gretta D; Derr, Kimberly; Crawford, Dana C; Haines, Jonathan L; Li, Rongling; Crosslin, David; Ritchie, Marylyn D
The electronic MEdical Records and GEnomics (eMERGE) network brings together DNA biobanks linked to electronic health records (EHRs) from multiple institutions. Approximately 51,000 DNA samples from distinct individuals have been genotyped using genome-wide SNP arrays across the nine sites of the network. The eMERGE Coordinating Center and the Genomics Workgroup developed a pipeline to impute and merge genomic data across the different SNP arrays to maximize sample size and power to detect associations with a variety of clinical endpoints. The 1000 Genomes cosmopolitan reference panel was used for imputation. Imputation results were evaluated using the following metrics: accuracy of imputation, allelic R² (estimated correlation between the imputed and true genotypes), and the relationship between allelic R² and minor allele frequency. Computation time and memory resources required by two different software packages (BEAGLE and IMPUTE2) were also evaluated. A number of challenges were encountered due to the complexity of using two different imputation software packages, multiple ancestral populations, and many different genotyping platforms. We present lessons learned and describe the pipeline implemented here to impute and merge genomic data sets. The eMERGE imputed dataset will serve as a valuable resource for discovery, leveraging the clinical data that can be mined from the EHR.
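The allelic R² metric mentioned above is, in essence, a squared correlation between imputed and true allele counts. The following minimal sketch computes it directly for one SNP, assuming the true genotypes are known (for example, from genotypes masked before imputation). Imputation software estimates this quantity without access to the truth, which this sketch does not attempt. The function name `allelic_r2` and the example values are hypothetical.

```python
from statistics import mean


def allelic_r2(true_genotypes, imputed_dosages):
    """Squared Pearson correlation between true allele counts (0, 1, 2)
    and imputed allele dosages (expected allele counts in [0, 2])."""
    if len(true_genotypes) != len(imputed_dosages):
        raise ValueError("inputs must have the same length")
    mx = mean(true_genotypes)
    my = mean(imputed_dosages)
    sxy = sum((x - mx) * (y - my) for x, y in zip(true_genotypes, imputed_dosages))
    sxx = sum((x - mx) ** 2 for x in true_genotypes)
    syy = sum((y - my) ** 2 for y in imputed_dosages)
    if sxx == 0 or syy == 0:
        raise ValueError("R^2 is undefined for a monomorphic SNP")
    return sxy ** 2 / (sxx * syy)


# Hypothetical SNP genotyped in six samples
truth = [0, 1, 2, 1, 0, 0]
dosage = [0.1, 0.9, 1.8, 1.2, 0.0, 0.3]
print(allelic_r2(truth, dosage))
```

For rare variants nearly all true genotypes are 0, so the variance term in the denominator is small and a few misimputed carriers can move R² substantially, which is one reason the metric is examined against minor allele frequency.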
Shefali S Verma
The electronic MEdical Records and GEnomics (eMERGE) network brings together DNA biobanks linked to electronic health records (EHRs) from multiple institutions. Approximately 52,000 DNA samples from distinct individuals have been genotyped using genome-wide SNP arrays across the nine sites of the network. The eMERGE Coordinating Center and the Genomics Workgroup developed a pipeline to impute and merge genomic data across the different SNP arrays to maximize sample size and power to detect associations with a variety of clinical endpoints. The 1000 Genomes cosmopolitan reference panel was used for imputation. Imputation results were evaluated using the following metrics: accuracy of imputation, allelic R² (estimated correlation between the imputed and true genotypes), and the relationship between allelic R² and minor allele frequency. Computation time and memory resources required by two different software packages (BEAGLE and IMPUTE2) were also evaluated. A number of challenges were encountered due to the complexity of using two different imputation software packages, multiple ancestral populations, and many different genotyping platforms. We present lessons learned and describe the pipeline implemented here to impute and merge genomic data sets. The eMERGE imputed dataset will serve as a valuable resource for discovery, leveraging the clinical data that can be mined from the EHR.
Viengkone, Michelle; Derocher, Andrew Edward; Richardson, Evan Shaun; Malenfant, René Michael; Miller, Joshua Moses; Obbard, Martyn E; Dyck, Markus G; Lunn, Nick J; Sahanatien, Vicki; Davis, Corey S
Defining subpopulations using genetics has traditionally used data from microsatellite markers to investigate population structure; however, single-nucleotide polymorphisms (SNPs) have emerged as a tool for detection of fine-scale structure. In Hudson Bay, Canada, three polar bear (Ursus maritimus) subpopulations (Foxe Basin (FB), Southern Hudson Bay (SH), and Western Hudson Bay (WH)) have been delineated based on mark-recapture studies, radiotelemetry and satellite telemetry, return of marked animals in the subsistence harvest, and population genetics using microsatellites. We used SNPs to detect fine-scale population structure in polar bears from the Hudson Bay region and compared our results to the current designations using 414 individuals genotyped at 2,603 SNPs. Analyses based on discriminant analysis of principal components (DAPC) and STRUCTURE support the presence of four genetic clusters: (i) Western: including individuals sampled in WH, SH (excluding Akimiski Island in James Bay), and southern FB (south of Southampton Island); (ii) Northern: individuals sampled in northern FB (Baffin Island) and Davis Strait (DS) (Labrador coast); (iii) Southeast: individuals from SH (Akimiski Island in James Bay); and (iv) Northeast: individuals from DS (Baffin Island). Population structure differed from microsatellite studies and current management designations, demonstrating the value of using SNPs for fine-scale population delineation in polar bears.
Gao, Yangchun; Li, Shiguo; Zhan, Aibin
Invasive species cause huge damage to ecology, environment and economy globally. A comprehensive understanding of invasion mechanisms, particularly the genetic bases of micro-evolutionary processes responsible for invasion success, is essential for reducing potential damage caused by invasive species. The golden star tunicate, Botryllus schlosseri, has become a model species in invasion biology, mainly owing to its highly invasive nature and small, well-sequenced genome. However, genome-wide genetic markers have not been well developed in this highly invasive species, thus limiting the comprehensive understanding of genetic mechanisms of invasion success. Using restriction site-associated DNA (RAD) tag sequencing, here we developed a high-quality resource of 14,119 out of 158,821 SNPs for B. schlosseri. These SNPs were relatively evenly distributed across each chromosome. SNP annotations showed that the majority of SNPs (63.20%) were located in intergenic regions, and 21.51% and 14.58% were located in introns and exons, respectively. In addition, the potential use of the developed SNPs for population genomics studies was preliminarily assessed, such as the estimation of observed heterozygosity (H_O), expected heterozygosity (H_E), nucleotide diversity (π), Wright's inbreeding coefficient (F_IS) and effective population size (N_e). Our developed SNP resource will provide future studies with genome-wide genetic markers for genetic and genomic investigations, such as the genetic bases of micro-evolutionary processes responsible for invasion success.
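Three of the summary statistics listed above (observed heterozygosity, expected heterozygosity and Wright's inbreeding coefficient) can be computed from genotype counts at a single biallelic SNP. The following is a minimal sketch under that assumption, without any small-sample correction. The function name `snp_diversity` and the counts are hypothetical, and this is not the software the authors used.

```python
def snp_diversity(n_AA, n_Aa, n_aa):
    """Return (H_O, H_E, F_IS) for one biallelic SNP.

    H_O  = proportion of heterozygous individuals
    H_E  = 2p(1 - p), expected under Hardy-Weinberg equilibrium
    F_IS = 1 - H_O / H_E (positive values indicate heterozygote deficit)
    """
    n = n_AA + n_Aa + n_aa
    if n == 0:
        raise ValueError("no genotyped individuals")
    h_obs = n_Aa / n
    p = (2 * n_AA + n_Aa) / (2 * n)  # frequency of allele A
    h_exp = 2 * p * (1 - p)
    if h_exp == 0:
        raise ValueError("F_IS is undefined for a monomorphic SNP")
    return h_obs, h_exp, 1 - h_obs / h_exp


# Hypothetical genotype counts for one SNP in 100 individuals
h_o, h_e, f_is = snp_diversity(40, 20, 40)
print(h_o, h_e, f_is)
```

Genome-wide values are typically obtained by averaging such per-SNP quantities over all loci, though the exact estimator varies between software packages.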
Edwards, Jessie K; Cole, Stephen R; Westreich, Daniel; Crane, Heidi; Eron, Joseph J; Mathews, W Christopher; Moore, Richard; Boswell, Stephen L; Lesko, Catherine R; Mugavero, Michael J
Marginal structural models are an important tool for observational studies. These models typically assume that variables are measured without error. We describe a method to account for differential and nondifferential measurement error in a marginal structural model. We illustrate the method estimating the joint effects of antiretroviral therapy initiation and current smoking on all-cause mortality in a United States cohort of 12,290 patients with HIV followed for up to 5 years between 1998 and 2011. Smoking status was likely measured with error, but a subset of 3,686 patients who reported smoking status on separate questionnaires composed an internal validation subgroup. We compared a standard joint marginal structural model fit using inverse probability weights to a model that also accounted for misclassification of smoking status using multiple imputation. In the standard analysis, current smoking was not associated with increased risk of mortality. After accounting for misclassification, current smoking without therapy was associated with increased mortality (hazard ratio [HR]: 1.2 [95% confidence interval [CI] = 0.6, 2.3]). The HR for current smoking and therapy [0.4 (95% CI = 0.2, 0.7)] was similar to the HR for no smoking and therapy (0.4; 95% CI = 0.2, 0.6). Multiple imputation can be used to account for measurement error in concert with methods for causal inference to strengthen results from observational studies.
Liu, Benmei; Yu, Mandi; Graubard, Barry I; Troiano, Richard P; Schenker, Nathaniel
The Physical Activity Monitor component was introduced into the 2003-2004 National Health and Nutrition Examination Survey (NHANES) to collect objective information on physical activity including both movement intensity counts and ambulatory steps. Because of an error in the accelerometer device initialization process, the steps data were missing for all participants in several primary sampling units, typically a single county or group of contiguous counties, who had intensity count data from their accelerometers. To avoid potential bias and loss in efficiency in estimation and inference involving the steps data, we considered methods to accurately impute the missing values for steps collected in the 2003-2004 NHANES. The objective was to come up with an efficient imputation method that minimized model-based assumptions. We adopted a multiple imputation approach based on additive regression, bootstrapping and predictive mean matching methods. This method fits alternating conditional expectation (ACE) models, which use an automated procedure to estimate optimal transformations for both the predictor and response variables. This paper describes the approaches used in this imputation and evaluates the methods by comparing the distributions of the original and the imputed data. A simulation study using the observed data is also conducted as part of the model diagnostics. Finally, some real data analyses are performed to compare the before and after imputation results. Published 2016. This article is a U.S. Government work and is in the public domain in the USA.
Ondeck, Nathaniel T; Fu, Michael C; Skrip, Laura A; McLynn, Ryan P; Su, Edwin P; Grauer, Jonathan N
Despite the advantages of large, national datasets, one continuing concern is missing data values. Complete case analysis, where only cases with complete data are analyzed, is commonly used rather than more statistically rigorous approaches such as multiple imputation. This study characterizes the potential selection bias introduced using complete case analysis and compares the results of common regressions using both techniques following unicompartmental knee arthroplasty. Patients undergoing unicompartmental knee arthroplasty were extracted from the 2005 to 2015 National Surgical Quality Improvement Program. As examples, the demographics of patients with and without missing preoperative albumin and hematocrit values were compared. Missing data were then treated with both complete case analysis and multiple imputation (an approach that reproduces the variation and associations that would have been present in a full dataset) and the conclusions of common regressions for adverse outcomes were compared. A total of 6117 patients were included, of which 56.7% were missing at least one value. Younger, female, and healthier patients were more likely to have missing preoperative albumin and hematocrit values. The use of complete case analysis removed 3467 patients from the study in comparison with multiple imputation which included all 6117 patients. The 2 methods of handling missing values led to differing associations of low preoperative laboratory values with commonly studied adverse outcomes. The use of complete case analysis can introduce selection bias and may lead to different conclusions in comparison with the statistically rigorous multiple imputation approach. Joint surgeons should consider the methods of handling missing values when interpreting arthroplasty research. Copyright © 2017 Elsevier Inc. All rights reserved.
Pummelo cultivars are usually difficult to identify morphologically, especially when fruits are unavailable. The problem was addressed in this study with the use of two methods: high resolution melting analysis of SNPs and sequencing of DNA segments. In the first method, a set of 25 SNPs with high polymorphic information content were selected from SNPs predicted by analyzing ESTs and sequenced DNA segments. High resolution melting analysis was then used to genotype 260 accessions including 55 from Myanmar, and 178 different genotypes were thus identified. A total of 99 cultivars were assigned to 86 different genotypes since the known somatic mutants were identical to their original genotypes at the analyzed SNP loci. The Myanmar samples were genotypically different from each other and from all other samples, indicating they were derived from sexual propagation. Statistical analysis showed that the set of SNPs was powerful enough for identifying at least 1000 pummelo genotypes, though the discrimination power varied in different pummelo groups and populations. In the second method, 12 genomic DNA segments of 24 representative pummelo accessions were sequenced. Analysis of the sequences revealed the existence of a high haplotype polymorphism in pummelo, and statistical analysis showed that the segments could be used as genetic barcodes that should be informative enough to allow reliable identification of 1200 pummelo cultivars. The high level of haplotype diversity and an apparent population structure shown by DNA segments and by SNP genotypes, respectively, were discussed in relation to the origin and domestication of the pummelo species.
Ahmad, Taha; Sabet, Samie; Primerano, Donald A; Richards-Waugh, Lauren L; Rankin, Gary O
Cytochrome P450 (CYP) enzyme 2B6 plays a significant role in the stereo-selective metabolism of (S)-methadone to 2-ethyl-1,5-dimethyl-3,3-diphenylpyrrolidine, an inactive methadone metabolite. Elevated (S)-methadone can cause cardiotoxicity by prolonging the QT interval of the heart's electrical cycle. Large inter-individual variability of methadone pharmacokinetics causes discordance in the relationship between dose, plasma concentrations and side effects. The purpose of this study was to determine if one or more single nucleotide polymorphisms (SNPs) located within the CYP2B6 gene contributes to a poor metabolizer phenotype for methadone in these fatal cases. The genetic analysis was conducted on 125 Caucasian methadone-only fatalities obtained from the West Virginia and Kentucky Offices of the Chief Medical Examiner. The frequency of eight exonic and intronic SNPs (rs2279344, rs3211371, rs3745274, rs4803419, rs8192709, rs8192719, rs12721655 and rs35979566) was determined. The frequencies of SNPs rs3745274 (*9, c516G > T, Q172H), and rs8192719 (21563 C > T) were enhanced in the methadone-only group. Higher blood methadone concentrations were observed in individuals who were genotyped homozygous for SNP rs3211371 (*5, c1459C > T, R487C). These results indicate that these three CYP2B6 SNPs are associated with methadone fatalities. © The Author 2017. Published by Oxford University Press. All rights reserved.
The purpose of this study was to evaluate the performance of the multiple imputation (MI) method, from the perspective of the general linear mixed model, when the missing observation structure is missing at random or missing completely at random. The application data consisted of a total of 77 Norduz ram lambs at 7 months of age. After slaughter, pH values measured at five different time points were used as the dependent variable. In addition, hot carcass weight, muscle glycogen level and fasting duration were included as independent variables in the model. Starting from the dependent variable without missing observations, two missing observation structures, Missing Completely at Random (MCAR) and Missing at Random (MAR), were created by deleting observations at certain rates (10% and 25%). Complete data sets were then obtained from the data sets with missing observations using MI. The results obtained by applying the general linear mixed model to the data sets completed using MI were compared with the results for the complete data. In the mixed models applied to the complete data and to the MI data sets, the covariance structures were the same, and the parameter estimates and standard error estimates were rather close to those of the complete data. In conclusion, this study showed that reliable information was obtained from the mixed model when MI was chosen as the imputation method, for both missing observation structures and both missingness rates.
Background: Type 2 diabetes mellitus (T2D), a metabolic disorder characterized by insulin resistance and relative insulin deficiency, is a complex disease of major public health importance. Its incidence is rapidly increasing in developed countries. Complex diseases are caused by interactions between multiple genes and environmental factors. Most association studies aim to identify individual susceptibility markers using a simple disease model. Recent studies have tried to estimate the effects of multiple genes and multiple loci in genome-wide association. However, estimating these effects is very difficult. We aim to assess rules for classifying diseased and normal subjects by evaluating potential gene-gene interactions in the same or distinct biological pathways. Results: We analyzed the importance of gene-gene interactions in T2D susceptibility by investigating 408 single nucleotide polymorphisms (SNPs) in 87 genes involved in major T2D-related pathways in 462 T2D patients and 456 healthy controls from the Korean cohort studies. We evaluated the support vector machine (SVM) method to differentiate between cases and controls using SNP information in a 10-fold cross-validation test. We achieved a 65.3% prediction rate with a combination of 14 SNPs in 12 genes by using the radial basis function (RBF)-kernel SVM. Similarly, we investigated subpopulation data sets of men and women and identified different SNP combinations with prediction rates of 70.9% and 70.6%, respectively. As the high-throughput technology for genome-wide SNPs improves, it is likely that a much higher prediction rate with a biologically more interesting combination of SNPs can be acquired by using this method. Conclusions: The support vector machine based feature selection method in this research found a novel association between combinations of SNPs and T2D in a Korean population.
Background: Genome-wide association studies are useful for discovering genotype–phenotype associations but are limited because they require large cohorts to identify a signal, which can be population-specific. Mapping genetic variation to genes improves power and allows the effects of both protein-coding variation as well as variation in expression to be combined into “gene level” effects. Methods: Previous work has shown that warfarin dose can be predicted using information from genetic variation that affects protein-coding regions. Here, we introduce a method that improves dose prediction by integrating tissue-specific gene expression. In particular, we use drug pathways and expression quantitative trait loci knowledge to impute gene expression, on the assumption that differential expression of key pathway genes may impact dose requirement. We focus on 116 genes from the pharmacokinetic and pharmacodynamic pathways of warfarin within training and validation sets comprising both European and African-descent individuals. Results: We build gene-tissue signatures associated with warfarin dose in a cohort-specific manner and identify a signature of 11 gene-tissue pairs that significantly augments the International Warfarin Pharmacogenetics Consortium dosage-prediction algorithm in both populations. Conclusions: Our results demonstrate that imputed expression can improve dose prediction and bridge population-specific compositions. MATLAB code is available at https://github.com/assafgo/warfarin-cohort
Khan, S. Sudheer; Srivatsan, P.; Vaishnavi, N.; Mukherjee, Amitava; Chandrasekaran, N.
Highlights: → Bacterial extracellular proteins stabilize the silver nanoparticles. → Adsorption process varies with pH and salt concentration of the interaction medium. → Adsorption process was strongly influenced by surface charge. → Adsorption equilibrium isotherms was fitted well by the Freundlich model. → Kinetics of adsorption was fitted by pseudo-second-order. -- Abstract: Indiscriminate and increased use of silver nanoparticles (SNPs) in consumer products leads to the release of it into the environment. The fate and transport of SNPs in environment remains unknown. We have studied the interaction of SNPs with extracellular protein (ECP) produced by two environmental bacterial species and the adsorption behavior in aqueous solutions. The effect of pH and salt concentrations on the adsorption was also investigated. The adsorption process was found to be dependent on surface charge (zeta potential). The capping of SNPs by ECP was confirmed by Fourier transform infrared spectroscopy and X-ray diffraction. The adsorption of ECP on SNPs was analyzed by Langmuir and Freundlich models, suggesting that the equilibrium adsorption data fitted well with Freundlich model. The equilibrium adsorption data were modeled using the pseudo-first-order and pseudo-second-order kinetic equations. The results indicated that pseudo-second-order kinetic equation would better describe the adsorption kinetics. The capping was stable at environmental pH and salt concentration. The destabilization of nanoparticles was observed at alkaline pH. The study suggests that the stabilization of nanoparticles in the environment might lead to the accumulation and transport of nanomaterials in the environment, and ultimately destabilizes the functioning of the ecosystem.
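For readers unfamiliar with the models named in this abstract, their standard textbook forms are given below. The symbols are not taken from the abstract and are assumptions of this note: q_e and q_t are the amounts adsorbed at equilibrium and at time t, C_e is the equilibrium concentration, q_max, K_L, K_F and n are isotherm constants, and k_1, k_2 are rate constants.

```latex
% Langmuir isotherm (monolayer adsorption on a homogeneous surface)
\[ q_e = \frac{q_{\max} K_L C_e}{1 + K_L C_e} \]

% Freundlich isotherm and its linearised form used for fitting
\[ q_e = K_F \, C_e^{1/n}, \qquad \ln q_e = \ln K_F + \frac{1}{n} \ln C_e \]

% Pseudo-first-order kinetics (linearised)
\[ \ln (q_e - q_t) = \ln q_e - k_1 t \]

% Pseudo-second-order kinetics (linearised)
\[ \frac{t}{q_t} = \frac{1}{k_2 q_e^{2}} + \frac{t}{q_e} \]
```

In practice the better-fitting model is usually judged from how close the linearised plot is to a straight line, which is consistent with the abstract's statement that the Freundlich and pseudo-second-order forms described the data best.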
Kaplan, David; Su, Dan
This article presents findings on the consequences of matrix sampling of context questionnaires for the generation of plausible values in large-scale assessments. Three studies are conducted. Study 1 uses data from PISA 2012 to examine several different forms of missing data imputation within the chained equations framework: predictive mean…
Chang, Hsueh-Wei; Yang, Cheng-Hong; Chang, Phei-Lang; Cheng, Yu-Huei; Chuang, Li-Yeh
Abstract Background The restriction fragment length polymorphism (RFLP) is a common laboratory method for the genotyping of single nucleotide polymorphisms (SNPs). Here, we describe a web-based software, named SNP-RFLPing, which provides the restriction enzyme for RFLP assays on a batch of SNPs and genes from the human, rat, and mouse genomes. Results Three user-friendly inputs are included: 1) NCBI dbSNP "rs" or "ss" IDs; 2) NCBI Entrez gene ID and HUGO gene name; 3) any formats of SNP-in-se...
Although highly penetrant alleles of BRCA1 and BRCA2 have been shown to predispose to breast cancer, the majority of breast cancer cases are assumed to result from the presence of low-moderate penetrant alleles and environmental carcinogens. Non-synonymous single nucleotide polymorphisms (nsSNPs) are hypothesised to contribute to disease susceptibility and approximately 30 per cent of them are predicted to have a biological significance. In this study, we have applied a bioinformatics-based strategy to identify breast cancer-related nsSNPs from 981 carcinogenesis-related genes expressed in breast tissue. Our results revealed a total of 367 validated nsSNPs, 109 (29.7 per cent) of which are predicted to affect the protein function (functional nsSNPs), suggesting that these nsSNPs are likely to influence the development and homeostasis of breast tissue and hence contribute to breast cancer susceptibility. Sixty-seven of the functional nsSNPs presented as commonly occurring nsSNPs (minor allele frequencies ≥ 5 per cent), representing excellent candidates for breast cancer susceptibility. Additionally, a non-uniform distribution of the common functional nsSNPs among different human populations was observed: 15 nsSNPs were reported to be present in all populations analysed, whereas another set of 15 nsSNPs was specific to particular population(s). We propose that the nsSNPs analysed in this study constitute a unique resource of potential genetic factors for breast cancer susceptibility. Furthermore, the variations in functional nsSNP allele frequencies across major population backgrounds may point to the potential variability of the molecular basis of breast cancer predisposition and treatment response among different human populations.
Elbasyoni, Ibrahim S; Lorenz, A J; Guttieri, M; Frels, K; Baenziger, P S; Poland, J; Akhunov, E
The utilization of DNA molecular markers in plant breeding to maximize selection response via marker-assisted selection (MAS) and genomic selection (GS) has revolutionized plant breeding. A key factor affecting GS applicability is the choice of molecular marker platform. Genotyping-by-sequencing scored SNPs (GBS-scored SNPs) provides a large number of markers, albeit with high rates of missing data. Array scored SNPs are of high quality, but the cost per sample is substantially higher. The objectives of this study were 1) compare GBS-scored SNPs, and array scored SNPs for genomic selection applications, and 2) compare estimates of genomic kinship and population structure calculated using the two marker platforms. SNPs were compared in a diversity panel consisting of 299 hard winter wheat (Triticum aestivum L.) accessions that were part of a multi-year, multi-environments association mapping study. The panel was phenotyped in Ithaca, Nebraska for heading date, plant height, days to physiological maturity and grain yield in 2012 and 2013. The panel was genotyped using GBS-scored SNPs, and array scored SNPs. Results indicate that GBS-scored SNPs is comparable to or better than Array-scored SNPs for genomic prediction application. Both platforms identified the same genetic patterns in the panel where 90% of the lines were classified to common genetic groups. Overall, we concluded that GBS-scored SNPs have the potential to be the marker platform of choice for genetic diversity and genomic selection in winter wheat. Copyright © 2018 Elsevier B.V. All rights reserved.
Saqi Mansoor AS
Background: There has been an explosion in the number of single nucleotide polymorphisms (SNPs) within public databases. In this study we focused on non-synonymous protein coding single nucleotide polymorphisms (nsSNPs), some associated with disease and others which are thought to be neutral. We describe the distribution of both types of nsSNPs using structural and sequence based features and assess the relative value of these attributes as predictors of function using machine learning methods. We also address the common problem of balance within machine learning methods and show the effect of imbalance on nsSNP function prediction. We show that nsSNP function prediction can be significantly improved by 100% undersampling of the majority class. The learnt rules were then applied to make predictions of function on all nsSNPs within Ensembl. Results: The measure of prediction success is greatly affected by the level of imbalance in the training dataset. We found the balanced dataset that included all attributes produced the best prediction. The performance as measured by the Matthews correlation coefficient (MCC) varied between 0.49 and 0.25 depending on the imbalance. As previously observed, the degree of sequence conservation at the nsSNP position is the single most useful attribute. In addition to conservation, structural predictions made using a balanced dataset can be of value. Conclusion: The predictions for all nsSNPs within Ensembl, based on a balanced dataset using all attributes, are available as a DAS annotation. Instructions for adding the track to Ensembl are at http://www.brightstudy.ac.uk/das_help.html
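The two ingredients this abstract relies on, the Matthews correlation coefficient and undersampling of the majority class, can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the helper names `mcc` and `undersample_majority` are hypothetical, and "undersampling" is assumed here to mean randomly dropping majority-class examples until both classes have the same size.

```python
import math
import random

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # common convention when a row or column is empty
    return (tp * tn - fp * fn) / denom

def undersample_majority(labels, rng):
    """Return sorted indices of a class-balanced subset of 0/1 labels."""
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    kept = rng.sample(majority, len(minority))  # random subset of the majority
    return sorted(minority + kept)
```

MCC is a reasonable choice for imbalanced data because, unlike raw accuracy, it stays near zero for a classifier that simply predicts the majority class.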
Gottfredson, Nisha C; Sterba, Sonya K; Jackson, Kristina M
Random coefficient-dependent (RCD) missingness is a non-ignorable mechanism through which missing data can arise in longitudinal designs. RCD, for which we cannot test, is a problematic form of missingness that occurs if subject-specific random effects correlate with propensity for missingness or dropout. Particularly when covariate missingness is a problem, investigators typically handle missing longitudinal data by using single-level multiple imputation procedures implemented with long-format data, which ignores within-person dependency entirely, or implemented with wide-format (i.e., multivariate) data, which ignores some aspects of within-person dependency. When either of these standard approaches to handling missing longitudinal data is used, RCD missingness leads to parameter bias and incorrect inference. We explain why multilevel multiple imputation (MMI) should alleviate bias induced by a RCD missing data mechanism under conditions that contribute to stronger determinacy of random coefficients. We evaluate our hypothesis with a simulation study. Three design factors are considered: intraclass correlation (ICC; ranging from .25 to .75), number of waves (ranging from 4 to 8), and percent of missing data (ranging from 20 to 50%). We find that MMI greatly outperforms the single-level wide-format (multivariate) method for imputation under a RCD mechanism. For the MMI analyses, bias was most alleviated when the ICC is high, there were more waves of data, and when there was less missing data. Practical recommendations for handling longitudinal missing data are suggested.
Chua, Alicia S; Egorova, Svetlana; Anderson, Mark C; Polgar-Turcsanyi, Mariann; Chitnis, Tanuja; Weiner, Howard L; Guttmann, Charles R G; Bakshi, Rohit; Healy, Brian C
Automated segmentation of brain MRI scans into tissue classes is commonly used for the assessment of multiple sclerosis (MS). However, manual correction of the resulting brain tissue label maps by an expert reader remains necessary in many cases. Since automated segmentation data awaiting manual correction are "missing", we proposed to use multiple imputation (MI) to fill-in the missing manually-corrected MRI data for measures of normalized whole brain volume (brain parenchymal fraction-BPF) and T2 hyperintense lesion volume (T2LV). Automated and manually corrected MRI measures from 1300 patients enrolled in the Comprehensive Longitudinal Investigation of Multiple Sclerosis at the Brigham and Women's Hospital (CLIMB) were identified. Simulation studies were conducted to assess the performance of MI with missing data both missing completely at random and missing at random. An imputation model including the concurrent automated data as well as clinical and demographic variables explained a high proportion of the variance in the manually corrected BPF (R(2)=0.97) and T2LV (R(2)=0.89), demonstrating the potential to accurately impute the missing data. Further, our results demonstrate that MI allows for the accurate estimation of group differences with little to no bias and with similar precision compared to an analysis with no missing data. We believe that our findings provide important insights for efficient correction of automated MRI measures to obviate the need to perform manual correction on all cases. Copyright © 2015 Elsevier Inc. All rights reserved.
Eekhout, Iris; de Vet, Henrica C W; Twisk, Jos W R; Brand, Jaap P L; de Boer, Michiel R; Heymans, Martijn W
Regardless of the proportion of missing values, complete-case analysis is most frequently applied, although advanced techniques such as multiple imputation (MI) are available. The objective of this study was to explore the performance of simple and more advanced methods for handling missing data in cases when some, many, or all item scores are missing in a multi-item instrument. Real-life missing data situations were simulated in a multi-item variable used as a covariate in a linear regression model. Various missing data mechanisms were simulated with an increasing percentage of missing data. Subsequently, several techniques to handle missing data were applied to decide on the most optimal technique for each scenario. Fitted regression coefficients were compared using the bias and coverage as performance parameters. Mean imputation caused biased estimates in every missing data scenario when data are missing for more than 10% of the subjects. Furthermore, when a large percentage of subjects had missing items (>25%), MI methods applied to the items outperformed methods applied to the total score. We recommend applying MI to the item scores to get the most accurate regression model estimates. Moreover, we advise not to use any form of mean imputation to handle missing data. Copyright © 2014 Elsevier Inc. All rights reserved.
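One well-known reason mean imputation can distort downstream estimates is that it shrinks the variance of the imputed variable. The sketch below shows this on a toy list; it is an illustration only, with a hypothetical helper name `mean_impute`, and does not reproduce the simulation design of the study above.

```python
import statistics

def mean_impute(values):
    """Replace every missing value (None) with the mean of the observed ones."""
    observed = [v for v in values if v is not None]
    fill = statistics.fmean(observed)
    return [fill if v is None else v for v in values]

# Toy data: five observed scores and five missing ones.
scores = [1.0, 2.0, 3.0, 4.0, 5.0, None, None, None, None, None]
filled = mean_impute(scores)

observed_var = statistics.variance([v for v in scores if v is not None])
filled_var = statistics.variance(filled)
# The mean is preserved, but the spread of the filled-in variable is smaller.
```

Because every imputed value sits exactly at the mean, the filled-in variable looks less variable than it really is, which in turn affects regression coefficients and their standard errors.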
Ramos, A.; Crooijmans, R.P.M.A.; Affara, N.A.; Amaral, A.J.; Kerstens, H.H.D.; Megens, H.J.W.C.; Groenen, M.A.M.
Background: The dissection of complex traits of economic importance to the pig industry requires the availability of a significant number of genetic markers, such as single nucleotide polymorphisms (SNPs). This study was conducted to discover several hundreds of thousands of porcine SNPs using next
Coding/functional SNPs change the biological function of a gene and, therefore, could serve as “large-effect” genetic markers. In this study, we used two bioinformatics pipelines, GATK and SAMtools, for discovering coding/functional SNPs with allelic-imbalances associated with total body weight, mus...
Karambataki, M; Malousi, A; Kouidou, S
Single nucleotide polymorphisms (SNPs) are tentatively critical with regard to disease predisposition, but coding synonymous SNPs (sSNPs) are generally considered "neutral". Nevertheless, sSNPs in serine/arginine-rich (SR) and splice-site (SS) exonic splicing enhancers (ESEs) or in exonic CpG methylation targets, could be decisive for splicing, particularly in aging-related conditions, where mis-splicing is frequently observed. We presently identified 33 genes T2D-related and 28 related to neurodegenerative diseases, by investigating the impact of the corresponding coding sSNPs on splicing and using gene ontology data and computational tools. Potentially critical (prominent) sSNPs comply with the following criteria: changing the splicing potential of prominent SR-ESEs or of significant SS-ESEs by >1.5 units (Δscore), or formation/deletion of ESEs with maximum splicing score. We also noted the formation/disruption of CpGs (tentative methylation sites of epigenetic sSNPs). All disease association studies involving sSNPs are also reported. Only 21/670 coding SNPs, mostly epigenetic, reported in 33 T2D-related genes, were found to be prominent coding synonymous. No prominent sSNPs have been recorded in three key T2D-related genes (GCGR, PPARGC1A, IGF1). Similarly, 20/366 coding synonymous were identified in ND related genes, mostly epigenetic. Meta-analysis showed that 17 of the above prominent sSNPs were previously investigated in association with various pathological conditions. Three out of four sSNPs (all epigenetic) were associated with T2D and one with NDs (branch site sSNP). Five were associated with other or related pathological conditions. None of the four sSNPs introducing new ESEs was found to be disease-associated. sSNPs introducing smaller Δscore changes (<1.5) in key proteins (INSR, IRS1, DISC1) were also correlated to pathological conditions. This data reveals that genetic variation in splicing-regulatory and particularly CpG sites might be related to
... money. 1830.7002-4 Section 1830.7002-4 Federal Acquisition Regulations System NATIONAL AERONAUTICS AND... Determining imputed cost of money. (a) Determine the imputed cost of money for an asset under construction, fabrication, or development by applying a cost of money rate (see 1830.7002-2) to the representative...
Rose, Roderick A.; Fraser, Mark W.
Missing data are nearly always a problem in research, and missing values represent a serious threat to the validity of inferences drawn from findings. Increasingly, social science researchers are turning to multiple imputation to handle missing data. Multiple imputation, in which missing values are replaced by values repeatedly drawn from…
Davey, Adam; Shanahan, Michael J.; Schafer, Joseph L.
Principal components analysis revealed four patterns of nonresponse on children's psychosocial adjustment, lifetime poverty experiences, and family history. Results from examining latent growth curve models using listwise deletion and multiple imputation indicated that multiple imputation corrected for selective nonresponse, providing less-biased…
Wolkowitz, Amanda A.; Skorupski, William P.
When missing values are present in item response data, there are a number of ways one might impute a correct or incorrect response to a multiple-choice item. There are significantly fewer methods for imputing the actual response option an examinee may have provided if he or she had not omitted the item either purposely or accidentally. This…
Bartlett, Jonathan W; Seaman, Shaun R; White, Ian R; Carpenter, James R
Missing covariate data commonly occur in epidemiological and clinical research, and are often dealt with using multiple imputation. Imputation of partially observed covariates is complicated if the substantive model is non-linear (e.g. Cox proportional hazards model), or contains non-linear (e.g. squared) or interaction terms, and standard software implementations of multiple imputation may impute covariates from models that are incompatible with such substantive models. We show how imputation by fully conditional specification, a popular approach for performing multiple imputation, can be modified so that covariates are imputed from models which are compatible with the substantive model. We investigate through simulation the performance of this proposal, and compare it with existing approaches. Simulation results suggest our proposal gives consistent estimates for a range of common substantive models, including models which contain non-linear covariate effects or interactions, provided data are missing at random and the assumed imputation models are correctly specified and mutually compatible. Stata software implementing the approach is freely available. © The Author(s) 2014.
Janet L. Ohmann; Matthew J. Gregory; Emilie B. Henderson; Heather M. Roberts
Question: How can nearest-neighbour (NN) imputation be used to develop maps of multiple species and plant communities? Location: Western and central Oregon, USA, but methods are applicable anywhere. Methods: We demonstrate NN imputation by mapping woody plant communities for >100 000 km2 of diverse forests and woodlands. Species abundances on...
Si, Yajuan; Reiter, Jerome P.
In many surveys, the data comprise a large number of categorical variables that suffer from item nonresponse. Standard methods for multiple imputation, like log-linear models or sequential regression imputation, can fail to capture complex dependencies and can be difficult to implement effectively in high dimensions. We present a fully Bayesian,…
Many real-world medical datasets contain some proportion of missing (attribute) values. In general, missing value imputation can be performed to solve this problem, which is to provide estimations for the missing values by a reasoning process based on the (complete) observed data. However, if the observed data contain some noisy information or outliers, the estimations of the missing values may not be reliable or may even be quite different from the real values. The aim of this paper is to examine whether a combination of instance selection from the observed data and missing value imputation offers better performance than performing missing value imputation alone. In particular, three instance selection algorithms, DROP3, GA, and IB3, and three imputation algorithms, KNNI, MLP, and SVM, are used in order to find out the best combination. The experimental results show that performing instance selection can have a positive impact on missing value imputation over the numerical data type of medical datasets, and specific combinations of instance selection and imputation methods can improve the imputation results over the mixed data type of medical datasets. However, instance selection does not have a definitely positive impact on the imputation result for categorical medical datasets.
K. Estrada Gil (Karol); A. Abuseiris (Anis); F.G. Grosveld (Frank); A.G. Uitterlinden (André); T.A. Knoch (Tobias); F. Rivadeneira Ramirez (Fernando)
The current fast growth of genome-wide association studies (GWAS) combined with now common computationally expensive imputation requires the online access of large user groups to high-performance computing resources capable of analyzing rapidly and efficiently millions of genetic
Y.J. Kim (Young Jin); J. Lee (Juyoung); B.-J. Kim (Bong-Jo); T. Park (Taesung); G.R. Abecasis (Gonçalo); M.A.A. De Almeida (Marcio); D. Altshuler (David); J.L. Asimit (Jennifer L.); G. Atzmon (Gil); M. Barber (Mathew); A. Barzilai (Ari); N.L. Beer (Nicola L.); G.I. Bell (Graeme I.); J. Below (Jennifer); T. Blackwell (Tom); J. Blangero (John); M. Boehnke (Michael); D.W. Bowden (Donald W.); N.P. Burtt (Noël); J.C. Chambers (John); H. Chen (Han); P. Chen (Ping); P.S. Chines (Peter); S. Choi (Sungkyoung); C. Churchhouse (Claire); P. Cingolani (Pablo); B.K. Cornes (Belinda); N.J. Cox (Nancy); A.G. Day-Williams (Aaron); A. Duggirala (Aparna); J. Dupuis (Josée); T. Dyer (Thomas); S. Feng (Shuang); J. Fernandez-Tajes (Juan); T. Ferreira (Teresa); T.E. Fingerlin (Tasha E.); J. Flannick (Jason); J.C. Florez (Jose); P. Fontanillas (Pierre); T.M. Frayling (Timothy); C. Fuchsberger (Christian); E. Gamazon (Eric); K. Gaulton (Kyle); S. Ghosh (Saurabh); B. Glaser (Benjamin); A.L. Gloyn (Anna); R.L. Grossman (Robert L.); J. Grundstad (Jason); C. Hanis (Craig); A. Heath (Allison); H. Highland (Heather); M. Horikoshi (Momoko); I.-S. Huh (Ik-Soo); J.R. Huyghe (Jeroen R.); M.K. Ikram (Kamran); K.A. Jablonski (Kathleen); Y. Jun (Yang); N. Kato (Norihiro); J. Kim (Jayoun); Y.J. Kim (Young Jin); B.-J. Kim (Bong-Jo); J. Lee (Juyoung); C.R. King (C. Ryan); J.S. Kooner (Jaspal S.); M.-S. Kwon (Min-Seok); H.K. Im (Hae Kyung); M. Laakso (Markku); K.K.-Y. Lam (Kevin Koi-Yau); J. Lee (Jaehoon); S. Lee (Selyeong); S. Lee (Sungyoung); D.M. Lehman (Donna M.); H. Li (Heng); C.M. Lindgren (Cecilia); X. Liu (Xuanyao); O.E. Livne (Oren E.); A.E. Locke (Adam E.); A. Mahajan (Anubha); J.B. Maller (Julian B.); A.K. Manning (Alisa K.); T.J. Maxwell (Taylor J.); A. Mazoure (Alexander); M.I. McCarthy (Mark); J.B. Meigs (James B.); B. Min (Byungju); K.L. Mohlke (Karen); A.P. Morris (Andrew); S. Musani (Solomon); Y. Nagai (Yoshihiko); M.C.Y. Ng (Maggie C.Y.); D. Nicolae (Dan); S. Oh (Sohee); N.D. 
Palmer (Nicholette); T. Park (Taesung); T.I. Pollin (Toni I.); I. Prokopenko (Inga); D. Reich (David); M.A. Rivas (Manuel); L.J. Scott (Laura); M. Seielstad (Mark); Y.S. Cho (Yoon Shin); X. Sim (Xueling); R. Sladek (Rob); P. Smith (Philip); I. Tachmazidou (Ioanna); E.S. Tai (Shyong); Y.Y. Teo (Yik Ying); T.M. Teslovich (Tanya M.); J. Torres (Jason); V. Trubetskoy (Vasily); S.M. Willems (Sara); A.L. Williams (Amy L.); J.G. Wilson (James); S. Wiltshire (Steven); S. Won (Sungho); A.R. Wood (Andrew); W. Xu (Wang); J. Yoon (Joon); M. Zawistowski (Matthew); E. Zeggini (Eleftheria); W. Zhang (Weihua); S. Zöllner (Sebastian)
Background: Rare variants have gathered increasing attention as a possible alternative source of missing heritability. Since next generation sequencing technology is not yet cost-effective for large-scale genomic studies, a widely used alternative approach is imputation. However, the
Genotyping-by-sequencing allows for large-scale genetic analyses in plant species with no reference genome, creating the challenge of sound inference in the presence of uncertain genotypes. Here we report an imputation-based genome-wide association study (GWAS) in reed canarygrass (Phalaris arundina...
Michael J. Falkowski; Andrew T. Hudak; Nicholas L. Crookston; Paul E. Gessler; Edward H. Uebler; Alistair M. S. Smith
Sustainable forest management requires timely, detailed forest inventory data across large areas, which is difficult to obtain via traditional forest inventory techniques. This study evaluated k-nearest neighbor imputation models incorporating LiDAR data to predict tree-level inventory data (individual tree height, diameter at breast height, and...
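The general idea of k-nearest neighbor imputation can be sketched as below. This is a generic illustration under stated assumptions, not the model evaluated in the study: the function name `knn_impute`, the Euclidean distance in predictor space, and the simple averaging of neighbors are all hypothetical choices, and the predictors merely stand in for LiDAR-derived metrics.

```python
import math

def knn_impute(reference, target_x, k=1):
    """Impute attributes for a target unit from its k nearest reference units.

    reference: list of (predictors, attributes) pairs for units where both
               were measured (e.g. field plots).
    target_x:  predictor vector of a unit whose attributes are unknown.
    Returns the attribute-wise mean over the k nearest reference units.
    """
    ranked = sorted(reference, key=lambda r: math.dist(r[0], target_x))
    nearest = ranked[:k]
    n_attrs = len(nearest[0][1])
    return [sum(y[j] for _, y in nearest) / len(nearest) for j in range(n_attrs)]

# Hypothetical reference plots: (predictors, [attribute])
plots = [((0.0, 0.0), [10.0]), ((1.0, 0.0), [20.0]), ((5.0, 5.0), [100.0])]
```

With k = 1 the imputed values are copied from a single real reference unit, so the relationships among attributes are kept as they were observed; larger k averages across neighbors and smooths the predictions.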
Cauthen, Katherine Regina [Sandia National Lab. (SNL-NM), Albuquerque, NM (United States); Lambert, Gregory [Apple Inc., Cupertino, CA (United States); Ray, Jaideep [Sandia National Lab. (SNL-CA), Livermore, CA (United States); Lefantzi, Sophia [Sandia National Lab. (SNL-CA), Livermore, CA (United States)
Traditional multiple imputation approaches may perform poorly for datasets with high rates of missingness unless many m imputations are used. This paper implements an alternative machine learning-based approach to imputing data that are missing at high rates. Here, we use boosting to create a strong learner from a weak learner fitted to a dataset missing many observations. This approach may be applied to a variety of types of learners (models). The approach is demonstrated by application to a spatiotemporal dataset for predicting dengue outbreaks in India from meteorological covariates. A Bayesian spatiotemporal CAR model is boosted to produce imputations, and the overall RMSE from a k-fold cross-validation is used to assess imputation accuracy.
Ding, Yaohui; Ross, Arun
While fusion can be accomplished at multiple levels in a multibiometric system, score level fusion is commonly used as it offers a good trade-off between fusion complexity and data availability. However, missing scores affect the implementation of several biometric fusion rules. While there are several techniques for handling missing data, the imputation scheme - which replaces missing values with predicted values - is preferred since this scheme can be followed by a standard fusion scheme designed for complete data. This paper compares the performance of three imputation methods: Imputation via Maximum Likelihood Estimation (MLE), Multiple Imputation (MI) and Random Draw Imputation through Gaussian Mixture Model estimation (RD GMM). A novel method called Hot-deck GMM is also introduced and exhibits markedly better performance than the other methods because of its ability to preserve the local structure of the score distribution. Experiments on the MSU dataset indicate the robustness of the schemes in handling missing scores at various missing data rates.
Hernández, Gilma; Moriña, David; Navarro, Albert
The presence of missing data in collected variables is common in health surveys, but the subsequent imputation thereof at the time of analysis is not. Working with imputed data may have certain benefits regarding the precision of the estimators and the unbiased identification of associations between variables. The imputation process is probably still little understood by many non-statisticians, who view this process as highly complex and with an uncertain goal. To clarify these questions, this note aims to provide a straightforward, non-exhaustive overview of the imputation process to enable public health researchers ascertain its strengths. All this in the context of dichotomous variables which are commonplace in public health. To illustrate these concepts, an example in which missing data is handled by means of simple and multiple imputation is introduced. Copyright © 2017 SESPAS. Publicado por Elsevier España, S.L.U. All rights reserved.
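The final step of multiple imputation, combining the results from the m completed data sets, is commonly done with Rubin's rules. The sketch below is a minimal illustration, assuming each completed data set has already yielded a point estimate and its squared standard error; the function name `pool_rubin` is hypothetical.

```python
import statistics

def pool_rubin(estimates, variances):
    """Pool m per-imputation estimates with Rubin's rules.

    estimates: point estimate from each imputed data set
    variances: squared standard error of each estimate
    Returns (pooled estimate, total variance).
    """
    m = len(estimates)
    pooled = statistics.fmean(estimates)
    within = statistics.fmean(variances)      # average sampling variance
    between = statistics.variance(estimates)  # spread across imputations (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return pooled, total
```

The between-imputation term is what single imputation leaves out: it carries the extra uncertainty that comes from not knowing the missing values.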
Sanchez Sanchez, Juan Jose; Phillips, C.; Børsting, Claus
for amplifying 52 genomic DNA fragments, each containing one SNP, in a single tube, and accurately genotyping the PCR product mixture using two single base extension reactions. This multiplex approach reduces the cost of SNP genotyping and requires as little as 0.5 ng of genomic DNA to detect 52 SNPs. We used...
Supplementary data: SNPs in genes with copy number variation: A question of specificity. Mainak Sengupta, Ananya Ray, Moumita Chaki, Mahua ... withdrawn in Build 127 are in bold. The potential PSVs are italicized and underlined. *Same as rs17134763 of HBA2; '–' base is absent in HBM at the equivalent position.
Sanchez Sanchez, Juan Jose; Børsting, Claus; Morling, Niels
We describe a method for the simultaneous typing of Y-chromosome single nucleotide polymorphism (SNP) markers by means of multiplex polymerase chain reaction (PCR) strategies that allow the detection of 35 Y chromosome SNPs on 25 amplicons from 100 to 200 pg of chromosomal deoxyribonucleic acid...
Table 1. Basic characteristics of 32 SNPs in neurotransmitter-related genes. Gene. SNP ID. Allele variants. Chromosome. Genomic position (bp). Intermarker distances (bp). Genic position .... (head to back of shoulder), Middle (back of shoulder to hind-quarters), Hind-quarters, and Legs (from the accessory digit upwards).
Abe, Makiko; Ito, Hidemi; Oze, Isao; Nomura, Masatoshi; Ogawa, Yoshihiro; Matsuo, Keitaro
Little is known about differences in genetic predisposition to colorectal cancer (CRC) between ethnicities, although many genetic traits common to CRC have been identified. This study investigated whether adding SNPs identified by GWAS in East Asian populations could improve risk prediction in Japanese individuals, and explored the possible application of genetic risk groups as an instrument of risk communication. A total of 558 patients with histologically verified colorectal cancer and 1116 first-visit outpatients were included in the derivation study, and 547 cases and 547 controls in the replication study. In each population, we evaluated prediction models for the risk of CRC that combined the genetic risk group based on SNPs from GWASs in European populations, and a similarly developed model adding SNPs from GWASs in East Asian populations. We examined whether adding East Asian-specific SNPs would improve discrimination. Six SNPs (rs6983267, rs4779584, rs4444235, rs9929218, rs10936599, rs16969681) from 23 SNPs identified by European-based GWAS and five SNPs (rs704017, rs11196172, rs10774214, rs647161, rs2423279) from ten SNPs identified by Asian-based GWAS were selected for the CRC risk prediction model. Compared with the 6-SNP model, the 11-SNP model including Asian GWAS SNPs showed improved discrimination capacity in receiver operating characteristic analysis. The 11-SNP model yielded a statistically significant improvement over the six-SNP model in both the derivation (P = 0.0039) and replication studies (P = 0.0018). We estimated the cumulative risk of CRC using the genetic risk group based on 11 SNPs and found that the cumulative risk at age 80 is approximately 13% in the high-risk group and 6% in the low-risk group. We constructed a more efficient CRC risk prediction model with 11 SNPs, including newly identified East Asian-based GWAS SNPs (rs704017, rs11196172, rs10774214, rs647161, rs2423279). Risk grouping based on 11 SNPs depicted a lifetime difference in CRC risk. This might be useful for
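How a SNP-based score is compared by discrimination can be sketched in a few lines. The abstract does not say how risk alleles were combined into risk groups, so the unweighted risk-allele count below is an assumption made for illustration only, and the genotype data are invented toy values. The AUC is computed as the probability that a randomly chosen case scores higher than a randomly chosen control.

```python
def risk_score(genotypes):
    """Unweighted score: total risk alleles carried (0, 1 or 2 per SNP).

    Assumption: equal weight per SNP; the study's own scoring is not given.
    """
    return sum(genotypes)


def auc(case_scores, control_scores):
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


# Toy genotypes (risk-allele counts at three hypothetical SNPs per person).
cases = [[2, 1, 1], [1, 1, 2], [2, 2, 0]]
controls = [[0, 1, 0], [1, 0, 1], [2, 1, 1]]
model_auc = auc([risk_score(g) for g in cases],
                [risk_score(g) for g in controls])
```

Comparing a 6-SNP and an 11-SNP model then amounts to computing this quantity for each score on the same subjects and testing the difference.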
Jerez, José M; Molina, Ignacio; García-Laencina, Pedro J; Alba, Emilio; Ribelles, Nuria; Martín, Miguel; Franco, Leonardo
Missing data imputation is an important task in cases where it is crucial to use all available data and not discard records with missing values. This work evaluates the performance of several statistical and machine learning imputation methods that were used to predict recurrence in patients in an extensive real breast cancer data set. Imputation methods based on statistical techniques, e.g., mean, hot-deck and multiple imputation, and machine learning techniques, e.g., multi-layer perceptron (MLP), self-organisation maps (SOM) and k-nearest neighbour (KNN), were applied to data collected through the "El Álamo-I" project, and the results were then compared to those obtained from the listwise deletion (LD) imputation method. The database includes demographic, therapeutic and recurrence-survival information from 3679 women with operable invasive breast cancer diagnosed in 32 different hospitals belonging to the Spanish Breast Cancer Research Group (GEICAM). The accuracies of predictions on early cancer relapse were measured using artificial neural networks (ANNs), in which different ANNs were estimated using the data sets with imputed missing values. The imputation methods based on machine learning algorithms outperformed imputation statistical methods in the prediction of patient outcome. Friedman's test revealed a significant difference (p=0.0091) in the observed area under the ROC curve (AUC) values, and the pairwise comparison test showed that the AUCs for MLP, KNN and SOM were significantly higher (p=0.0053, p=0.0048 and p=0.0071, respectively) than the AUC from the LD-based prognosis model. The methods based on machine learning techniques were the most suited for the imputation of missing values and led to a significant enhancement of prognosis accuracy compared to imputation methods based on statistical procedures. Copyright © 2010 Elsevier B.V. All rights reserved.
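Of the machine learning methods listed above, k-nearest-neighbour imputation is the simplest to illustrate. The following is a minimal sketch under stated assumptions, not the implementation used in the study: features are numeric, distance is a root-mean-square difference over features observed in both records, and a missing entry is filled with the mean of that feature among the `k` closest donors. The function name and defaults are hypothetical.

```python
import math


def knn_impute(rows, k=3):
    """Return a copy of rows with None entries filled from the k nearest donors."""

    def distance(a, b):
        # Compare only features observed in both records.
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    completed = []
    for i, row in enumerate(rows):
        new_row = list(row)
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate donors: other records with feature j observed.
            donors = sorted(
                (distance(row, other), other[j])
                for idx, other in enumerate(rows)
                if idx != i and other[j] is not None
            )
            nearest = [v for d, v in donors[:k] if d < math.inf]
            if nearest:
                new_row[j] = sum(nearest) / len(nearest)
        completed.append(new_row)
    return completed


records = [[1.0, 2.0], [1.1, 2.2], [5.0, 10.0], [1.05, None]]
filled = knn_impute(records, k=2)
```

Listwise deletion would instead drop the last record entirely, which is the loss of information that imputation is meant to avoid.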
Davis, Nicholas G; Houston, Derek D; Nason, John D
Single-nucleotide polymorphism (SNP) primers were developed for a native North American desert fig, Ficus petiolaris (Moraceae), to provide markers for population genetic studies designed to quantify patterns of gene flow across a complex landscape. Transcriptome sequencing and bioinformatic protocols were implemented to discover SNPs in single-copy protein-coding genes. Multiplexes of 30 nuclear and 24 organellar (chloroplast and mitochondrial) SNPs were selected for primer development and genotyping on the Sequenom MASSArray System. Of these 54 loci, 49 reliably amplified across a panel of 96 F. petiolaris individuals. This study has provided SNP primers that can be applied in future studies investigating population genetics of F. petiolaris and its coevolution with associated pollinating and nonpollinating fig wasps.
miR-155 has been confirmed to be a key factor in immune responses in humans and other mammals. Therefore, investigation of variations in miR-155 could be useful for understanding the differences in immunity between individuals. In this study, four SNPs in miR-155 were identified in mice (Mus musculus) and humans (Homo sapiens). In mice, the four SNPs were closely linked and formed two miR-155 haplotypes (A and B). Ten distinct types of blood parameters were associated with miR-155 expression under normal conditions. Additionally, 4 and 14 blood parameters were significantly different between these two genotypes under normal and lipopolysaccharide (LPS) stimulation conditions, respectively. Moreover, the expression levels of miR-155, the inflammatory response to LPS stimulation and the lethal ratio following Salmonella typhimurium infection were significantly increased in mice harboring the AA genotype. Further, two SNPs, one in the loop region and the other near the 3' terminal of pre-miR-155, were confirmed to be responsible for the differential expression of miR-155 in mice. Interestingly, two additional SNPs, one in the loop region and the other in the middle of miR-155*, modulated the function of miR-155 in humans. Predictions of secondary RNA structure using RNAfold showed that these SNPs affected the structure of miR-155 in both mice and humans. Our results provide novel evidence of the natural functional SNPs of miR-155 in both mice and humans, which may affect the expression levels of mature miR-155 by modulating its secondary structure. The SNPs of human miR-155 may be considered as causal mutations for some immune-related diseases in the clinic. The two genotypes of mice could be used as natural models for studying the mechanisms of immune diseases caused by abnormal expression of miR-155 in humans.
Liu, Yuzhe; Gopalakrishnan, Vanathi
Many clinical research datasets have a large percentage of missing values that directly impacts their usefulness in yielding high accuracy classifiers when used for training in supervised machine learning. While missing value imputation methods have been shown to work well with smaller percentages of missing values, their ability to impute sparse clinical research data can be problem specific. We previously attempted to learn quantitative guidelines for ordering cardiac magnetic resonance imaging during the evaluation for pediatric cardiomyopathy, but missing data significantly reduced our usable sample size. In this work, we sought to determine if increasing the usable sample size through imputation would allow us to learn better guidelines. We first review several machine learning methods for estimating missing data. Then, we apply four popular methods (mean imputation, decision tree, k-nearest neighbors, and self-organizing maps) to a clinical research dataset of pediatric patients undergoing evaluation for cardiomyopathy. Using Bayesian Rule Learning (BRL) to learn ruleset models, we compared the performance of imputation-augmented models versus unaugmented models. We found that all four imputation-augmented models performed similarly to unaugmented models. While imputation did not improve performance, it did provide evidence for the robustness of our learned models.
Ghazy, Amany A; El-Etreby, Nour M
Ovarian cancer is one of the most lethal gynecological malignancies and the fifth leading cause of cancer deaths among women. The high mortality rate is largely attributed to its diagnosis in advanced stages. Poor prognosis of ovarian cancer is usually due to the lack of specific or effective screening and diagnostic methods for identifying early-stage disease. Our study aimed to study the role of HLA-DP, HLA-DQ, and ICAM-1 SNPs in diagnosis and/or prognosis of ovarian tumors. The current study was conducted on 60 patients with ovarian tumors (benign, borderline, and malignant) and 20 healthy volunteers. Genotyping of HLA-DP rs3077, HLA-DQ rs3920, and ICAM-1 rs1437 SNPs was done using 5' nuclease assay. We found significant association of HLA-DP rs3077 AA, HLA-DQ rs3920 GG, ICAM-1 rs1437 CC, and CT genotypes with increased risk of ovarian cancer (OR = 43.5, 6, 25, and 2.6, respectively). In addition, HLA-DQ rs3920 and ICAM-1 rs1437 alleles vary significantly among different types of ovarian cancer (P = 0.003 and 0.001, respectively). HLA-DP rs3077, HLA-DQ rs3920, and ICAM-1 rs1437 SNPs could help in the diagnosis and prognosis of ovarian cancer.
Byrne, Enda M; Gehrman, Philip R; Medland, Sarah E; Nyholt, Dale R; Heath, Andrew C; Madden, Pamela A F; Hickie, Ian B; Van Duijn, Cornelia M; Henders, Anjali K; Montgomery, Grant W; Martin, Nicholas G; Wray, Naomi R
Several aspects of sleep behavior such as timing, duration and quality have been demonstrated to be heritable. To identify common variants that influence sleep traits in the population, we conducted a genome-wide association study of six sleep phenotypes assessed by questionnaire in a sample of 2,323 individuals from the Australian Twin Registry. Genotyping was performed on the Illumina 317, 370, and 610K arrays and the SNPs in common between platforms were used to impute non-genotyped SNPs. We tested for association with more than 2,000,000 common polymorphisms across the genome. While no SNPs reached the genome-wide significance threshold, we identified a number of associations in plausible candidate genes. Most notably, a group of SNPs in the third intron of the CACNA1C gene ranked as most significant in the analysis of sleep latency (P = 1.3 × 10⁻⁶). We attempted to replicate this association in an independent sample from the Chronogen Consortium (n = 2,034), but found no evidence of association (P = 0.73). We have identified several other suggestive associations that await replication in an independent sample. We did not replicate the results from previous genome-wide analyses of self-reported sleep phenotypes after correction for multiple testing. Copyright © 2013 Wiley Periodicals, Inc.
Scott, Laura J.; Mohlke, Karen L.; Bonnycastle, Lori L.; Willer, Cristen J.; Li, Yun; Duren, William L.; Erdos, Michael R.; Stringham, Heather M.; Chines, Peter S.; Jackson, Anne U.; Prokunina-Olsson, Ludmila; Ding, Chia-Jen; Swift, Amy J.; Narisu, Narisu; Hu, Tianle; Pruim, Randall; Xiao, Rui; Li, Xiao-Yi; Conneely, Karen N.; Riebow, Nancy L.; Sprau, Andrew G.; Tong, Maurine; White, Peggy P.; Hetrick, Kurt N.; Barnhart, Michael W.; Bark, Craig W.; Goldstein, Janet L.; Watkins, Lee; Xiang, Fang; Saramies, Jouko; Buchanan, Thomas A.; Watanabe, Richard M.; Valle, Timo T.; Kinnunen, Leena; Abecasis, Gonçalo R.; Pugh, Elizabeth W.; Doheny, Kimberly F.; Bergman, Richard N.; Tuomilehto, Jaakko; Collins, Francis S.; Boehnke, Michael
Identifying the genetic variants that increase the risk of type 2 diabetes (T2D) in humans has been a formidable challenge. Adopting a genome-wide association strategy, we genotyped 1161 Finnish T2D cases and 1174 Finnish normal glucose-tolerant (NGT) controls with >315,000 single-nucleotide polymorphisms (SNPs) and imputed genotypes for an additional >2 million autosomal SNPs. We carried out association analysis with these SNPs to identify genetic variants that predispose to T2D, compared our T2D association results with the results of two similar studies, and genotyped 80 SNPs in an additional 1215 Finnish T2D cases and 1258 Finnish NGT controls. We identify T2D-associated variants in an intergenic region of chromosome 11p12, contribute to the identification of T2D-associated variants near the genes IGF2BP2 and CDKAL1 and the region of CDKN2A and CDKN2B, and confirm that variants near TCF7L2, SLC30A8, HHEX, FTO, PPARG, and KCNJ11 are associated with T2D risk. This brings the number of T2D loci now confidently identified to at least 10. PMID:17463248
Damaso, Natalie; Martin, Lauren; Kushwaha, Priyanka; Mills, DeEtta
Ecological studies of microbial communities often use profiling methods but the true community diversity can be underestimated in methods that separate amplicons based on sequence length using performance optimized polymer 4. Taxonomically, unrelated organisms can produce the same length amplicon even though the amplicons have different sequences. F-108 polymer has previously been shown to resolve same length amplicons by sequence polymorphisms. In this study, we showed F-108 polymer, using the ABI Prism 310 Genetic Analyzer and CE, resolved four bacteria that produced the same length amplicon for the 16S rRNA domain V3 but have variable nucleotide content. Second, a microbial mat community profile was resolved and supported by NextGen sequencing where the number of peaks in the F-108 profile was in concordance with the confirmed species numbers in the mat. Third, equine DNA was analyzed for SNPs. The F-108 polymer was able to distinguish heterozygous and homozygous individuals for the melanocortin 1 receptor coat color gene. The method proved to be rapid, inexpensive, reproducible, and uses common CE instruments. The potential for F-108 to resolve DNA mixtures or SNPs can be applied to various sample types-from SNPs to forensic mixtures to ecological communities. © 2014 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
Helenowski, Irene B.; Demirtas, Hakan; Khan, Seema; Eladoumikdachi, Firas; Shidfar, Ali
Tumor size based on mammographic and ultrasound data are two methods used in predicting recurrence in breast cancer patients. Which technology offers better determination of diagnosis, however, is an ongoing debate among radiologists, biophysicists, and other clinicians. Further complications in assessing the performance of each technology arise from missing data. One approach to remedy this problem may involve multiple imputation. Here, we therefore examine how imputation affects our assessment of the relationship between recurrence and tumor size determined either by mammography or ultrasound technology. We specifically employ the semi-parametric approach for imputing mixed continuous and binary data as presented in Helenowski and Demirtas (2013).
Li, Yan-Yun; Xing, Jun; Zhao, Lin-Sheng; Li, Yan-Ni; Wang, Yu-Chuan; Zhang, Wei-Ming
Polymorphisms of the human leukocyte antigen (HLA) gene play an important role in the development of cervical cancer. This study aimed to screen single nucleotide polymorphisms (SNPs) of the HLA-DQA1 gene involved in susceptibility to cervical cancer by a bioinformatics approach, and to analyze their correlations with abnormal gene functions. SNPs of HLA-DQA1 were screened from the public database dbSNP by SNPper software, and relevant FASTA subsequences were also obtained from dbSNP. PARSESNP software was used to analyze cSNPs. Two SNPs, rs9272693 and rs9272703, which may induce missense mutations, were identified in the coding region of the HLA-DQA1 gene. A PSSM difference >10 was used to predict deleterious mutations. SNPper software in combination with PARSESNP software could be used to analyze SNPs of the HLA-DQA1 gene and select the variants in a conserved region, and it provides an evaluation criterion. However, the results need to be verified in cervical cancer patients and control populations.
Siersma, Volkert Dirk; Johansen, Christoffer
nonparametric bootstrap, bootstrap confidence intervals, missing values, multiple imputation, matched case-control study
The power of SNP association studies to detect valid relationships with clinical phenotypes in schizophrenia is largely limited by the number of SNPs selected and non-specificity of phenotypes. To address this, we first assessed performance on two visual perceptual organization tasks designed to avoid many generalized deficit confounds, Kanizsa shape perception and contour integration, in a schizophrenia patient sample. Then, to reduce the total number of candidate SNPs analyzed in association with perceptual organization phenotypes, we employed a two-stage strategy: first a priori SNPs from three candidate genes were selected (GAD1, NRG1 and DTNBP1); then a Hierarchical Classes Analysis (HICLAS) was performed to reduce the total number of SNPs, based on statistically related SNP clusters. HICLAS reduced the total number of candidate SNPs for subsequent phenotype association analyses from 6 to 3. MANCOVAs indicated that rs10503929 and rs1978340 were associated with the Kanizsa shape perception filling in metric but not the global shape detection metric. rs10503929 was also associated with altered contour integration performance. SNPs not selected by the HICLAS model were unrelated to perceptual phenotype indices. While the contribution of candidate SNPs to perceptual impairments requires further clarification, this study reports the first application of HICLAS as a hypothesis-independent mathematical method for SNP data reduction. HICLAS may be useful for future larger scale genotype-phenotype association studies.
Nielsen, Kaspar René; Rodrigo-Domingo, Maria; Steffensen, Rudi
-nucleotide polymorphisms (SNPs) involved in the immune response and a subsequent statistical analysis that focusses on the association of SNPs, certain haplotypes or SNP-SNP interactions with MM risk and prognosis. We genotyped 348 Danish patients and 355 controls for 13 SNPs located in the TNFA, IL-4, IL-6, IL-10 and CHI...... were studied for expression in normal B-cell subsets and myeloma plasma cells. We observed a significantly reduced risk when harboring the TNFA-238A allele (OR = 0.51 (0.29-0.86)) and interactions between the TNFA-1031T/C * and IL-10 -3575T/A (p = .007) as well as the TNFA-308G/A * and IL-10-1082G/A (p...... = .008) alleles. By statistical approaches, we observed association between prognosis and the TNFA-857CC genotype (HR = 2.80 (1.29-6.10)) and IL-10-1082GG + GA genotypes (HR = 1.93 (1.07-3.49)) and interactions between IL-6-174G/C and IL-10-3575T/A (p = .001) and between TNFA-308G/A and IL-4-1098T/G (p...
Coetzee, Simon G.; Pierce, Steven; Brundin, Patrik; Brundin, Lena; Hazelett, Dennis J.; Coetzee, Gerhard A.
Recent genome-wide association studies (GWAS) of Parkinson’s disease (PD) revealed at least 26 risk loci, with associated single nucleotide polymorphisms (SNPs) located in non-coding DNA having unknown functions in risk. In order to explore in which cell types these SNPs (and their correlated surrogates at r2 ≥ 0.8) could alter cellular function, we assessed their location overlap with histone modification regions that indicate transcription regulation in 77 diverse cell types. We found statistically significant enrichment of risk SNPs at 12 loci in active enhancers or promoters. We investigated 4 risk loci in depth that were most significantly enriched (−logeP > 14) and contained 8 putative enhancers in the different cell types. These enriched loci, along with eQTL associations, were unexpectedly present in non-neuronal cell types. These included lymphocytes, mesendoderm, liver- and fat-cells, indicating that cell types outside the brain are involved in the genetic predisposition to PD. Annotating regulatory risk regions within specific cell types may unravel new putative risk mechanisms and molecular pathways that contribute to PD development. PMID:27461410
Lung transplant patients present important variability in immunosuppressant blood concentrations during the first months after transplantation. Pharmacogenetics could explain part of this interindividual variability. We evaluated SNPs in genes that have previously shown correlations in other kinds of solid organ transplantation, namely ABCB1 and CYP3A5 genes with tacrolimus (Tac) and ABCC2, UGT1A9 and SLCO1B1 genes with mycophenolic acid (MPA), during the first six months after lung transplantation (51 patients). The genotype was correlated to the trough blood drug concentrations corrected for dose and body weight (C0/Dc). The ABCB1 variant in rs1045642 was associated with significantly higher Tac concentration at six months post-transplantation (CT vs. CC). In the MPA analysis, CT patients in ABCC2 rs3740066 presented significantly lower blood concentrations than CC or TT, three months after transplantation. Other tendencies, confirming previously expected results, were found associated with the rest of studied SNPs. An interesting trend was recorded for the incidence of acute rejection according to NOD2/CARD15 rs2066844 (CT: 27.9%; CC: 12.5%). Relevant SNPs related to Tac and MPA in other solid organ transplants also seem to be related to the efficacy and safety of treatment in the complex setting of lung transplantation.
Apalasamy, Y.D.; Ming, M.F.; Rampal, S.; Bulgiba, A.; Mohamed, Z.
The common variants in the fat mass- and obesity-associated (FTO) gene have been previously found to be associated with obesity in various adult populations. The objective of the present study was to investigate whether the single nucleotide polymorphisms (SNPs) and linkage disequilibrium (LD) blocks in various regions of the FTO gene are associated with predisposition to obesity in Malaysian Malays. Thirty-one FTO SNPs were genotyped in 587 (158 obese and 429 non-obese) Malaysian Malay subjects. Obesity traits and lipid profiles were measured and single-marker association testing, LD testing, and haplotype association analysis were performed. LD analysis of the FTO SNPs revealed the presence of 57 regions with complete LD (D' = 1.0). In addition, we detected the association of rs17817288 with low-density lipoprotein cholesterol. The FTO gene may therefore be involved in lipid metabolism in Malaysian Malays. Two haplotype blocks were present in this region of the FTO gene, but no particular haplotype was found to be significantly associated with an increased risk of obesity in Malaysian Malays
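The LD statistic reported above, D', can be computed from one haplotype frequency and two allele frequencies. The sketch below uses the standard definitions of Lewontin's D' and of r² for two biallelic loci; the function names and the toy frequencies are illustrative and are not taken from the study.

```python
def d_prime(p_ab, p_a, p_b):
    """Lewontin's D' for two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A (locus 1) and B (locus 2)
    p_a, p_b: frequencies of alleles A and B
    """
    d = p_ab - p_a * p_b
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    if d_max == 0:
        return 0.0  # a monomorphic locus carries no LD information
    return abs(d) / d_max


def r_squared(p_ab, p_a, p_b):
    """Squared allelic correlation between the two loci."""
    d = p_ab - p_a * p_b
    return d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))


# Allele A only ever occurs together with B: complete LD (D' = 1),
# yet r^2 stays well below 1 because the allele frequencies differ.
complete = d_prime(0.3, 0.3, 0.5)
correlation = r_squared(0.3, 0.3, 0.5)
```

This illustrates why D' = 1.0 does not imply that one SNP is a perfect proxy for another; r² is the more stringent measure for that purpose.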
Zeng, Ruixia; Zhang, Yibo; Du, Peng
Melanocortin 4 receptor (MC4R), which is associated with inherited human obesity, is involved in the regulation of food intake and body weight in mammals. To study the relationship between MC4R gene polymorphism and body weight in Beagle dogs, we sequenced and compared the whole coding region and the 3'- and 5'-flanking regions of the dog MC4R gene (1214 bp). In 120 Beagle dogs, two SNPs (A420C, C895T) were identified and their relation to body weight was analyzed with the RFLP-PCR method. The results showed that the A420C SNP, which changes amino acid 101 of the MC4R protein from asparagine to threonine, was significantly associated with canine body weight, while the C895T nonsense mutation was associated with significant body weight variation in female dogs. This suggests that the two SNPs might affect the function of the MC4R gene in relation to body weight in Beagle dogs. MC4R is therefore a candidate gene for selecting dogs of different sizes, with the MC4R SNPs (A420C, C895T) being potentially valuable as genetic markers.
Abstract: A single nucleotide polymorphism (SNP) is a genetic variant found in more than 1% of a population. Haplotypes, sets of SNPs or alleles that lie together on one chromosome, tend to be inherited together by the next generation and can be used as genetic markers to trace disease-causing genes. This article aims to explain the application of SNP analysis in the diagnosis of several syndromes caused by genetic disorders. Previous studies showed that syndromes caused by uniparental disomy (UPD), as well as autosomal recessive disorders arising from consanguineous marriage, can be identified by SNP array through analysis of blocks of homozygosity in a chromosome. Another advantage of SNP arrays is their ability to detect low-level mosaicism that is not detected by conventional cytogenetic examination. SNP arrays are even currently being trialled in IVF to obtain healthy babies, by detecting the presence or absence of a single disease-causing gene in IVF embryos before they are implanted in the uterus. SNP analysis by SNP array has many advantages over other SNP testing methods and is expected to be widely used in molecular genetic diagnostics in the future.
Wang, Kevin Yuqi; Vankov, Emilian R; Lin, Doris Da May
OBJECTIVE Oligodendroglioma is a rare primary CNS neoplasm in the pediatric population, and only a limited number of studies in the literature have characterized this entity. Existing studies are limited by small sample sizes and discrepant interstudy findings in identified prognostic factors. In the present study, the authors aimed to increase the statistical power in evaluating for potential prognostic factors of pediatric oligodendrogliomas and sought to reconcile the discrepant findings present among existing studies by performing an individual-patient-data (IPD) meta-analysis and using multiple imputation to address data not directly available from existing studies. METHODS A systematic search was performed, and all studies found to be related to pediatric oligodendrogliomas and associated outcomes were screened for inclusion. Each study was searched for specific demographic and clinical characteristics of each patient and the duration of event-free survival (EFS) and overall survival (OS). Given that certain demographic and clinical information of each patient was not available within all studies, a multivariable imputation via chained equations model was used to impute missing data after the mechanism of missing data was determined. The primary end points of interest were hazard ratios for EFS and OS, as calculated by the Cox proportional-hazards model. Both univariate and multivariate analyses were performed. The multivariate model was adjusted for age, sex, tumor grade, mixed pathologies, extent of resection, chemotherapy, radiation therapy, tumor location, and initial presentation. A p value of less than 0.05 was considered statistically significant. RESULTS A systematic search identified 24 studies with both time-to-event and IPD characteristics available, and a total of 237 individual cases were available for analysis. A median of 19.4% of the values among clinical, demographic, and outcome variables in the compiled 237 cases were missing. Multivariate
Shepherd, Bryan E; Liu, Qi; Mercaldo, Nathaniel; Jenkins, Cathy A; Lau, Bryan; Cole, Stephen R; Saag, Michael S; Sterling, Timothy R
Optimal timing of initiating antiretroviral therapy has been a controversial topic in HIV research. Two highly publicized studies applied different analytical approaches, a dynamic marginal structural model and a multiple imputation method, to different observational databases and came up with different conclusions. Discrepancies between the two studies' results could be due to differences between patient populations, fundamental differences between statistical methods, or differences between implementation details. For example, the two studies adjusted for different covariates, compared different thresholds, and had different criteria for qualifying measurements. If both analytical approaches were applied to the same cohort holding technical details constant, would their results be similar? In this study, we applied both statistical approaches using observational data from 12,708 HIV-infected persons throughout the USA. We held technical details constant between the two methods and then repeated analyses varying technical details to understand what impact they had on findings. We also present results applying both approaches to simulated data. Results were similar, although not identical, when technical details were held constant between the two statistical methods. Confidence intervals for the dynamic marginal structural model tended to be wider than those from the imputation approach, although this may have been due in part to additional external data used in the imputation analysis. We also consider differences in the estimands, required data, and assumptions of the two statistical methods. Our study provides insights into assessing optimal dynamic treatment regimes in the context of starting antiretroviral therapy and in more general settings. Copyright © 2016 John Wiley & Sons, Ltd.
Abstract Background Activated Protein C Resistance (APCR), a poor anticoagulant response of APC in haemostasis, is the commonest heritable thrombophilia. Adverse outcomes during pregnancy have been linked to APCR. This study determined the frequency of APCR, known and novel SNPs in the factor V gene, and adverse outcomes in a group of pregnant women. Methods Blood samples collected from 907 pregnant women were tested using the Coatest® Classic and Modified functional haematological tests to establish the frequency of APCR. PCR-Restriction Enzyme Analysis (PCR-REA), PCR-DNA probe hybridisation analysis and DNA sequencing were used for molecular screening of known mutations in the factor V gene in subjects determined to have APCR based on the Coatest® Classic and/or Modified functional haematological tests. Glycosylase Mediated Polymorphism Detection (GMPD), a SNP screening technique, and DNA sequencing were used to identify SNPs in the factor V gene of 5 APCR subjects. Results Sixteen percent of the study group had an APCR phenotype. Factor V Leiden (FVL), FV Cambridge, and haplotype (H) R2 alleles were identified in this group. Thirty-three SNPs (9 silent SNPs and 24 missense SNPs), of which 20 were novel, were identified in the 5 APCR subjects. Adverse pregnancy outcomes were found at a frequency of 35% in the group with APCR based on the Classic Coatest® test only and at 45% in the group with APCR based on the Modified Coatest® test. Forty-eight percent of subjects with FVL had adverse outcomes, while in the group of subjects with no FVL, adverse outcomes occurred at a frequency of 37%. Conclusions Known mutations and novel SNPs in the factor V gene were identified in the study cohort determined to have APCR in pregnancy. Further studies are required to investigate the contribution of these novel SNPs to the APCR phenotype. Adverse outcomes including early pregnancy loss (EPL), preeclampsia (PET) and intrauterine growth restriction (IUGR) were not significantly more
Lopes, F B; Wu, X-L; Li, H; Xu, J; Perkins, T; Genho, J; Ferretti, R; Tait, R G; Bauck, S; Rosa, G J M
Reliable genomic prediction of breeding values for quantitative traits requires the availability of a sufficient number of animals with genotypes and phenotypes in the training set. As of 31 October 2016, there were 3,797 Brangus animals with genotypes and phenotypes. These Brangus animals were genotyped using different commercial SNP chips. Of them, the largest group consisted of 1,535 animals genotyped by the GGP-LDV4 SNP chip. The remaining 2,262 genotypes were imputed to the SNP content of the GGP-LDV4 chip, so that the number of animals available for training the genomic prediction models was more than doubled. The present study showed that pooling animals with either original or imputed 40K SNP genotypes substantially increased genomic prediction accuracies on the ten traits. By supplementing imputed genotypes, the relative gains in genomic prediction accuracies on estimated breeding values (EBV) were from 12.60% to 31.27%, and the relative gain in genomic prediction accuracies on de-regressed EBV was slightly smaller (i.e. 0.87%-18.75%). The present study also compared the performance of five genomic prediction models and two cross-validation methods. The five genomic models predicted EBV and de-regressed EBV of the ten traits similarly well. Of the two cross-validation methods, leave-one-out cross-validation maximized the number of animals at the stage of training for genomic prediction. Genomic prediction accuracy (GPA) on the ten quantitative traits was validated in 1,106 newly genotyped Brangus animals based on the SNP effects estimated in the previous set of 3,797 Brangus animals, and the resulting accuracies were slightly lower than GPA in the original data. The present study was the first to leverage currently available genotype and phenotype resources in order to harness genomic prediction in Brangus beef cattle. © 2018 Blackwell Verlag GmbH.
de Vocht, Frank; Lee, Brian
Studies have suggested that residential exposure to extremely low frequency (50 Hz) electromagnetic fields (ELF-EMF) from high voltage cables, overhead power lines, electricity substations or towers is associated with reduced birth weight and may be associated with adverse birth outcomes or even miscarriages. We previously conducted a study of 140,356 singleton live births between 2004 and 2008 in Northwest England, which suggested that close residential proximity (≤ 50 m) to ELF-EMF sources was associated with reduced average birth weight of 212 g (95% CI: -395 to -29 g) but not with statistically significant increased risks for other adverse perinatal outcomes. However, the cohort was limited by missing data for most potentially confounding variables, including maternal smoking during pregnancy, which was only available for a small subgroup, and residual confounding could not be excluded. This study, using the same cohort, was conducted to minimize the effects of these problems using multiple imputation to address missing data and propensity score matching to minimize residual confounding. Missing data were imputed using multiple imputation by chained equations to generate five datasets. For each dataset, 115 exposed women (residing ≤ 50 m from a residential ELF-EMF source) were propensity score matched to 1150 unexposed women. After doubly robust confounder adjustment, close proximity to a residential ELF-EMF source remained associated with a reduction in birth weight of 116 g (95% confidence interval: -224 to -7 g). No effect was found for proximity ≤ 100 m compared to women living further away. These results indicate that although the effect size was about half of the effect previously reported, close maternal residential proximity to sources of ELF-EMF remained associated with suboptimal fetal growth. Copyright © 2014 Elsevier Ltd. All rights reserved.
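The abstract above analyses five imputed datasets and reports one pooled effect. It does not say how the per-dataset results were combined; a common choice is Rubin's rules, sketched below as an illustration only. The function name `pool_rubin` and the toy numbers are hypothetical and are not taken from the study.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Combine per-imputation estimates with Rubin's rules.

    estimates: point estimates, one per imputed dataset.
    variances: squared standard errors, one per imputed dataset.
    Returns (pooled_estimate, total_variance).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two estimates with matching variances")
    pooled = mean(estimates)        # average of the point estimates
    within = mean(variances)        # average within-imputation variance
    between = variance(estimates)   # between-imputation variance (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return pooled, total


# Toy, made-up inputs purely to show the call shape.
pooled, total = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

The total variance adds a between-imputation term so that uncertainty from the missing data is not ignored when forming a confidence interval.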
Dixon, L A; Dobbins, A E; Pulker, H K
nucleotide polymorphisms (SNPs). There is general agreement by the European DNA Profiling Group (EDNAP) and the European Network of Forensic Science Institutes (ENFSI) that the reason to implement new markers is to increase the chance of amplifying highly degraded DNA rather than to increase...... the discriminating power of the current techniques. A collaborative study between nine European and US laboratories was organised under the auspices of EDNAP. Each laboratory was supplied with a SNP multiplex kit (Foren-SNPs) provided by the Forensic Science Service, two mini-STR kits provided by the National...
Bani-Fatemi, Ali; Gonçalves, Vanessa F; Zai, Clement; de Souza, Renan; Le Foll, Bernard; Kennedy, James L; Wong, Albert H; De Luca, Vincenzo
Suicide is the act of intentionally causing one's own death. The lifetime suicide risk in schizophrenia is 4.9%, and 20% to 50% of patients with schizophrenia (SCZ) will attempt suicide during their life. The other risk factors for suicidal behavior in schizophrenia include prior history of suicide attempts, active psychosis, depression and substance abuse. To date, there are no robust genetic or epigenetic predictors of suicide or suicide attempt in this specific population. We collected detailed clinical information and DNA samples from 241 schizophrenia patients and performed the genetic analyses in suicide attempters and non-attempters among these patients. Using the structured research interview, we determined the presence of a lifetime suicide attempt and then tested 384 DNA variants in candidate genes supposed to be involved in the neurobiology of schizophrenia. We applied a novel mapping analysis using a specific bioinformatic tool that analyzed only the polymorphic CpG sites in our SNP panel. This analysis looked at the presence or absence of methylation sites affected by the SNP allele. The SNPs in the candidate genes were studied from a different perspective, considering their direct contribution to the availability of methylation sites within the gene of interest. The level of potential methylation was compared using a linear model in attempters and non-attempters. Among the 384 SNPs selected from the Illumina Bead Chip, only rs2661319 in the RGS4 gene was significantly associated with suicide attempt (p = 0.002). There were 119 CpG SNPs in the aforementioned panel. The gene-wise potential methylation level of RGS4 was 55% in the attempters and 65% in the non-attempters, with a p-value of 0.005. The total level of potential methylation in the overall panel (119 SNPs combined) was not associated with suicide attempt. However, when considering the potential methylation at chromosome 1, we found that suicide attempt (p = 0.036) was associated with lower methylation. The
Deleu, Wim; Esteras, Cristina; Roig, Cristina; González-To, Mireia; Fernández-Silva, Iria; Gonzalez-Ibeas, Daniel; Blanca, José; Aranda, Miguel A; Arús, Pere; Nuez, Fernando; Monforte, Antonio J; Picó, Maria Belén; Garcia-Mas, Jordi
There are few genomic tools available in melon (Cucumis melo L.), a member of the Cucurbitaceae, despite its importance as a crop. Among these tools, genetic maps have been constructed mainly using marker types such as simple sequence repeats (SSR), restriction fragment length polymorphisms (RFLP) and amplified fragment length polymorphisms (AFLP) in different mapping populations. There is a growing need for saturating the genetic map with single nucleotide polymorphisms (SNP), more amenable for high throughput analysis, especially if these markers are located in gene coding regions, to provide functional markers. Expressed sequence tags (ESTs) from melon are available in public databases, and resequencing ESTs or validating SNPs detected in silico are excellent ways to discover SNPs. EST-based SNPs were discovered after resequencing ESTs between the parental lines of the PI 161375 (SC) x 'Piel de sapo' (PS) genetic map or using in silico SNP information from EST databases. In total 200 EST-based SNPs were mapped in the melon genetic map using a bin-mapping strategy, increasing the map density to 2.35 cM/marker. A subset of 45 SNPs was used to study variation in a panel of 48 melon accessions covering a wide range of the genetic diversity of the species. SNP analysis correctly reflected the genetic relationships compared with other marker systems, being able to distinguish all the accessions and cultivars. This is the first example of a genetic map in a cucurbit species that includes a major set of SNP markers discovered using ESTs. The PI 161375 x 'Piel de sapo' melon genetic map has around 700 markers, of which more than 500 are gene-based markers (SNP, RFLP and SSR). This genetic map will be a central tool for the construction of the melon physical map, the step prior to sequencing the complete genome. Using the set of SNP markers, it was possible to define the genetic relationships within a collection of forty-eight melon accessions as efficiently as with SSR
Thomas, A M; Cook, L J; Dean, J M; Olson, L M
To compare results from high probability matched sets versus imputed matched sets across differing levels of linkage information. A series of linkages with varying amounts of available information were performed on two simulated datasets derived from multiyear motor vehicle crash (MVC) and hospital databases, where true matches were known. Distributions of high probability and imputed matched sets were compared against the true match population for occupant age, MVC county, and MVC hour. Regression models were fit to simulated log hospital charges and hospitalization status. High probability and imputed matched sets were not significantly different from occupant age, MVC county, and MVC hour in high information settings (p > 0.999). In low information settings, high probability matched sets were significantly different from occupant age and MVC county (p sets were not (p > 0.493). High information settings saw no significant differences in inference of simulated log hospital charges and hospitalization status between the two methods. High probability and imputed matched sets were significantly different from the outcomes in low information settings; however, imputed matched sets were more robust. The level of information available to a linkage is an important consideration. High probability matched sets are suitable for high to moderate information settings and for situations involving case-specific analysis. Conversely, imputed matched sets are preferable for low information settings when conducting population-based analyses.
Zhou, Jin J; Ghazalpour, Anatole; Sobel, Eric M; Sinsheimer, Janet S; Lange, Kenneth
Although mapping quantitative traits in inbred strains is simpler than mapping the analogous traits in humans, classical inbred crosses suffer from reduced genetic diversity compared to experimental designs involving outbred animal populations. Multiple crosses, for example the Complex Trait Consortium's eight-way cross, circumvent these difficulties. However, complex mating schemes and systematic inbreeding raise substantial computational difficulties. Here we present a method for locally imputing the strain origins of each genotyped animal along its genome. Imputed origins then serve as mean effects in a multivariate Gaussian model for testing association between trait levels and local genomic variation. Imputation is a combinatorial process that assigns the maternal and paternal strain origin of each animal on the basis of observed genotypes and prior pedigree information. Without smoothing, imputation is likely to be ill-defined or jump erratically from one strain to another as an animal's genome is traversed. In practice, one expects to see long stretches where strain origins are invariant. Smoothing can be achieved by penalizing strain changes from one marker to the next. A dynamic programming algorithm then solves the strain imputation process in one quick pass through the genome of an animal. Imputation accuracy exceeds 99% in practical examples and leads to high-resolution mapping in simulated and real data. The previous fastest quantitative trait loci (QTL) mapping software for dense genome scans reduced compute times to hours. Our implementation further reduces compute times from hours to minutes with no loss in statistical power. Indeed, power is enhanced for full pedigree data.
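The smoothing idea described above, penalizing strain changes between adjacent markers and solving by dynamic programming, can be illustrated with a minimal sketch. It is a simplification, not the authors' implementation: it tracks a single strain label per marker rather than a maternal and paternal pair, and the names `smooth_strain_path`, `costs`, and `switch_penalty` are hypothetical.

```python
def smooth_strain_path(costs, switch_penalty):
    """Return the strain label per marker minimizing mismatch plus switch costs.

    costs[t][s] is an assumed mismatch cost for assigning strain s at marker t.
    switch_penalty (>= 0) is charged each time the label changes between markers.
    """
    n_states = len(costs[0])
    best = list(costs[0])   # best total cost of a path ending in each state
    back = []               # back-pointers, one list per marker after the first
    for row in costs[1:]:
        cheapest = min(range(n_states), key=best.__getitem__)
        new_best, pointers = [], []
        for s in range(n_states):
            stay = best[s]
            switch = best[cheapest] + switch_penalty
            if stay <= switch:
                new_best.append(stay + row[s])
                pointers.append(s)
            else:
                new_best.append(switch + row[s])
                pointers.append(cheapest)
        best = new_best
        back.append(pointers)
    # Trace the optimal path backwards from the cheapest final state.
    state = min(range(n_states), key=best.__getitem__)
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# One marker (index 2) disagrees with its neighbours.
toy_costs = [[0, 1], [0, 1], [1, 0], [0, 1], [0, 1]]
unsmoothed = smooth_strain_path(toy_costs, switch_penalty=0)
smoothed = smooth_strain_path(toy_costs, switch_penalty=2)
```

With no penalty the path follows the isolated disagreeing marker; with a penalty of 2 it stays on one strain throughout, which is the long-stretch behaviour the abstract describes. Because every switch costs the same, only the cheapest previous state needs to be considered, so the pass is linear in the number of markers.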
This paper addresses missing value imputation for the Internet of Things (IoT). The IoT is now widely used in a variety of domains, such as transportation and logistics and healthcare. However, missing values are very common in the IoT for a variety of reasons, which leaves the experimental data incomplete. As a result, some work that depends on IoT data cannot be carried out normally, and the accuracy and reliability of data analysis results are reduced. Based on the characteristics of the data and the features of missing data in the IoT, this paper divides the missing data into three types and defines three corresponding missing value imputation problems. We then propose three new models to solve the corresponding problems: a model of missing value imputation based on context and linear mean (MCL), a model of missing value imputation based on binary search (MBS), and a model of missing value imputation based on a Gaussian mixture model (MGI). Experimental results showed that the three models greatly and effectively improve the accuracy, reliability, and stability of missing value imputation.
Luo, Yuan; Szolovits, Peter; Dighe, Anand S; Baron, Jason M
A key challenge in clinical data mining is that most clinical datasets contain missing data. Since many commonly used machine learning algorithms require complete datasets (no missing data), clinical analytic approaches often entail an imputation procedure to "fill in" missing data. However, although most clinical datasets contain a temporal component, most commonly used imputation methods do not adequately accommodate longitudinal time-based data. We sought to develop a new imputation algorithm, 3-dimensional multiple imputation with chained equations (3D-MICE), that can perform accurate imputation of missing clinical time series data. We extracted clinical laboratory test results for 13 commonly measured analytes (clinical laboratory tests). We imputed missing test results for the 13 analytes using 3 imputation methods: multiple imputation with chained equations (MICE), Gaussian process (GP), and 3D-MICE. 3D-MICE utilizes both MICE and GP imputation to integrate cross-sectional and longitudinal information. To evaluate imputation method performance, we randomly masked selected test results and imputed these masked results alongside results missing from our original data. We compared predicted results to measured results for masked data points. 3D-MICE performed significantly better than MICE and GP-based imputation in a composite of all 13 analytes, predicting missing results with a normalized root-mean-square error of 0.342, compared to 0.373 for MICE alone and 0.358 for GP alone. 3D-MICE offers a novel and practical approach to imputing clinical laboratory time series data. 3D-MICE may provide an additional tool for use as a foundation in clinical predictive analytics and intelligent clinical decision support. © The Author 2017. Published by Oxford University Press on behalf of the American Medical Informatics Association. All rights reserved. For Permissions, please email: email@example.com
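The evaluation described above compares imputed values with measured values at masked positions and summarizes the error as a normalized root-mean-square error. A minimal sketch of such a metric follows. The abstract does not state the normalization used, so dividing by the standard deviation of the measured values is an assumption, and the function name `nrmse` is hypothetical.

```python
from math import sqrt
from statistics import pstdev


def nrmse(predicted, measured):
    """RMSE between predicted and measured values, divided by the
    population standard deviation of the measured values (assumed normalizer)."""
    if not measured or len(predicted) != len(measured):
        raise ValueError("inputs must be non-empty and of equal length")
    squared_error = sum((p - m) ** 2 for p, m in zip(predicted, measured))
    rmse = sqrt(squared_error / len(measured))
    spread = pstdev(measured)
    if spread == 0:
        raise ValueError("measured values have zero spread")
    return rmse / spread


# Toy values: predicting the mean everywhere gives an error of one spread unit.
measured = [1.0, 2.0, 3.0, 4.0, 5.0]
baseline = nrmse([3.0] * 5, measured)
```

Under this normalization, a value below 1 means the imputation does better than filling every masked entry with the mean of the measured values.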
Foster Jeffrey T
Abstract Background Brucellosis is a worldwide disease of mammals caused by Alphaproteobacteria in the genus Brucella. The genus is genetically monomorphic, requiring extensive genotyping to differentiate isolates. We utilized two different genotyping strategies to characterize isolates. First, we developed a microarray-based assay based on 1000 single nucleotide polymorphisms (SNPs) that were identified from whole genome comparisons of two B. abortus isolates, one B. melitensis, and one B. suis. We then genotyped a diverse collection of 85 Brucella strains at these SNP loci and generated a phylogenetic tree of relationships. Second, we developed a selective primer-extension assay system using capillary electrophoresis that targeted 17 high value SNPs across 8 major branches of the phylogeny and determined their genotypes in a large collection (n = 340) of diverse isolates. Results Our 1000 SNP microarray readily distinguished B. abortus, B. melitensis, and B. suis, differentiating B. melitensis and B. suis into two clades each. Brucella abortus was divided into four major clades. Our capillary-based SNP genotyping confirmed all major branches from the microarray assay and assigned all samples to defined lineages. Isolates from these lineages and closely related isolates, among the most commonly encountered lineages worldwide, can now be quickly and easily identified and genetically characterized. Conclusions We have identified clade-specific SNPs in Brucella that can be used for rapid assignment into major groups below the species level in the three main Brucella species. Our assays represent SNP genotyping approaches that can reliably determine the evolutionary relationships of bacterial isolates without the need for whole genome sequencing of all isolates.
Shaw Gary M
Abstract Background Folic acid taken in early pregnancy reduces risks for delivering offspring with several congenital anomalies. The mechanism by which folic acid reduces risk is unknown. Investigations into genetic variation that influences transport and metabolism of folate will help fill this data gap. We focused on 118 SNPs involved in folate transport and metabolism. Methods Using data from a California population-based registry, we investigated whether risks of spina bifida or conotruncal heart defects were influenced by 118 single nucleotide polymorphisms (SNPs) associated with the complex folate pathway. This case-control study included 259 infants with spina bifida and a random sample of 359 nonmalformed control infants born during 1983–86 or 1994–95. It also included 214 infants with conotruncal heart defects born during 1983–86. Infant genotyping was performed blinded to case or control status using a designed SNPlex assay. We examined single SNP effects for each of the 118 SNPs, as well as haplotypes, for each of the two outcomes. Results Few odds ratios (ORs) revealed sizable departures from 1.0. With respect to spina bifida, we observed ORs with 95% confidence intervals that did not include 1.0 for the following SNPs (heterozygous or homozygous) relative to the reference genotype: BHMT (rs3733890) OR = 1.8 (1.1–3.1); CBS (rs2851391) OR = 2.0 (1.2–3.1); CBS (rs234713) OR = 2.9 (1.3–6.7); MTHFD1 (rs2236224) OR = 1.7 (1.1–2.7); MTHFD1 (hcv11462908) OR = 0.2 (0–0.9); MTHFD2 (rs702465) OR = 0.6 (0.4–0.9); MTHFD2 (rs7571842) OR = 0.6 (0.4–0.9); MTHFR (rs1801133) OR = 2.0 (1.2–3.1); MTRR (rs162036) OR = 3.0 (1.5–5.9); MTRR (rs10380) OR = 3.4 (1.6–7.1); MTRR (rs1801394) OR = 0.7 (0.5–0.9); MTRR (rs9332) OR = 2.7 (1.3–5.3); TYMS (rs2847149) OR = 2.2 (1.4–3.5); TYMS (rs1001761) OR = 2.4 (1.5–3.8); and TYMS (rs502396) OR = 2.1 (1.3–3.3). However, multiple SNPs observed for a given gene showed evidence of linkage disequilibrium indicating
Toghiani, S; Aggrey, S E; Rekaya, R
Availability of high-density single nucleotide polymorphism (SNP) genotyping platforms has provided unprecedented opportunities to enhance breeding programmes in livestock, poultry and plant species, and to better understand the genetic basis of complex traits. Using this genomic information, genomic breeding values (GEBVs) can be estimated, which are more accurate than conventional breeding values. The superiority of genomic selection is possible only when high-density SNP panels are used to track genes and QTLs affecting the trait. Unfortunately, even with the continuous decrease in genotyping costs, only a small fraction of the population has been genotyped with these high-density panels. It is often the case that a larger portion of the population is genotyped with low-density and low-cost SNP panels and then imputed to a higher density. Accuracy of SNP genotype imputation tends to be high when minimum requirements are met. Nevertheless, a certain rate of genotype imputation errors is unavoidable. Thus, it is reasonable to assume that the accuracy of GEBVs will be affected by imputation errors, especially their cumulative effects over time. To evaluate the impact of multi-generational selection on the accuracy of SNP genotype imputation and the reliability of resulting GEBVs, a simulation was carried out under varying updating of the reference population, distance between the reference and testing sets, and the approach used for the estimation of GEBVs. Using fixed reference populations, imputation accuracy decayed by about 0.5% per generation. In fact, after 25 generations, the accuracy was only 7% lower than the first generation. When the reference population was updated by either 1% or 5% of the top animals in the previous generations, decay of imputation accuracy was substantially reduced. These results indicate that low-density panels are useful, especially when the generational interval between reference and testing population is small. As the generational interval
Badke, Yvonne M; Bates, Ronald O; Ernst, Catherine W; Fix, Justin; Steibel, Juan P
Genomic selection has the potential to increase genetic progress. Genotype imputation of high-density single-nucleotide polymorphism (SNP) genotypes can improve the cost efficiency of genomic breeding value (GEBV) prediction for pig breeding. Consequently, the objectives of this work were to: (1) estimate accuracy of genomic evaluation and GEBV for three traits in a Yorkshire population and (2) quantify the loss of accuracy of genomic evaluation and GEBV when genotypes were imputed under two scenarios: a high-cost, high-accuracy scenario in which only selection candidates were imputed from a low-density platform and a low-cost, low-accuracy scenario in which all animals were imputed using a small reference panel of haplotypes. Phenotypes and genotypes obtained with the PorcineSNP60 BeadChip were available for 983 Yorkshire boars. Genotypes of selection candidates were masked and imputed using tagSNP in the GeneSeek Genomic Profiler (10K). Imputation was performed with BEAGLE using 128 or 1800 haplotypes as reference panels. GEBV were obtained through an animal-centric ridge regression model using de-regressed breeding values as response variables. Accuracy of genomic evaluation was estimated as the correlation between estimated breeding values and GEBV in a 10-fold cross validation design. Accuracy of genomic evaluation using observed genotypes was high for all traits (0.65-0.68). Using genotypes imputed from a large reference panel (accuracy: R² = 0.95) for genomic evaluation did not significantly decrease accuracy, whereas a scenario with genotypes imputed from a small reference panel (R² = 0.88) did show a significant decrease in accuracy. Genomic evaluation based on imputed genotypes in selection candidates can be implemented at a fraction of the cost of a genomic evaluation using observed genotypes and still yield virtually the same accuracy. On the other hand, using a very small reference panel of haplotypes to impute training animals and candidates for
de Vries, Paul S; Sabater-Lleal, Maria; Chasman, Daniel I
of independent statistical tests using HapMap imputation, and 1000G imputation may lead to further independent tests that should be corrected for. When using a stricter Bonferroni correction for the 1000G GWA study (P-value
Grilo, Antonio; Ruiz-Granados, Elena S.; Moreno-Rey, Concha; Rivera, Jose M.; Ruiz, Agustin; Real, Luis M.; Sáez, Maria E.
Obstructive sleep apnea (OSA) is a common disorder characterized by the reduction or complete cessation in airflow resulting from an obstruction of the upper airway. Several studies have observed an increased risk for cardiovascular morbidity and mortality among OSA patients. Metabolic syndrome (MetS), a cluster of cardiovascular risk factors characterized by the presence of insulin resistance, is often found in patients with OSA, but the complex interplay between these two syndromes is not well understood. In this study, we present the results of a genetic association analysis of 373 candidate SNPs for MetS selected in a previous genome wide association analysis (GWAS). The 384 selected SNPs were genotyped using the Illumina VeraCode Technology in 387 subjects retrospectively assessed at the Internal Medicine Unit of the “Virgen de Valme” University Hospital (Seville, Spain). In order to increase the power of this study and to validate our findings in an independent population, we used data from the Framingham Sleep study, which comprises 368 individuals. Only the rs11211631 polymorphism was associated with OSA in both populations, with an estimated OR = 0.57 (0.42-0.79) in the joint analysis (p = 7.21 × 10⁻⁴). This SNP was selected in the previous GWAS for MetS components using a digenic approach, but was not significant in the monogenic study. We have also identified two SNPs (rs2687855 and rs4299396) with a protective effect from OSA only in the abdominal obese subpopulation. As a whole, our study does not support that OSA and MetS share major genetic determinants, although both syndromes share common epidemiological and clinical features. PMID:23524009
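The abstract above reports an odds ratio with an interval. For readers unfamiliar with how such a pair can arise, here is a minimal sketch of an odds ratio with a Wald-type (Woolf) 95% confidence interval from a 2×2 table. This is a simpler technique than a joint analysis across two populations and is not claimed to be the study's method; the function name `odds_ratio_ci` and the counts are hypothetical.

```python
from math import exp, log, sqrt


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and Wald (Woolf) confidence interval from a 2x2 table.

    a: carriers among cases       b: non-carriers among cases
    c: carriers among controls    d: non-carriers among controls
    z: normal quantile (1.96 gives an approximate 95% interval).
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cell counts must be positive")
    odds_ratio = (a * d) / (b * c)
    se_log_or = sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # standard error of log(OR)
    lower = exp(log(odds_ratio) - z * se_log_or)
    upper = exp(log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Made-up counts purely to show the call shape.
estimate, low, high = odds_ratio_ci(20, 80, 10, 90)
```

An interval that excludes 1.0 corresponds to a nominally significant association at the chosen level, which is the criterion several abstracts in this listing use when reporting SNP effects.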
Srivastava, Apurva; Mittal, Balraj; Prakash, Jai; Srivastava, Pranjal; Srivastava, Nimisha; Srivastava, Neena
The aim of the study was to investigate the association of 55 SNPs in 28 genes with obesity risk in a North Indian population using a multianalytical approach. Overall, 480 subjects from the North Indian population were studied using strict inclusion/exclusion criteria. SNP Genotyping was carried out by Sequenom Mass ARRAY platform (Sequenom, San Diego, CA) and validated Taqman ® allelic discrimination (Applied Biosystems ® ). Statistical analyses were performed using SPSS software version 19.0, SNPStats, GMDR software (version 6) and GENEMANIA. Logistic regression analysis of 55 SNPs revealed significant associations (P obesity risk whereas the remaining 6 SNPs revealed no association (P > .05). The pathway-wise G-score revealed the significant role (P = .0001) of food intake-energy expenditure pathway genes. In CART analysis, the combined genotypes of FTO rs9939609 and TCF7L2 rs7903146 revealed the highest risk for BMI linked obesity. The analysis of the FTO-IRX3 locus revealed high LD and high order gene-gene interactions for BMI linked obesity. The interaction network of all of the associated genes in the present study generated by GENEMANIA revealed direct and indirect connections. In addition, the analysis with centralized obesity revealed that none of the SNPs except for FTO rs17818902 were significantly associated (P obesity risk in the North Indian population. © 2016 Wiley Periodicals, Inc.
Efforts are made to develop specimen processing technologies for modifying and enabling various kinds of specimens to automatically undergo SNP (single nucleotide polymorphism) analysis for medicine development and clinical diagnostic activities and to develop technologies and apparatuses to enable rapid, inexpensive, and simple search and analysis of SNPs using DNA (deoxyribonucleic acid) chips and mass spectrometry. Activities are conducted in the four fields involving (1) the development of a practical clinical system for rapid detection and analysis of SNPs, (2) research and development of an SNP scoring system using bar-coded oligonucleotides and magnetic beads, (3) research and development of a high-speed SNP analysis system using a mass spectrometer, and (4) the development of a high throughput SNP analysis line. Efforts exerted in field (1) involve a protein fixation method using plasma polymerization and its application to DNA arrays, development of an SNP detection method using human genomes, construction of a rapid DNA detection device using an electric field, development of an SNP analysis system using the solid phase HPA (hybridization protection assay) method, and SNP analysis using solid phase ligation. (NEDO)
Highlights from the 15th International Congress of Twin Studies/Twin Research: Differentiating MZ Co-twins Via SNPs; Mistaken Infant Twin-Singleton Hospital Registration; Narcolepsy With Cataplexy; Hearing Loss and Language Learning/Media Mentions: Broadway Musical Recalls Conjoined Hilton Twins; High Fashion Pair; Twins Turn 102; Insights From a Conjoined Twin Survivor.
Segal, Nancy L
Highlights from the 15th International Congress of Twin Studies are presented. The congress was held November 16-19, 2014 in Budapest, Hungary. This report is followed by summaries of research addressing the differentiation of MZ co-twins by single nucleotide polymorphisms (SNPs), an unusual error in infant twin-singleton hospital registration, twins with childhood-onset narcolepsy with cataplexy, and the parenting effects of hearing loss in one co-twin. Media interest in twins covers a new Broadway musical based on the conjoined twins Violet and Daisy Hilton, male twins becoming famous in fashion, twins who turned 102 and unique insights from a conjoined twin survivor. This article is dedicated to the memory of Elizabeth (Liz) Hamel, DZA twin who met her co-twin for the first time at age seventy-eight years. Liz and her co-twin, Ann Hunt, are listed in the 2015 Guinness Book of Records as the longest separated twins in the world.
Eastell, T.; Hinks, A.; Thomson, W.
Objective A region on the short arm of the X-chromosome, Xp11, has previously been linked to childhood-onset polyarthritis. Mapping to the linked region is FOXP3, a transcription factor that regulates regulatory T cell (Treg) development and function. The objective of this study was to determine whether single nucleotide polymorphisms (SNPs) in the FOXP3 gene region contribute to JIA susceptibility. Method Nine FOXP3 SNPs were genotyped in 761 JIA cases and 402 controls using the Sequenom® MassARRAY® system. Association was measured using either χ2 or Fisher's exact test at the allelic and genotypic level. Furthermore, cases and controls were stratified by gender and association measured for each stratum. Results None of the SNPs showed an association with JIA. Similarly, the lack of association was also evident in both the female and male cohorts. Conclusion Although FOXP3 presents itself as a good candidate for contributing to JIA susceptibility, this study, which was powered to detect associations with genotypic relative risk >2 in the female cohort, has failed to find an association between SNPs in the FOXP3 gene region and JIA. PMID:17526924
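The allelic χ2 test named in the abstract above can be sketched as follows. This is a minimal illustration rather than the study's actual analysis code: the function names are hypothetical, the counts in the usage note are invented, and the genotype-to-allele conversion assumes an autosomal diploid marker (for an X-linked gene such as FOXP3, each male would contribute one allele instead of two).

```python
import math


def allele_counts(n_aa, n_ab, n_bb):
    """Convert diploid genotype counts (AA, AB, BB) into allele counts (A, B)."""
    return 2 * n_aa + n_ab, 2 * n_bb + n_ab


def allelic_chi_square(case_a, case_b, ctrl_a, ctrl_b):
    """Pearson chi-square test (1 df, no continuity correction) on a 2x2 allele table.

    Rows are cases and controls; columns are counts of allele A and allele B.
    Returns (chi2, p_value).
    """
    table = [[case_a, case_b], [ctrl_a, ctrl_b]]
    row_totals = [sum(row) for row in table]
    col_totals = [case_a + ctrl_a, case_b + ctrl_b]
    n = sum(row_totals)
    if 0 in row_totals or 0 in col_totals:
        raise ValueError("every row and column needs at least one allele")

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (table[i][j] - expected) ** 2 / expected

    # For 1 degree of freedom, P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value
```

For example, `allelic_chi_square(*allele_counts(30, 50, 20), *allele_counts(20, 50, 30))` compares invented case and control genotype counts. When any expected cell count is small, Fisher's exact test, which the abstract also mentions, is the usual alternative.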
Brøndum, Rasmus Froberg; Guldbrandtsen, Bernt; Sahana, Goutam
autosome 29 using 387,436 bi-allelic variants and 13,612 SNP markers from the bovine HD panel. Results A combined breed reference population led to higher imputation accuracies than did a single breed reference. The highest accuracy of imputation for all three test breeds was achieved when using BEAGLE...... with un-phased reference data (mean genotype correlations of 0.90, 0.89 and 0.87 for Holstein, Jersey and Nordic Red respectively) but IMPUTE2 with un-phased reference data gave similar accuracies for Holsteins and Nordic Red. Pre-phasing the reference data only led to a minor decrease in the imputation...
Background Complex traits like cancer, diabetes, obesity or schizophrenia arise from an intricate interaction between genetic and environmental factors. Complex disorders often cluster in families without a clear-cut pattern of inheritance. Genome-wide association studies focus on the detection of tens or hundreds of individual markers contributing to complex diseases. New techniques are needed to test whether a subset of single nucleotide polymorphisms (SNPs) from candidate genes is associated with a condition of interest in a particular individual or group of people. High-resolution melting (HRM) analysis is a new method in which polymerase chain reaction (PCR) and mutation scanning are carried out simultaneously in a closed tube, making the procedure fast, inexpensive and easy. Preterm birth (PTB) is considered a complex disease, in which genetic and environmental factors interact to bring about the delivery of a newborn before 37 weeks of gestation. It is accepted that inflammation plays an important role in pregnancy and PTB. Methods Here, we used real-time PCR followed by HRM analysis to simultaneously identify several gene variations involved in inflammatory pathways in preterm labor. SNPs from the TLR4, IL6, IL1 beta and IL12RB genes were analyzed in a case-control study. The results were confirmed either by sequencing or by PCR followed by restriction fragment length polymorphism analysis. Results We were able to simultaneously recognize the variations of four genes with accuracy similar to that of other methods. To obtain non-overlapping melting temperatures, the key step in this strategy was primer design. Genotypic frequencies found for each SNP are in concordance with those previously described in similar populations. None of the studied SNPs were associated with PTB. Conclusions Several gene variations related to the same inflammatory pathway were screened through a new flexible, fast and inexpensive method with the purpose of analyzing
Deng, Xutao; Sabino, Ester C; Cunha-Neto, Edecio; Ribeiro, Antonio L; Ianni, Barbara; Mady, Charles; Busch, Michael P; Seielstad, Mark
Familial aggregation of Chagas cardiac disease in T. cruzi-infected persons suggests that human genetic variation may be an important determinant of disease progression. The objective was to perform a GWAS using a well-characterized cohort to detect single nucleotide polymorphisms (SNPs) and genes associated with cardiac outcomes. A retrospective cohort study was developed by the NHLBI REDS-II program in Brazil. Samples were collected from 499 T. cruzi seropositive blood donors who had donated between 1996 and 2002, and 101 patients with clinically diagnosed Chagas cardiomyopathy. In 2008-2010, all subjects underwent a complete medical examination. After genotype calling, quality control filtering with exclusion of 20 cases, and imputation of 1000 Genomes variants, association analysis was performed for 7 cardiac and parasite-related traits, adjusting for population stratification. The cohort showed a wide range of African, European, and modest Native American admixture proportions, consistent with the recent history of Brazil. No SNPs were found to be highly (P < 10^-8) associated with cardiomyopathy. The two most highly associated SNPs for cardiomyopathy (rs4149018 and rs12582717; P-values < 10^-6) are located on chromosome 12p12.2 in the SLCO1B1 gene, a solute carrier family member. We identified 44 additional genic SNPs associated with six traits at P-value < 10^-6: ejection fraction, PR, QRS, QT intervals, antibody levels by EIA, and parasitemia by PCR. This GWAS identified suggestive SNPs that may impact the risk of progression to cardiomyopathy. Although this Chagas cohort is the largest examined by GWAS to date (580 subjects), its moderate sample size may explain in part the limited number of significant SNP variants. Enlarging the current sample through expanded cohorts and meta-analyses, and targeted studies of candidate genes, will be required to confirm and extend the results reported here. Future studies should also include exposed seronegative controls to investigate
Sanchez Sanchez, Juan Jose; Børsting, Claus; Morling, Niels
We describe a method for the simultaneous typing of Y-chromosome single nucleotide polymorphism (SNP) markers by means of multiplex polymerase chain reaction (PCR) strategies that allow the detection of 35 Y chromosome SNPs on 25 amplicons from 100 to 200 pg of chromosomal deoxyribonucleic acid...... (DNA). Multiplex PCR amplification of the DNA was performed with slight modifications of standard PCR conditions. Single-base extension (SBE) was performed using the SNaPshot kit containing fluorescently labeled ddNTPs. The extended primers were detected on an ABI 3100 sequencer. The most important...... factors for the creation of larger SNP typing PCR multiplexes include careful selection of primers for the primary amplification and the SBE reaction, use of DNA primers with homogenous composition, and balancing the primer concentrations for both the amplification and the SBE reactions....
Marjan, Mojtabavi Naeini; Hamzeh, Mesrian Tanha; Rahman, Emamzadeh; Sadeq, Vallian
Aspirin (ASA) is a commonly used nonsteroidal anti-inflammatory drug (NSAID), which exerts its therapeutic effects through inhibition of cyclooxygenase (COX) isoform 2 (COX-2), while the inhibition of COX-1 by ASA leads to apparent side effects. In the present study, the relationship between COX-1 non-synonymous single nucleotide polymorphisms (nsSNPs) and aspirin-related side effects was investigated. The functional impacts of 37 nsSNPs on the aspirin inhibition potency of COX-1 were analyzed computationally by COX-1/aspirin molecular docking, and each SNP was scored based on the DOCK Amber score. The data predicted that 22 nsSNPs could reduce COX-1 inhibition, while 15 nsSNPs showed an increased inhibition level in comparison to the regular COX-1 protein. For comparison, the Amber scores for two Arg119 mutants (R119A and R119Q) were also calculated. Among the nsSNP variants, rs117122585 had the Amber score closest to that of the R119A mutant. A separate docking computation validated the score and revealed a new binding position for ASA in which the acetyl group was located 3.86Å from the Ser529 OH group. This could predict an associated loss of activity of ASA through this nsSNP variant. Our data represent a computational sub-population pattern for aspirin COX-1 related side effects, and provide a basis for further research on COX-1/ASA interaction. Copyright © 2014 Elsevier Ltd. All rights reserved.
PPARD is involved in multiple biological processes, especially those associated with energy metabolism. PPARD regulates lipid metabolism by up-regulating the expression of genes associated with adipogenesis. This makes PPARD a significant candidate gene for production traits of livestock animals. Association studies between PPARD polymorphisms and production traits have been reported in pigs but are limited for other animals, including cattle. Here, we investigated the expression profile and polymorphisms of bovine PPARD as well as their association with growth traits in Chinese cattle. Our results showed that the highest expression of PPARD was detected in kidney, followed by adipose tissue, which is consistent with its involvement in energy metabolism. Three SNPs of PPARD were detected, and Hardy–Weinberg equilibrium analysis (P < 0.05) indicated that they had been subject to selection pressure. Moreover, all of these SNPs showed moderate diversity (0.25 < PIC < 0.5), indicating their relatively high selection potential. Association analysis suggested that individuals with the GAAGTT combined genotype of the three detected SNPs showed optimal values in all of the growth traits analyzed. These results revealed that the GAAGTT combined genotype of the three SNPs detected in the bovine PPARD gene was a significant potential genetic marker for marker-assisted selection in Chinese cattle. However, this should be further verified in larger populations before being applied to breeding.
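The two marker statistics cited in the abstract above, the Hardy–Weinberg equilibrium test and the polymorphism information content (PIC), can be sketched for a biallelic SNP as follows. This is an illustrative sketch, not the authors' code: the function names are hypothetical, the HWE test shown is the 1-df chi-square goodness-of-fit version (the abstract does not say which test was used), and PIC uses the standard formula 1 − Σp_i² − Σ_{i<j} 2p_i²p_j², which for two alleles reduces to 1 − p² − q² − 2p²q².

```python
import math


def allele_freq(n_aa, n_ab, n_bb):
    """Frequency of allele A from genotype counts (AA, AB, BB)."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("at least one genotyped individual is required")
    return (2 * n_aa + n_ab) / (2 * n)


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square goodness-of-fit test (1 df) against Hardy-Weinberg proportions.

    Returns (chi2, p_value); a small p-value indicates departure from HWE.
    """
    n = n_aa + n_ab + n_bb
    p = allele_freq(n_aa, n_ab, n_bb)
    q = 1.0 - p
    observed = [n_aa, n_ab, n_bb]
    expected = [n * p * p, 2 * n * p * q, n * q * q]
    # Skip classes with zero expectation (monomorphic markers).
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


def pic_biallelic(p):
    """Polymorphism information content for a biallelic marker with allele frequency p."""
    q = 1.0 - p
    return 1.0 - p * p - q * q - 2.0 * p * p * q * q
```

For a biallelic marker PIC peaks at 0.375 when both alleles have frequency 0.5, so a biallelic SNP can reach the 0.25 < PIC < 0.5 band reported above but never exceed it.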
Marina Barreiros Virmond
Forensic DNA phenotyping is a promising approach for filling gaps in the search for unknown persons in criminal investigations and in the identification of disaster victims and missing persons. This methodology allows individual prediction of externally visible characteristics (EVCs) from analyses of phenotype-informative SNPs. Among these SNPs, the best described are those related to pigmentation traits, such as eye, skin and hair color. Studies have demonstrated the high predictive power of these EVCs, with satisfactory results for predicting brown and blue iris color and red hair, whereas further research is still needed to predict the remaining phenotypes accurately. Although very promising, the practical application of forensic DNA phenotyping raises several ethical and legal questions. In Brazil, advances are still needed, since the Brazilian population is heterogeneous and most of the markers described relate to European populations. In this regard, Brazil already has the Rede Integrada de Bancos de Perfis Genéticos (RIBPG, the Integrated Network of Genetic Profile Databases), which aims to share and compare genetic profiles among the country's databases. In the near future this methodology will be ready to be integrated into forensic routines, with great applicability and reliability.
Riestra, Pia; Gebreab, Samson Y; Xu, Ruihua; Khan, Rumana J; Gaye, Amadou; Correa, Adolfo; Min, Nancy; Sims, Mario; Davis, Sharon K
Circadian rhythms regulate key biological processes, and dysregulation of the intrinsic clock mechanism affects sleep patterns and obesity onset. The CLOCK (circadian locomotor output cycles protein kaput) gene encodes a core transcription factor of the molecular circadian clock influencing diverse metabolic pathways, including glucose and lipid homeostasis. The primary objective of this study was to evaluate the associations between CLOCK single nucleotide polymorphisms (SNPs) and body mass index (BMI). We also evaluated the association of SNPs with BMI-related factors such as sleep duration and quality, adiponectin and leptin, in 2962 participants (1116 men and 1810 women) from the Jackson Heart Study. Genotype data for the 23 selected CLOCK gene SNPs were obtained by imputation with IMPUTE2 software and reference phase data from the 1000 Genomes Project. Genetic analyses were conducted with PLINK. RESULTS: We found a significant association between the CLOCK SNP rs2070062 and sleep duration: carriers of the T allele showed significantly shorter sleep duration than non-carriers after adjustment for individual proportions of European ancestry (PEA), socioeconomic status (SES), body mass index (BMI), alcohol consumption and smoking status, and the association reached the significance threshold after multiple testing correction. In addition, we found nominal associations of the CLOCK SNP rs6853192 with longer sleep duration and of rs6820823, rs3792603 and rs11726609 with BMI. However, these associations did not reach the significance threshold after correction for multiple testing. In this work, CLOCK gene variants were associated with sleep duration and BMI, suggesting that the effects of these polymorphisms on circadian rhythmicity may affect sleep duration and body weight regulation in African Americans.
Sullivan, Thomas R; Salter, Amy B; Ryan, Philip; Lee, Katherine J
Multiple imputation (MI) is increasingly being used to handle missing data in epidemiologic research. When data on both the exposure and the outcome are missing, an alternative to standard MI is the "multiple imputation, then deletion" (MID) method, which involves deleting imputed outcomes prior to analysis. While MID has been shown to provide efficiency gains over standard MI when analysis and imputation models are the same, the performance of MID in the presence of auxiliary variables for the incomplete outcome is not well understood. Using simulated data, we evaluated the performance of standard MI and MID in regression settings where data were missing on both the outcome and the exposure and where an auxiliary variable associated with the incomplete outcome was included in the imputation model. When the auxiliary variable was unrelated to missingness in the outcome, both standard MI and MID produced negligible bias when estimating regression parameters, with standard MI being more efficient in most settings. However, when the auxiliary variable was also associated with missingness in the outcome, alarmingly MID produced markedly biased parameter estimates. On the basis of these results, we recommend that researchers use standard MI rather than MID in the presence of auxiliary variables associated with an incomplete outcome. © The Author 2015. Published by Oxford University Press on behalf of the Johns Hopkins Bloomberg School of Public Health. All rights reserved.
Enders, Craig K
The last 20 years have seen an uptick in research on missing data problems, and most software applications now implement one or more sophisticated missing data handling routines (e.g., multiple imputation or maximum likelihood estimation). Despite their superior statistical properties (e.g., less stringent assumptions, greater accuracy and power), the adoption of these modern analytic approaches is not uniform in psychology and related disciplines. Thus, the primary goal of this manuscript is to describe and illustrate the application of multiple imputation. Although maximum likelihood estimation is perhaps the easiest method to use in practice, psychological data sets often feature complexities that are currently difficult to handle appropriately in the likelihood framework (e.g., mixtures of categorical and continuous variables), but relatively simple to treat with imputation. The paper describes a number of practical issues that clinical researchers are likely to encounter when applying multiple imputation, including mixtures of categorical and continuous variables, item-level missing data in questionnaires, significance testing, interaction effects, and multilevel missing data. Analysis examples illustrate imputation with software packages that are freely available on the internet. Copyright © 2016 Elsevier Ltd. All rights reserved.
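Significance testing after multiple imputation requires pooling the estimates obtained from each completed data set. The abstract above does not name a pooling method; the sketch below assumes Rubin's rules, the standard choice, and uses a hypothetical function name and invented numbers.

```python
import math
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool one parameter across m imputed data sets using Rubin's rules.

    estimates: the m point estimates, one per completed data set.
    variances: the m squared standard errors of those estimates.
    Returns (pooled_estimate, pooled_standard_error).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need matching lists from at least two imputations")

    pooled = mean(estimates)
    within = mean(variances)       # average sampling variance
    between = variance(estimates)  # sample variance across imputations (divisor m - 1)
    total = within + (1.0 + 1.0 / m) * between
    return pooled, math.sqrt(total)
```

For instance, `pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])` pools three imputed-data estimates. The between-imputation term is what inflates the standard error to reflect uncertainty about the missing values, which a single imputation would ignore.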
Durham, Timothy J; Libbrecht, Maxwell W; Howbert, J Jeffry; Bilmes, Jeff; Noble, William Stafford
The Encyclopedia of DNA Elements (ENCODE) and the Roadmap Epigenomics Project seek to characterize the epigenome in diverse cell types using assays that identify, for example, genomic regions with modified histones or accessible chromatin. These efforts have produced thousands of datasets but cannot possibly measure each epigenomic factor in all cell types. To address this, we present a method, PaRallel Epigenomics Data Imputation with Cloud-based Tensor Decomposition (PREDICTD), to computationally impute missing experiments. PREDICTD leverages an elegant model called "tensor decomposition" to impute many experiments simultaneously. Compared with the current state-of-the-art method, ChromImpute, PREDICTD produces lower overall mean squared error, and combining the two methods yields further improvement. We show that PREDICTD data captures enhancer activity at noncoding human accelerated regions. PREDICTD provides reference imputed data and open-source software for investigating new cell types, and demonstrates the utility of tensor decomposition and cloud computing, both promising technologies for bioinformatics.
Voillet, Valentin; Besse, Philippe; Liaubet, Laurence; San Cristobal, Magali; González, Ignacio
In omics data integration studies, it is common, for a variety of reasons, for some individuals to not be present in all data tables. Missing row values are challenging to deal with because most statistical methods cannot be directly applied to incomplete datasets. To overcome this issue, we propose a multiple imputation (MI) approach in a multivariate framework. In this study, we focus on multiple factor analysis (MFA) as a tool to compare and integrate multiple layers of information. MI involves filling the missing rows with plausible values, resulting in M completed datasets. MFA is then applied to each completed dataset to produce M different configurations (the matrices of coordinates of individuals). Finally, the M configurations are combined to yield a single consensus solution. We assessed the performance of our method, named MI-MFA, on two real omics datasets. Incomplete artificial datasets with different patterns of missingness were created from these data. The MI-MFA results were compared with two other approaches i.e., regularized iterative MFA (RI-MFA) and mean variable imputation (MVI-MFA). For each configuration resulting from these three strategies, the suitability of the solution was determined against the true MFA configuration obtained from the original data and a comprehensive graphical comparison showing how the MI-, RI- or MVI-MFA configurations diverge from the true configuration was produced. Two approaches i.e., confidence ellipses and convex hulls, to visualize and assess the uncertainty due to missing values were also described. We showed how the areas of ellipses and convex hulls increased with the number of missing individuals. A free and easy-to-use code was proposed to implement the MI-MFA method in the R statistical environment. We believe that MI-MFA provides a useful and attractive method for estimating the coordinates of individuals on the first MFA components despite missing rows. MI-MFA configurations were close to the true
Genome-wide association studies (GWAS) with hundreds of thousands of single nucleotide polymorphisms (SNPs) are popular strategies to reveal the genetic basis of human complex diseases. Despite many successes of GWAS, it is well recognized that new analytical approaches have to be integrated to achieve their full potential. Starting with a list of SNPs found to be associated with disease in GWAS, here we propose a novel methodology to devise functionally important KEGG pathways through the identification of genes within these pathways, where these genes are obtained from SNP analysis. Our methodology is based on functionalization of important SNPs to identify affected genes and disease-related pathways. We have tested our methodology on the WTCCC Rheumatoid Arthritis (RA) dataset and identified: (i) previously known RA-related KEGG pathways (e.g., Toll-like receptor signaling, Jak-STAT signaling, Antigen processing, Leukocyte transendothelial migration and MAPK signaling pathways); (ii) additional KEGG pathways (e.g., Pathways in cancer, Neurotrophin signaling, Chemokine signaling pathways) as associated with RA. Furthermore, these newly found pathways included genes which are targets of RA-specific drugs. Even though GWAS analysis identifies 14 out of 83 of those drug target genes, the newly found functionally important KEGG pathways led to the discovery of 25 out of 83 genes known to be used as drug targets for the treatment of RA. Among the previously known pathways, we identified additional genes associated with RA (e.g. Antigen processing and presentation, Tight junction). Importantly, within these pathways, the associations between some of these additionally found genes, such as HLA-C, HLA-G, PRKCQ, PRKCZ, TAP1, TAP2, and RA were verified by either the OMIM database or literature retrieved from the NCBI PubMed module. With whole-genome sequencing on the horizon, we show that the full potential of GWAS can be achieved by integrating pathway and network
Federico C F Calboli
Neuroticism is a moderately heritable personality trait considered to be a risk factor for developing major depression, anxiety disorders and dementia. We performed a genome-wide association study in 2,235 participants drawn from a population-based study of neuroticism, making this the largest association study for neuroticism to date. Neuroticism was measured by the Eysenck Personality Questionnaire. After quality control, we analysed 430,000 autosomal SNPs together with an additional 1.2 million SNPs imputed with high quality from the HapMap CEU samples. We found a very small effect of population stratification, corrected using one principal component, and some cryptic kinship that required no correction. NKAIN2 showed suggestive evidence of association with neuroticism as a main effect (p < 10^-6) and GPC6 showed suggestive evidence for interaction with age (p ≈ 10^-7). We found support for one previously reported association (PDE4D), but failed to replicate other recent reports. These results suggest common SNP variation does not strongly influence neuroticism. Our study was powered to detect almost all SNPs explaining at least 2% of heritability, and so our results effectively exclude the existence of loci having a major effect on neuroticism.
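Palomba, Grazia; Loi, Angela; Porcu, Eleonora; Cossu, Antonio; Zara, Ilenia; Budroni, Mario; Dei, Mariano; Lai, Sandra; Mulas, Antonella; Olmeo, Nina; Ionta, Maria Teresa; Atzori, Francesco; Cuccuru, Gianmauro; Pitzalis, Maristella; Zoledziewska, Magdalena; Olla, Nazario; Lovicu, Mario; Pisano, Marina; Abecasis, Gonçalo R; Uda, Manuela; Tanda, Francesco; Michailidou, Kyriaki; Easton, Douglas F; Chanock, Stephen J; Hoover, Robert N; Hunter, David J; Schlessinger, David; Sanna, Serena; Crisponi, Laura; Palmieri, Giuseppe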
Despite progress in identifying genes associated with breast cancer, many more risk loci exist. Genome-wide association analyses in genetically homogeneous populations, such as that of Sardinia (Italy), could represent an additional approach to detect low-penetrance alleles. We performed a genome-wide association study comparing 1431 Sardinian patients with non-familial, BRCA1/2-mutation-negative breast cancer to 2171 healthy Sardinian blood donors. DNA was genotyped using GeneChip Human Mapping 500 K Arrays or Genome-Wide Human SNP Arrays 6.0. To increase genomic coverage, genotypes of additional SNPs were imputed using data from HapMap Phase II. After quality control filtering of genotype data, 1367 cases (9 men) and 1658 controls (1156 men) were analyzed on a total of 2,067,645 SNPs. Overall, 33 genomic regions (67 candidate SNPs) were associated with breast cancer risk at the p < 10^-6 level. Twenty of these regions contained defined genes, including one already associated with breast cancer risk: TOX3. Lowering the threshold for preliminary significance to p < 10^-5, we identified 11 additional SNPs in FGFR2, a well-established breast cancer-associated gene. Ten candidate SNPs were selected, excluding those already associated with breast cancer, for technical validation as well as replication in 1668 samples from the same population. Only SNP rs345299, located in intron 1 of VAV3, remained suggestively associated (p-value, 1.16 × 10^-5), but it did not associate with breast cancer risk in pooled data from two large, mixed-population cohorts. This study indicated the role of TOX3 and FGFR2 as breast cancer susceptibility genes in BRCA1/2-wild-type breast cancer patients from the Sardinian population. The online version of this article (doi:10.1186/s12885-015-1392-9) contains supplementary material, which is available to authorized users.
Background Allele-specific (AS) Polymerase Chain Reaction is a convenient and inexpensive method for genotyping Single Nucleotide Polymorphisms (SNPs) and mutations. It is applied in many recent studies, including population genetics, molecular genetics and pharmacogenomics. Using existing AS primer design tools is cumbersome for inexperienced users, since information about the SNP/mutation must be acquired from public databases prior to the design. Furthermore, most of these tools do not offer mismatch enhancement for designed primers. The available web applications do not provide a user-friendly graphical input interface or intuitive visualization of their primer results. Results This work presents a web-based AS primer design application called WASP. This tool can efficiently design AS primers for human SNPs as well as mutations. To assist scientists with collecting necessary information about target polymorphisms, this tool provides a local SNP database containing over 10 million SNPs of various populations from public domain databases, namely NCBI dbSNP, HapMap and JSNP. This database is tightly integrated with the tool so that users can perform the design for existing SNPs without leaving the site. To guarantee the specificity of AS primers, the proposed system incorporates a primer specificity enhancement technique widely used in experimental protocols. In particular, WASP makes use of different destabilizing effects by introducing one deliberate 'mismatch' at the penultimate (second to last) base of the 3'-end of AS primers to improve the resulting AS primers. Furthermore, WASP offers a graphical user interface based on scalable vector graphics (SVG) drawing that allows users to select SNPs and graphically visualize designed primers and their conditions. Conclusion WASP offers a tool for designing AS primers for both SNPs and mutations. By integrating the database for known SNPs (using gene ID or rs number
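The penultimate-base mismatch technique described above can be illustrated with a short sketch. It is not WASP's implementation: the function name, the default primer length and, in particular, the substitution table are hypothetical choices made for illustration only, since the abstract does not state how the mismatched base is chosen.

```python
# Hypothetical substitution table for the deliberate mismatch. The abstract
# says WASP weighs different destabilizing effects; this fixed mapping is an
# arbitrary stand-in for that choice.
MISMATCH = {"A": "C", "C": "A", "G": "T", "T": "G"}


def as_primer(upstream, allele, length=20):
    """Build a forward allele-specific primer with a penultimate mismatch.

    upstream: template sequence immediately 5' of the SNP, written 5'->3'.
    allele:   the SNP base this primer should amplify; it becomes the 3' end.
    length:   total primer length, including the allele base.
    """
    upstream = upstream.upper()
    allele = allele.upper()
    if length < 2:
        raise ValueError("primer length must be at least 2")
    if len(upstream) < length - 1:
        raise ValueError("upstream sequence is shorter than the primer body")

    body = upstream[-(length - 1):]
    if allele not in MISMATCH or any(base not in MISMATCH for base in body):
        raise ValueError("sequences may contain only A, C, G and T")

    # The last base of the body is the primer's penultimate base: replace it
    # with a non-matching base, then append the allele at the 3' end.
    return body[:-1] + MISMATCH[body[-1]] + allele
```

Calling `as_primer(flank, "G")` and `as_primer(flank, "T")` on the same flanking sequence yields the pair of primers for the two alleles of a G/T SNP; they differ only at the 3' terminal base and share the same deliberate mismatch next to it.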
Falcaro, Milena; Carpenter, James R
Population-based net survival by tumour stage at diagnosis is a key measure in cancer surveillance. Unfortunately, data on tumour stage are often missing for a non-negligible proportion of patients and the mechanism giving rise to the missingness is usually anything but completely at random. In this setting, restricting analysis to the subset of complete records gives typically biased results. Multiple imputation is a promising practical approach to the issues raised by the missing data, but its use in conjunction with the Pohar-Perme method for estimating net survival has not been formally evaluated. We performed a resampling study using colorectal cancer population-based registry data to evaluate the ability of multiple imputation, used along with the Pohar-Perme method, to deliver unbiased estimates of stage-specific net survival and recover missing stage information. We created 1000 independent data sets, each containing 5000 patients. Stage data were then made missing at random under two scenarios (30% and 50% missingness). Complete records analysis showed substantial bias and poor confidence interval coverage. Across both scenarios our multiple imputation strategy virtually eliminated the bias and greatly improved confidence interval coverage. In the presence of missing stage data complete records analysis often gives severely biased results. We showed that combining multiple imputation with the Pohar-Perme estimator provides a valid practical approach for the estimation of stage-specific colorectal cancer net survival. As usual, when the percentage of missing data is high the results should be interpreted cautiously and sensitivity analyses are recommended. Copyright © 2017 Elsevier Ltd. All rights reserved.
de Vries, Paul S; Sabater-Lleal, Maria; Chasman, Daniel I; Trompet, Stella; Ahluwalia, Tarunveer S; Teumer, Alexander; Kleber, Marcus E; Chen, Ming-Huei; Wang, Jie Jin; Attia, John R; Marioni, Riccardo E; Steri, Maristella; Weng, Lu-Chen; Pool, Rene; Grossmann, Vera; Brody, Jennifer A; Venturini, Cristina; Tanaka, Toshiko; Rose, Lynda M; Oldmeadow, Christopher; Mazur, Johanna; Basu, Saonli; Frånberg, Mattias; Yang, Qiong; Ligthart, Symen; Hottenga, Jouke J; Rumley, Ann; Mulas, Antonella; De Craen, Anton J M; Grotevendt, Anne; Taylor, Kent D; Delgado, Graciela E; Kifley, Annette; Lopez, Lorna M; Berentzen, Tina L; Mangino, Massimo; Bandinelli, Stefania; Morrison, Alanna C; Hamsten, Anders; Tofler, Geoffrey; de Maat, Moniek P M; Draisma, Harmen H M; Lowe, Gordon D; Zoledziewska, Magdalena; Sattar, Naveed; Lackner, Karl J; Völker, Uwe; McKnight, Barbara; Huang, Jie; Holliday, Elizabeth G; McEvoy, Mark A; Starr, John M; Hysi, Pirro G; Hernandez, Dena G; Guan, Weihua; Rivadeneira, Fernando; McArdle, Wendy L; Slagboom, P. Eline; Zeller, Tanja; Psaty, Bruce M; Uitterlinden, André G; de Geus, Eco J C; Stott, David J; Binder, Harald; Hofman, Albert; Franco, Oscar H; Rotter, Jerome I; Ferrucci, Luigi; Spector, Tim D; Deary, Ian J; März, Winfried; Greinacher, Andreas; Wild, Philipp S; Cucca, Francesco; Boomsma, Dorret I; Watkins, Hugh; Tang, Weihong; Ridker, Paul M; Jukema, Jan W; Scott, Rodney J; Mitchell, Paul; Hansen, Torben; O'Donnell, Christopher J; Smith, Nicholas L; Strachan, David P; Dehghan, Abbas
An increasing number of genome-wide association (GWA) studies are now using the higher resolution 1000 Genomes Project reference panel (1000G) for imputation, with the expectation that 1000G imputation will lead to the discovery of additional associated loci when compared to HapMap imputation. In
Joshi, Bhoomi B; Koringa, Prakash G; Mistry, Kinnari N; Patel, Amrut K; Gang, Sishir; Joshi, Chaitanya G
The aim of the present study is to identify functional non-synonymous SNPs of the TRPC6 gene using various in silico approaches. These SNPs are believed to have a direct impact on protein stability through conformation changes. Transient receptor potential cation channel-6 (TRPC6) is one of the proteins that plays a key role in causing focal segmental glomerulosclerosis (FSGS) associated with steroid-resistant nephrotic syndrome (SRNS). Data on TRPC6 were collected from dbSNP and further used to investigate damaging effects using SIFT, PolyPhen, PROVEAN, and PANTHER. The comparative analysis predicted that two functional SNPs, rs35857503 at position N157T and rs36111323 at position A404V, showed a damaging effect (score of 0.096-1.00). We modeled the 3D structure of TRPC6 using the SWISS-MODEL workspace and validated it via PROCHECK to get a Ramachandran plot (83.0% of residues in the most favored region, 12.7% in additionally allowed regions, 2.3% in a generously allowed region and 2.0% in a disallowed region). QMEAN (0.311) and MUSTER (10.06) scores were within acceptable limits. Putative functional SNPs that may possibly undergo post-translation modifications were also identified in the TRPC6 protein. It was found that mutation at N157T can lead to alteration in glycation whereas mutation at A404V was present at a ligand binding site. Additionally, I-Mutant showed a decrease in stability for these nsSNPs upon mutation, thus suggesting that the N157T and A404V variants of TRPC6 could directly or indirectly destabilize the amino acid interactions, causing functional deviations of the protein to some extent. Copyright © 2015 Elsevier B.V. All rights reserved.
Li, Wenzhi; Xu, Wei; Fu, Guoxing; Ma, Li; Richards, Jendai; Rao, Weinian; Bythwood, Tameka; Guo, Shiwen; Song, Qing
Rapidly growing genomic datasets present a new challenge for missing data imputation, a notoriously resource-demanding task. Haplotype imputation requires ethnicity-matched references. However, to date, haplotype references are not available for the majority of populations in the world. We explored using existing unphased genotype datasets as references; if this succeeds, it will cover almost all of the populations in the world. The results showed that our HiFi software successfully yields 99.43% accuracy with unphased genotype references. Our method provides a cost-effective solution to break through the bottleneck of limited reference availability for haplotype imputation in the big data era. Copyright © 2015 Elsevier B.V. All rights reserved.
Chinomona, Amos; Mwambi, Henry
Missing data are a common feature in many areas of research especially those involving survey data in biological, health and social sciences research. Most of the analyses of the survey data are done taking a complete-case approach, that is taking a list-wise deletion of all cases with missing values assuming that missing values are missing completely at random (MCAR). Methods that are based on substituting the missing values with single values such as the last value carried forward, the mean and regression predictions (single imputations) are also used. These methods often result in potential bias in estimates, in loss of statistical information and in loss of distributional relationships between variables. In addition, the strong MCAR assumption is not tenable in most practical instances. Since missing data are a major problem in HIV research, the current research seeks to illustrate and highlight the strength of multiple imputation procedure, as a method of handling missing data, which comes from its ability to draw multiple values for the missing observations from plausible predictive distributions for them. This is particularly important in HIV research in sub-Saharan Africa where accurate collection of (complete) data is still a challenge. Furthermore the multiple imputation accounts for the uncertainty introduced by the very process of imputing values for the missing observations. In particular national and subgroup estimates of HIV prevalence in Zimbabwe were computed using multiply imputed data sets from the 2010-11 Zimbabwe Demographic and Health Surveys (2010-11 ZDHS) data. A survey logistic regression model for HIV prevalence and demographic and socio-economic variables was used as the substantive analysis model. The results for both the complete-case analysis and the multiple imputation analysis are presented and discussed. Across different subgroups of the population, the crude estimates of HIV prevalence are generally not identical but their
Kontopantelis, Evangelos; Parisi, Rosa; Springate, David A; Reeves, David
In modern health care systems, the computerization of all aspects of clinical care has led to the development of large data repositories. For example, in the UK, large primary care databases hold millions of electronic medical records, with detailed information on diagnoses, treatments, outcomes and consultations. Careful analyses of these observational datasets of routinely collected data can complement evidence from clinical trials or even answer research questions that cannot be addressed in an experimental setting. However, 'missingness' is a common problem for routinely collected data, especially for biological parameters over time. Absence of complete data for the whole of an individual's study period is a potential bias risk and standard complete-case approaches may lead to biased estimates. However, the structure of the data values makes standard cross-sectional multiple-imputation approaches unsuitable. In this paper we propose and evaluate mibmi, a new command for cleaning and imputing longitudinal body mass index data. The regression-based data cleaning aspects of the algorithm can be useful when researchers analyze messy longitudinal data. Although the multiple imputation algorithm is computationally expensive, it performed similarly to or even better than existing alternatives when interpolating observations. The mibmi algorithm can be a useful tool for analyzing longitudinal body mass index data, or other longitudinal data with very low individual-level variability.
Combining data from genome-wide association studies (GWAS) conducted at different locations, using genotype imputation and fixed-effects meta-analysis, has been a powerful approach for dissecting complex disease genetics in populations of European ancestry. Here we investigate the feasibility of applying the same approach in Africa, where genetic diversity, both within and between populations, is far more extensive. We analyse genome-wide data from approximately 5,000 individuals with severe malaria and 7,000 population controls from three different locations in Africa. Our results show that the standard approach is well powered to detect known malaria susceptibility loci when sample sizes are large, and that modern methods for association analysis can control the potential confounding effects of population structure. We show that the pattern of association around the haemoglobin S allele differs substantially across populations due to differences in haplotype structure. Motivated by these observations we consider new approaches to association analysis that might prove valuable for multicentre GWAS in Africa: we relax the assumptions of SNP-based fixed effect analysis; we apply Bayesian approaches to allow for heterogeneity in the effect of an allele on risk across studies; and we introduce a region-based test to allow for heterogeneity in the location of causal alleles.
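The "standard approach" named above, fixed-effects meta-analysis, is commonly implemented as an inverse-variance weighted average of per-study effect estimates. The sketch below assumes that form; the function name `fixed_effect_meta` and the example inputs are hypothetical and are not the authors' software. Cochran's Q is included as a simple indicator of the between-study heterogeneity that motivates the alternatives described in the abstract.

```python
import math


def fixed_effect_meta(betas, ses):
    """Inverse-variance weighted fixed-effects meta-analysis for one SNP.

    betas: per-study effect estimates (e.g. log odds ratios)
    ses:   per-study standard errors
    Returns (pooled_beta, pooled_se, z, two_sided_p, cochran_q).
    """
    if len(betas) != len(ses) or not betas:
        raise ValueError("betas and ses must be non-empty and equally long")
    weights = [1.0 / se ** 2 for se in ses]
    total = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total
    se = math.sqrt(1.0 / total)
    z = beta / se
    # Two-sided p-value under a standard normal null distribution.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    # Cochran's Q: weighted squared deviations from the pooled effect.
    q = sum(w * (b - beta) ** 2 for w, b in zip(weights, betas))
    return beta, se, z, p, q
```

Calling `fixed_effect_meta([0.2, 0.4], [0.1, 0.1])` pools two equally precise hypothetical studies; a large Q relative to a chi-square distribution with (number of studies minus 1) degrees of freedom suggests the fixed-effect assumption is questionable.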
de Vet Henrica CW
Background: In prognostic studies, model instability and missing data can be troubling factors. Proposed methods for handling these situations are bootstrapping (B) and multiple imputation (MI). The authors examined the influence of these methods on model composition. Methods: Models were constructed using a cohort of 587 patients consulting between January 2001 and January 2003 with a shoulder problem in general practice in the Netherlands (the Dutch Shoulder Study). Outcome measures were persistent shoulder disability and persistent shoulder pain. Potential predictors included socio-demographic variables, characteristics of the pain problem, physical activity and psychosocial factors. Model composition and performance (calibration and discrimination) were assessed for models using a complete case analysis, MI, bootstrapping or both MI and bootstrapping. Results: Results showed that model composition varied between models as a result of how missing data were handled and that bootstrapping provided additional information on the stability of the selected prognostic model. Conclusion: In prognostic modeling, missing data need to be handled by MI, and bootstrap model selection is advised in order to provide information on model stability.
Reservoirs are important for households and impact the national economy. This paper proposes a time-series forecasting model based on estimating missing values followed by variable selection to forecast the reservoir's water level. This study collected data from the Taiwan Shimen Reservoir as well as daily atmospheric data from 2008 to 2015. The two datasets are concatenated, based on the ordering of the data, into an integrated research dataset. The proposed time-series forecasting model has three foci. First, this study uses five imputation methods to directly delete the missing value. Second, we identified the key variable via factor analysis and then deleted the unimportant variables sequentially via the variable selection method. Finally, the proposed model uses a Random Forest to build the forecasting model of the reservoir's water level. This was done to compare with the listing method under the forecasting error. These experimental results indicate that the Random Forest forecasting model when applied to variable selection with full variables has better forecasting performance than the listing model. In addition, this experiment shows that the proposed variable selection can help determine five forecast methods used here to improve the forecasting capability.
Background: Linkage disequilibrium (LD) mapping is commonly used to evaluate markers for genome-wide association studies. Most types of LD software focus strictly on LD analysis and visualization, but lack supporting services for genotyping. Results: We developed a freeware called LD2SNPing, which provides a complete package of mining tools for genotyping and LD analysis environments. The software provides SNP ID- and gene-centric online retrievals for SNP information and tag SNP selection from dbSNP/NCBI and HapMap, respectively. Restriction fragment length polymorphism (RFLP) enzyme information for SNP genotyping is available for all SNP IDs and tag SNPs. Single and multiple SNP inputs are possible in order to perform LD analysis by online retrieval from HapMap and NCBI. An LD statistics section provides D, D', r2, δQ, ρ, and the P values of the Hardy-Weinberg Equilibrium for each SNP marker, and Chi-square and likelihood-ratio tests for the pair-wise association of two SNPs in LD calculation. Finally, 2D and 3D plots, as well as plain-text output of the results, can be selected. Conclusion: LD2SNPing thus provides a novel visualization environment for multiple SNP input, which facilitates SNP association studies. The software, user manual, and tutorial are freely available at http://bio.kuas.edu.tw/LD2NPing.
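Three of the statistics listed above (D, D' and r2) have standard textbook definitions for two biallelic SNPs. The sketch below computes them from a haplotype frequency and two allele frequencies; it illustrates those definitions only and is not LD2SNPing's code. The function name `ld_stats` and its argument names are hypothetical, and phased haplotype frequencies are assumed to be available.

```python
def ld_stats(p_ab, p_a, p_b):
    """Pairwise LD between two biallelic SNPs.

    p_ab: frequency of the haplotype carrying allele A and allele B
    p_a:  frequency of allele A at the first SNP
    p_b:  frequency of allele B at the second SNP
    Returns (D, D_prime, r_squared); D_prime keeps the sign of D.
    """
    if not (0.0 < p_a < 1.0 and 0.0 < p_b < 1.0):
        raise ValueError("both SNPs must be polymorphic")
    d = p_ab - p_a * p_b
    if d > 0:
        d_max = min(p_a * (1.0 - p_b), (1.0 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1.0 - p_a) * (1.0 - p_b))
    # d_max is strictly positive for polymorphic SNPs, so the division is safe.
    d_prime = d / d_max
    r_squared = d * d / (p_a * (1.0 - p_a) * p_b * (1.0 - p_b))
    return d, d_prime, r_squared
```

With `ld_stats(0.4, 0.5, 0.5)` the haplotype is more common than expected under independence (0.25), so D is positive. Many tools report the absolute value of D'; the signed form is kept here so the direction of association is visible.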
Ward Judson A
Background: Rapid development of highly saturated genetic maps aids molecular breeding, which can accelerate gain per breeding cycle in woody perennial plants such as Rubus idaeus (red raspberry). Recently, robust genotyping methods based on high-throughput sequencing were developed, which provide high marker density, but result in some genotype errors and a large number of missing genotype values. Imputation can reduce the number of missing values and can correct genotyping errors, but current methods of imputation require a reference genome and thus are not an option for most species. Results: Genotypin